Re: x86 status was Re: -mm merge plans for 2.6.23

To: <linux-kernel@...>
Date: Tuesday, July 10, 2007 - 4:31 am

When replying, please rewrite the subject suitably and try to Cc: the
appropriate developer(s).

add-lzo1x-algorithm-to-the-kernel.patch
make-common-helpers-for-seq_files-that-work-with-list_head-s.patch
lots-of-architectures-enable-arbitary-speed-tty-support.patch

Merge

serial-assert-dtr-for-serial-console-devices.patch

Don't know. I worry about Russell's concern (see the changelog)

git-acpi-s390-struct-bin_attribute-changes.patch
cpuidle-add-rating-to-the-governors-and-pick-the-one-with-highest-rating-by-default-fix.patch
exit-acpi-processor-module-gracefully-if-acpi-is-disabled.patch
fix-empty-macros-in-acpi.patch
drivers-acpi-sbsc-remove-dead-code.patch
acpi-enable-c3-power-state-on-dell-inspiron-8200.patch
drivers-acpi-pci_linkc-lower-printk-severity.patch

Sent to lenb

working-3d-dri-intel-agpko-resume-for-i815-chip.patch

Sent to davej

cifs-use-simple_prepare_write-to-zero-page-data.patch
cifs-zero_user_page-conversion.patch

Sent to sfrench

bugfix-cpufreq-in-combination-with-performance-governor.patch
restore-previously-used-governor-on-a-hot-replugged-cpu.patch

Sent to davej

kcopyd-use-mutex-instead-of-semaphore.patch

Sent to agk

powerpc-promc-remove-undef-printk.patch
8xx-mpc885ads-pcmcia-support.patch
dts-kill-hardcoded-phandles.patch
ppc-remove-dead-code-for-preventing-pread-and-pwrite-calls.patch
viotape-use-designated-initializers-for-fops-member.patch
make-drivers-char-hvc_consoleckhvcd-static.patch
powerpc-enable-arbitary-speed-tty-ioctls-and-split.patch
powerpc-tlb_32c-build-fix.patch
sky-cpu-and-nexus-code-style-improvement.patch
sky-cpu-and-nexus-include-ioh.patch
sky-cpu-and-nexus-check-for-platform_get_resource-ret.patch
sky-cpu-and-nexus-check-for-create_proc_entry-ret-code.patch
sky-cpu-use-c99-style-for-struct-init.patch

Sent to paulus

revert-gregkh-driver-block-device.patch
driver-core-check-return-code-of-sysfs_create_link.patch
driver-core-coding-style-cleanup.patch
pm-do-not-use-saved_state-from-str...

To: Andrew Morton <akpm@...>
Cc: <linux-kernel@...>, Mel Gorman <mel@...>
Date: Tuesday, July 10, 2007 - 4:09 pm

As far as I can tell the antifrag patches are stable and are significantly
enhancing various aspects of the VM and also make it more reliable. SLUB
can use it to increase scalability. MM has been using order 3 allocs via
SLUB for months now without a problem. Without the antifrag patches order
1 allocs could cause OOMs.

It opens the door for functionality that we wanted for a long time such as
memory unplug etc.

-
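For readers less familiar with the "order" terminology used above: an order-N allocation is a block of 2^N physically contiguous base pages. A minimal sketch of that arithmetic follows, assuming 4 KiB base pages (true on x86, not on every architecture). The function names are illustrative only and are not kernel APIs.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB base pages, as on x86


def order_to_pages(order: int) -> int:
    """Number of physically contiguous base pages in an order-N allocation."""
    if order < 0:
        raise ValueError("order must be non-negative")
    return 1 << order


def order_to_bytes(order: int, page_size: int = PAGE_SIZE) -> int:
    """Size in bytes of an order-N allocation for the given base page size."""
    return order_to_pages(order) * page_size


if __name__ == "__main__":
    for order in range(4):
        kib = order_to_bytes(order) // 1024
        print(f"order {order}: {order_to_pages(order)} pages, {kib} KiB")
```

Under that assumption, the order-3 allocations mentioned above need 8 contiguous pages (32 KiB) each, which is why fragmentation matters far more for them than for order-0 requests.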

To: Christoph Lameter <clameter@...>
Cc: Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Wednesday, July 11, 2007 - 5:42 am

SLUB using high orders without page allocation failures does depend on
two very questionable patches that I brought to attention in my initial
merge mail. If grouping pages by mobility goes through, I'll be revisiting
that properly to make sure it can work without deadlocking ever under any
circumstances. Right now, it theoretically could livelock although I've
never been able to reproduce it.

The patches as they are will work for high-order allocations if you are
willing to wait and reclaim memory. The more stressful users need more
effort but it's already been shown that it can be made work with one

And I want to avoid a catch-22 here where the features that depend on
grouping pages by mobility have to exist before grouping pages by
mobility is pushed through.

I would like the patches to go through on the grounds that higher order
allocations can succeed. However, I am also happy to say that order-0
pages should be used as much as possible, that case should always be
made as fast as possible and the world must not end if a high-order
allocation fails.

--
Mel Gorman
Part-time PhD Student, University of Limerick
Linux Technology Center, IBM Dublin Software Lab
-

To: Mel Gorman <mel@...>
Cc: Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Wednesday, July 11, 2007 - 1:49 pm

SLUB can easily be made to not use higher order pages.

If the SLUB mobility patches are not merged then higher order page use
can be explicitly enabled via passing the following to the kernel on boot

slub_max_order=<desired max order>

If they are merged then the higher order page use can be disabled in case
of trouble via

slub_max_order=0
-

To: Andrew Morton <akpm@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, <tilman@...>
Date: Friday, July 13, 2007 - 5:46 am

Or replace by his suggestion patch ( http://lkml.org/lkml/2007/5/31/222 )

Jan
--
-

To: Jan Engelhardt <jengelh@...>
Cc: Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Friday, July 13, 2007 - 7:09 pm

That posting was just a change proposal for the drivers/isdn/Kconfig
part, not a complete replacement for the entire patch. If you'd care to
reissue that patch with the modification I proposed, I'll gladly ack it.
Alternatively I can also send a full replacement patch if you prefer.

Regards,
Tilman

--
Tilman Schmidt E-Mail: tilman@imap.cc
Bonn, Germany
This message consists of 100% recycled bits.
Unopened, best before: (see reverse side)

To: Tilman Schmidt <tilman@...>
Cc: Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Saturday, July 14, 2007 - 6:02 am

Since I did not really see much of a difference between our two
approaches, I'd be grateful if you could send a full replacement in
the hopes that I see the global picture.

Thanks,
Jan
--
-

To: Andrew Morton <akpm@...>
Cc: <linux-kernel@...>
Date: Tuesday, July 10, 2007 - 5:04 am

Here's some fix:

Signed-off-by: Jan Engelhardt <jengelh@gmx.de>

---
arch/x86_64/Kconfig | 20 ++++++++++----------
1 file changed, 10 insertions(+), 10 deletions(-)

Index: linux-2.6.22-rc6/arch/x86_64/Kconfig
===================================================================
--- linux-2.6.22-rc6.orig/arch/x86_64/Kconfig
+++ linux-2.6.22-rc6/arch/x86_64/Kconfig
@@ -753,11 +753,11 @@ config DMAR
depends on PCI_MSI && ACPI && EXPERIMENTAL
default y
help
- DMA remapping(DMAR) devices support enables independent address
- translations for Direct Memory Access(DMA) from Devices.
+ DMA remapping (DMAR) devices support enables independent address
+ translations for Direct Memory Access (DMA) from devices.
These DMA remapping devices are reported via ACPI tables
- and includes pci device scope covered by these DMA
- remapping device.
+ and include PCI device scope covered by these DMA
+ remapping devices.

config DMAR_GFX_WA
bool "Support for Graphics workaround"
@@ -765,9 +765,9 @@ config DMAR_GFX_WA
default y
help
Current Graphics drivers tend to use physical address
- for DMA and avoid using DMA api's. Setting this config
+ for DMA and avoid using DMA APIs. Setting this config
option permits the IOMMU driver to set a unity map for
- all the OS visible memory. Hence the driver can continue
+ all the OS-visible memory. Hence the driver can continue
to use physical addresses for DMA.

config DMAR_FLOPPY_WA
@@ -775,10 +775,10 @@ config DMAR_FLOPPY_WA
depends on DMAR
default y
help
- Floppy disk drivers are know to by pass dma api calls
- their by failing to work when IOMMU is enabled. This
- work around will setup a 1 to 1 mappings for the first
- 16M to make floppy(isa device) work.
+ Floppy disk drivers are know to bypass DMA API calls
+ thereby failing to work when IOMMU is enabled. This
+ workaround will setup a 1:1 mapping for the first
+ 16M to make floppy (an ISA device...

To: Andrew Morton <akpm@...>
Cc: <linux-kernel@...>, <tglx@...>, <jeremy@...>, Tim Hockin <thockin@...>, <jesse.barnes@...>
Date: Wednesday, July 11, 2007 - 8:43 am

Xen is probably going to be merged. I'm still not fully happy
about the review status of the drivers and xenbus, but there doesn't seem
to be much value in delaying it further.

I'll consolidate the fixes and fixes-to-fixes.

It's still not clear to me that this is of any use. The current code
can run a program on MCE which should be really fast enough

Might need more testing?

I'm sceptical about the dynticks code. It just rips out the
x86-64 timing code completely, which needs a lot more review and testing.
Probably not .23

-Andi
-

To: Andi Kleen <andi@...>
Cc: Andrew Morton <akpm@...>, <linux-kernel@...>, <tglx@...>, <jeremy@...>, Tim Hockin <thockin@...>, <jesse.barnes@...>
Date: Thursday, July 12, 2007 - 3:33 pm

^^^ That patch was supposed to be merged for 2.6.22 (you told me you
forgot to merge it) and has been for a long time in mm. Does it now
need to be rereviewed for 2.6.23? The other pieces of the quicklist patch
for core and other arches were merged for 2.6.22.
-

To: Christoph Lameter <clameter@...>
Cc: Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, <tglx@...>, <jeremy@...>, Tim Hockin <thockin@...>, <jesse.barnes@...>
Date: Thursday, July 12, 2007 - 4:38 pm

It's just on the normal re-review list. But it'll likely go in.

-Andi
-

To: Andi Kleen <andi@...>
Cc: Andrew Morton <akpm@...>, <linux-kernel@...>, <tglx@...>, Tim Hockin <thockin@...>, <jesse.barnes@...>, Adrian Bunk <bunk@...>, dave young <hidave.darkstar@...>
Date: Wednesday, July 11, 2007 - 2:14 pm

The first and third of these are just simple Kconfig updates, and the
middle one just updates the list of symbols which shouldn't be warned
about in CONFIG_RELOCATABLE's absolute symbol check. They're completely

This appears to fix a real bug; the only question is whether x86-64
needs the same treatment. I'm not sure if the original bug reporter
(dave young) has confirmed it fixed his problem.

J
-

To: Andi Kleen <andi@...>
Cc: Andrew Morton <akpm@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Arjan van de Ven <arjan@...>, Linus Torvalds <torvalds@...>, Chris Wright <chrisw@...>
Date: Wednesday, July 11, 2007 - 1:42 pm

What you just did here is a slap in the face to a lot of contributors
who worked hard on this code :(

Let me tell you about the history of this project first. Arjan wrote the
first version of it a year ago, and it was added to -rt and tested there
by many people and went through many iterations and fixes. Chris Wright
then created a x86_64 clockevents cleanup and dynticks enabling patchset
from it this spring and sent it to lkml three and a half months ago, on
March 31:

http://lwn.net/Articles/229094/

Thomas, the high resolution timers and clockevents maintainer,
immediately picked up Chris' splitup/splitout/cleanup work and fixed and
extended it, and sent a first cut to lkml on May 6th:

http://lwn.net/Articles/233226/

Thomas then sent an updated version of the x86_64 clockevents cleanup
and dynticks code to lkml (on June 10th), for a second round of review:

http://lwn.net/Articles/237687/

As Thomas stated it in his submission:

" The patch set has been tested in the -hrt and -rt trees for quite a
while and the initial problems have been sorted out. Thanks to the
folks from the PowerTop project for testing and feedback. "

Then on June 16th Thomas sent the third series:

http://lwn.net/Articles/238834/

(which too was in -rt and was tested there on numerous machines. It was
also added to -mm.)

Then on June 23rd Thomas sent the fourth series of the x86_64
clockevents and dynticks code:

http://lwn.net/Articles/239620/

We finally have someone (Thomas) with core kernel clue who actually
_cares_ about the x86 time code and does not see it as an ugly chore,
one who collects the right patches and maintains the -hrt tree and
co-maintains the -rt tree and interacts with other contributors. What he
did was _hard_ to do but we are making really good progress:

http://lkml.org/lkml/2007/7/5/242

" All in all, personally I'm very happy to see Linux making such a
huge step forward with tickless and can't w...

To: Ingo Molnar <mingo@...>
Cc: Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Wednesday, July 11, 2007 - 5:42 pm

Ingo, I'm sorry to say so, but your answer just convinced me that you're
wrong, and we MUST NOT take that code.

That was *exactly* the same thing you talked about when I refused to take
the original timer changes into 2.6.20. You were talking about how lots of
people had worked really hard, and how it was really tested.

And it damn well was NOT really tested, and 2.6.21 ended up being a
horribly painful experience (one of the more painful kernel releases in
recent times), and we ended up having to fix a *lot* of stuff.

And you admitted you were wrong at the time.

Now you do the *exact* same thing.

Here's a big clue: it doesn't matter one _whit_ how much face-slapping you
get, or how much effort some programmers have put into the code. It's
untested. And no, we are *not* going to do another "rip everything out,
and replace it with new code" again.

Over my dead body.

We're going to do this thing gradually, or not at all.

And if somebody feels slighted by the face-slap, and thinks he has already
done enough, and isn't interested in doing it gradually, then good
riddance. The "not at all" seems like a good idea, and maybe we can
re-visit this in a year or two.

I'm not going to have another 2.6.21 on my hands.

Linus
-

To: Linus Torvalds <torvalds@...>
Cc: Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Wednesday, July 11, 2007 - 7:19 pm

yes - i was (way too!) upset about it, and your reasoning for the
rejection was hard (on us) but fair: you wanted a quiet 2.6.20, and you

yes. We had 12 -hrt/dynticks merge related regressions between
2.6.21-rc1 and -final, and 4 after final. Here's a quick post-mortem:

12 fixes after -rc1:

[PATCH] i386: Fix bogus return value in hpet_next_event()
[PATCH] clockevents: remove bad designed sysfs support for now
[PATCH] clocksource: Fix thinko in watchdog selection
[PATCH] dynticks: fix hrtimer rounding error in next_timer_interrupt
[PATCH] i386: add command line option "local_apic_timer_c2_ok"
[PATCH] i386: disable local apic timer via command line or dmi quirk
[PATCH] i386: clockevents fix breakage on Geode/Cyrix PIT
[PATCH] i386: trust the PM-Timer calibration of the local APIC timer
[PATCH] clockevents: Fix suspend/resume to disk hangs
[PATCH] highres: do not run the TIMER_SOFTIRQ after switching to highres mode
[PATCH] hrtimer: prevent overrun DoS in hrtimer_forward()
[PATCH] Save/restore periodic tick information over suspend/resume implementations

4 fixes after -final:

2.6.21.1: -
2.6.21.2:
[PATCH] clocksource: fix resume logic
2.6.21.3: -
2.6.21.4: -
2.6.21.5:
[PATCH] NOHZ: Rate limit the local softirq pending warning output
[PATCH] Ignore bogus ACPI info for offline CPUs
[PATCH] i386: HPET, check if the counter works
2.6.21.6: -

it's all pretty quiet today on the dynticks regressions front. (there
are no open regressions in either the upstream i386 code or in the devel
patches we are aware of. Forced-HPET in -mm, which is not part of this
queue in question [but which is done for dynticks], has one open
regression.)

The majority of the above bugs were in the infrastructure code. (the
worst was the generic resume/suspend one fixed in 2.6.21.2) And sadly, a
fair number of the infrastructure bugs we introduced during the frenetic
clockevents/dynticks rewrites/redesigns we did...

To: Ingo Molnar <mingo@...>
Cc: Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Wednesday, July 11, 2007 - 7:45 pm

One thing I'll happily talk about is that while 2.6.21 was painful, you
and Thomas in particular were both very responsible about the thing, so
no, I'm not at all complaining or worried about it in that sense!

I just really _really_ wish we could have two fairly stable releases in a
row. I think 2.6.22 has the potential to be a pretty good setup, and I'd
really like to avoid having another 2.6.21 immediately afterwards.

So I'm not worried about integration and getting fixes when things break
per se, but I *am* worried that this is an area where we've traditionally
had lots of unexpected problems.

And hey, maybe this time there will be none. I just still smart from the
last time, so I'd prefer it to go more smoothly this time around.

Linus
-

To: Linus Torvalds <torvalds@...>
Cc: Ingo Molnar <mingo@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Wednesday, July 11, 2007 - 6:04 pm

Linus,

Can you please shed some light on how exactly you switch an
architecture gradually to clock events?

You simply cannot convert the PIT today and the HPET next week, followed by
the local APIC in three months.

I have no problem to brew this for some more time. I got not repulsed by
the 2.6.20 decision, but I have no clue how to communicate with a black
hole.

tglx

-

To: Thomas Gleixner <tglx@...>
Cc: Ingo Molnar <mingo@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Wednesday, July 11, 2007 - 6:20 pm

For example, we can make sure that the code in question that actually
touches the hardware stays exactly the same, and then just move the
interfaces around - and basically guarantee that _zero_ hardware-specific
issues pop up when you switch over, for example.

That way there is a gradual change-over.

The other approach (which would be nice _too_) is to actually try to
convert one clock source at a time. Why is that not an option?

Linus

-

To: Linus Torvalds <torvalds@...>
Cc: Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Wednesday, July 11, 2007 - 6:51 pm

That's not quite right. Leaving the code unchanged caused breakage
already. The PIT is damn stupid and can be sensitive to how quickly it's
programmed. So code that enable/disable didn't change, but frequency

It was that way for x86_64, that's the first thing I fixed (since it was
done by fully disabling all other timers but the one converted ;-)
-

To: Chris Wright <chrisw@...>
Cc: Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>
Date: Wednesday, July 11, 2007 - 6:58 pm

Sure. We cannot avoid *all* problems. Bugs happen.

But at least we could try to make sure that there aren't totally
unnecessary changes in that switch-over patch. Which there definitely
were, as far as I can tell.

Linus
-

To: Linus Torvalds <torvalds@...>
Cc: Chris Wright <chrisw@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Wednesday, July 11, 2007 - 10:53 pm

one note is that the "talk differently to hardware" thing is in part
already tested with the 32 bit tickless code; a lot of people (80% ?)
are still using the 32 bit OS on their 64 bit machines, and the 32 bit
code already talks in the "new way" to this hardware....
(and since Fedora 7 already ships tickless for 32 bit there are quite a
lot of people using that in practice, in addition to the kernel.org
kernel users)

I would expect just about all the hardware interaction issues to have
popped up already because of this "run 32 bit on 64 bit hardware" thing.
--
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org

-

To: Linus Torvalds <torvalds@...>
Cc: Ingo Molnar <mingo@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Wednesday, July 11, 2007 - 6:50 pm

Linus,

We need to give control to the clock events core code once we convert
one clock event device. Having two competing subsystems controlling
different devices (e.g. PIT and APIC) is not really desirable.

The HPET change, which is the larger part of the conversion set simply
because we now share the code with i386, might be split out by disabling
HPET in the first step, doing the PIT / APIC conversion and then the
HPET one in a separate step.

Thanks,

tglx

-

To: Thomas Gleixner <tglx@...>
Cc: Ingo Molnar <mingo@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Wednesday, July 11, 2007 - 7:07 pm

But that misses the point. It means that the commit that actually
*changes* the code never actually gets tested on its own

Why not just fix up the HPET code so that it can be shared *first*.
Without the other conversion? Really - What's so wrong with the hpet.c
changes in the *absence* of conversion to clockevents? Those changes seem
to be totally independent - just abstracting out the
"hpet_get_virt_address()" stuff etc.

None of that has anything to do with clockevents, as far as I can see.

In other words, you now change a i386-only file, and maybe it breaks
subtly on i386 as a result. Wouldn't it be nicer to see that breakage as a
separate event?

Then, the x86-64 clockevents code will switch over entirely, but now it
switches over to something we can say has gotten testing, and we know the
switch-over won't break any 32-bit code, because the switch-over literally
didn't change anything at all for that case.

See? THAT is what I mean by "gradual". Bugs happen, but if we can make
_independent_ bugs show up in _independent_ commits, that will make it
much easier to figure out what happened.

The same is true of a lot of the APIC timer code. Sure, that patch has the
actual conversion in it, and you don't have the cross-architecture issues,
but more than 50% of the patch seems to be just cleanup that is
independent of the actual switch-over, no?

Again, if it was done as a "one patch for cleanup, and another patch that
actually switches the higher-level interfaces around", then the two mostly
independent issues (of "hardware access/initialization" vs "higher-level
changes in how it got called") get done as two independent commits.

And no, I really probably wouldn't ask for this, but 2.6.21 showed
*exactly* this problem. Trivial debugging helps like "git bisect" didn't
help at all, because all the problems started when the new code was
"activated", not when it was actually brought in.

Linus
-
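The bisection argument in the message above can be made concrete with a small model. The sketch below is only an illustration under simple assumptions: a tree misbehaves once a faulty change is both present and actually running, and the search behaves like `git bisect` (good before some point, bad from there on). The commit subjects are hypothetical and do not correspond to real patches in the series.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Commit:
    name: str                 # hypothetical commit subject, not a real patch title
    adds_bug: bool = False    # this commit contains the faulty change
    activates: bool = False   # this commit makes new code actually run


def tree_is_bad(history: Sequence[Commit], index: int) -> bool:
    """A tree misbehaves once the faulty code is both present and running."""
    upto = history[: index + 1]
    return any(c.adds_bug for c in upto) and any(c.activates for c in upto)


def first_bad(history: Sequence[Commit]) -> Commit:
    """Binary search for the first bad tree, the way a bisection narrows it down.

    Assumes the last tree is bad and that trees never go from bad back to good.
    """
    lo, hi = 0, len(history) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if tree_is_bad(history, mid):
            hi = mid
        else:
            lo = mid + 1
    return history[lo]


# New code is brought in dormant and enabled by one final commit.
BIG_BANG = [
    Commit("add shared HPET code", adds_bug=True),
    Commit("add APIC timer cleanup"),
    Commit("add PIT clockevents code"),
    Commit("switch x86_64 over to clockevents", activates=True),
]

# Each commit converts one device and enables it then and there.
GRADUAL = [
    Commit("convert PIT", activates=True),
    Commit("convert HPET", adds_bug=True, activates=True),
    Commit("convert APIC timer", activates=True),
]

if __name__ == "__main__":
    print("big bang:", first_bad(BIG_BANG).name)
    print("gradual: ", first_bad(GRADUAL).name)
```

In this toy model the big-bang history bisects to the switch-over commit even though the fault came in three commits earlier, while the gradual history bisects straight to the commit that introduced it.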

To: Linus Torvalds <torvalds@...>
Cc: Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Wednesday, July 11, 2007 - 7:36 pm

I don't think it's that much cleanup. One of my goals for x86-64 was always
to have it support modern x86 only; this means in particular most of the
old bug workarounds removed. With the APIC timer merging a lot of that crap
gets back in.

I would prefer to keep APIC code separate.

-Andi
-

To: Andi Kleen <andi@...>
Cc: Linus Torvalds <torvalds@...>, Thomas Gleixner <tglx@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Wednesday, July 11, 2007 - 7:58 pm

i dont think "clean, modern x86 code" will ever happen - x86_64 has and
is going to have the exact same type of crap. And i'll say a weird thing
now: that is a _blessing_. Why? Because this crap in question originates
from the _diversity_ of the platform, and that is a much larger asset
than the cost of the quirks can ever be!

What you suggest does not end up in "clean 64-bit code", it ends up in
"a bit less crappy 64-bit code", plus a lot of unnecessary duplication
of effort and duplication of code - which easily introduces more crap
total than it gets rid of ...

The x86 architecture isnt fully analogous to a random piece of device
hardware that evolves. It is more of a collector of random pieces of
hardware that evolve independently, and as such it will always be
exposed to human messups in a factorized way. "The pristine, clean
architecture" is a utopia and it will never come as long as humans design
hardware.

Under your scheme what we'll end up with is two sets of code which share
some of the workarounds and dont share some others. No, in fact we _already_
ended up with two sets of code that are crappy in different ways. We had
countless cases of bugs fixed in i386 but not fixed in x86_64. (and vice
versa) Sharing code for similar hardware is almost always good.

I think the PowerPC experience (although it is not a fully equivalent
case) about them merging their 32-bit and 64-bit architectures was an
overwhelmingly positive move, and x86 could learn a thing or two from
that.

The only way to fight crappy hardware is to map it, to understand it and
to design as cleanly in the presence of it as possible. Having two sets
of code for the same thing hardly serves that purpose. In fact, having
_more_ crappy hardware _forces_ us to do a cleaner design (up to a pain
threshold).

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: Andi Kleen <andi@...>, Linus Torvalds <torvalds@...>, Thomas Gleixner <tglx@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Wednesday, July 11, 2007 - 8:07 pm

Yes, but it will be new crap, but no old crap anymore.

If you always pile the new crap on the old crap at some point
the whole thing might fall over. 64bit was intended as a fresh start.

Admittedly we're getting more and more workarounds too and sometimes
when I want to remove cruft i find out it is still needed on some
64bit boxes (e.g. see my repeated attempts to clean up the irq 0

The equivalent to the powerpc way would be essentially to re-port i386
into the x86-64 code base and leave the really old hardware only
in arch/i386. I've considered doing it, but it would be an awful
lot of work and to tempt distributions to actually use the new
port would require going back quite a long time. And at least
immediately it would end up with three cases to do things instead
of two like currently.

-Andi

-

To: Andi Kleen <andi@...>
Cc: Linus Torvalds <torvalds@...>, Thomas Gleixner <tglx@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Wednesday, July 11, 2007 - 8:18 pm

I think there's no such thing as a fresh start for a diverse
architecture - the ia64 failure has proven that. x86_64 CPUs still do
A20 emulation today (!). We still have people running industrial boards
on real i386 DX CPUs, with the latest upstream kernel. 15 years ago an
i386 DX was already quite obsolete. 32-bit is not going to go away in
our lifetime, and we'll want to support it in a first-grade way. We
better realize that prospect and have it right before our eyes in a
single tree wherever it makes sense to share code - i'm certainly not
talking about sharing mtrr/centaur.c or k8.c. (and i'm not necessarily
suggesting to share io_apic.c either - although it's certainly
borderline.)

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: Andi Kleen <andi@...>, Linus Torvalds <torvalds@...>, Thomas Gleixner <tglx@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Wednesday, July 11, 2007 - 8:37 pm

x86-64 doesn't care about a lot of x86 baggage and a lot of
things have been even obsoleted in the platform.

In practice the backwards compatibility on x86 isn't that
great either. For example a significant number of new systems don't

Yes, but those for example would be perfectly happy with an arch/i386
with all APIC and SMP code stripped out.

Only the few people who still run dual P5s might not, but
those could continue using old kernels.

But eventually I think that would be the right clean way:

arch/i386   stripped down port for truly old systems like the embedded 386
            up to 586 or early 686. No SMP or APIC.
arch/x86    supporting 32bit and 64bit for reasonably modern systems.
            NUMAQ/Voyager/P5-SMP/visual workstation gone [frankly the user
            base of those is too small to justify the code impact]

It's just quite ugly to get there and when you think through
the actual advantages of such a setup they are likely not enough to
justify the significant work to make it work.

Also I wouldn't have any idea how to regression test significant
changes to arch/i386 aimed at old systems. e.g. I don't think
the powerpc people actually tried to still support really
old systems where it is hard to do regression tests anymore,
only really supported platforms.

So while such a setup would be quite nice the practical
problems of getting there are nasty. Also I must admit I prefer
hacking on new code instead.

-Andi
-

To: Andi Kleen <andi@...>
Cc: Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Thomas Gleixner <tglx@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Wednesday, July 11, 2007 - 8:15 pm

Well that's just silly. The right way will never create 3 ways, but
always keep the limit to the existing 2 where the differences aren't
worth reconciling, and 1 for anything that is common.

It will be a fair amount of work, so any constructive input you have
upfront would be helpful.

thanks,
-chris
-

To: Andi Kleen <andi@...>
Cc: Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Wednesday, July 11, 2007 - 7:48 pm

Andi,

Care to look at the patch ? It _IS_ separate.

Only HPET and PIT got shared.

tglx

-

To: Linus Torvalds <torvalds@...>
Cc: Ingo Molnar <mingo@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Wednesday, July 11, 2007 - 7:29 pm

Linus,

Sure, I meant to do the HPET changes to i386 separate as a preparatory
patch.

Sharing HPET before the conversion is nasty at best (it involves a ton

Well, we know that it works on i386, but once we turn on the x64 switch
we have not tested the shared code for x64 yet.

I try to find some practicable compromise between the big bang patch and

I said before, that I'm going to split them further.

tglx

-

To: Thomas Gleixner <tglx@...>
Cc: Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Thursday, July 12, 2007 - 4:38 pm

Can't you take the entire legacy clock system and wrap it as a single
legacy clock source? Then you take bits out of the old system and put
them as independent sources in the new system? When the legacy clock
system is empty, you remove the legacy clock source.

--
Mathematics is the supreme nostalgia of our time.
-
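The wrap-and-drain idea suggested in the message above can be sketched as a toy ownership model. This is an illustration only, with hypothetical class and device names, not kernel code: all devices start out behind a single "legacy" source, each conversion moves exactly one device out, and the wrapper disappears once nothing is left behind it. It does not address whether two controllers can safely coexist on real hardware, which is the objection raised elsewhere in this thread.

```python
class ClockRegistry:
    """Toy model: every timer device has exactly one owner at any time."""

    LEGACY = "legacy"

    def __init__(self, legacy_devices):
        self._legacy = set(legacy_devices)   # still driven by the old code
        self._converted = set()              # registered as independent sources

    def convert(self, device: str) -> None:
        """Move one device out of the legacy wrapper into the new system."""
        if device not in self._legacy:
            raise KeyError(f"{device} is not owned by the legacy wrapper")
        self._legacy.remove(device)
        self._converted.add(device)

    def owner(self, device: str) -> str:
        """Name of the source that currently drives the given device."""
        if device in self._converted:
            return device
        if device in self._legacy:
            return self.LEGACY
        raise KeyError(device)

    def registered(self) -> list:
        """Sources currently registered; the wrapper goes away once it is empty."""
        names = sorted(self._converted)
        if self._legacy:
            names.append(self.LEGACY)
        return names


if __name__ == "__main__":
    registry = ClockRegistry(["pit", "hpet", "lapic"])
    for device in ("pit", "hpet", "lapic"):
        registry.convert(device)
        print(f"after converting {device}: {registry.registered()}")
```

Each `convert()` call corresponds to one independently testable step, which is the property the gradual approach is after.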

To: Thomas Gleixner <tglx@...>
Cc: Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Wednesday, July 11, 2007 - 7:03 pm

The timer specific changes (i.e. the merges between arches) can be
done more slowly, but the setup above is basically where I started,
and it was already broken on one of my test boxes. Anyway, I'll help
you however I can, because it's important to me to get this merged.

thanks,
-chris
-

To: Ingo Molnar <mingo@...>
Cc: Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Arjan van de Ven <arjan@...>, Linus Torvalds <torvalds@...>, Chris Wright <chrisw@...>
Date: Wednesday, July 11, 2007 - 5:16 pm

Well I spent a lot of time making the x86-64 timing code work
well on a variety of machines; working around a wide variety
of hardware and platform bugs. I obviously don't agree on your description

I told him my objections privately earlier. Basically i would
like to see an actually debuggable step-by-step change, not a rip everything
out.

If that isn't possible it needs very careful review which just hasn't
happened yet. But I'm not convinced even step by step is not possible
here.

I thought it was clear that rip everything out is rarely a good idea
in Linux land? That's really not something I should need to harp on
repeatedly.

-Andi

-

To: Andi Kleen <andi@...>
Cc: Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Arjan van de Ven <arjan@...>, Linus Torvalds <torvalds@...>, Chris Wright <chrisw@...>
Date: Wednesday, July 11, 2007 - 5:46 pm

On Wed, 11 Jul 2007 23:16:38 +0200, Andi Kleen said:

I'm seeing a bit of a disconnect here. If you spent all that time making it
work, how come the guys who developed the patch are saying you didn't provide

Odd, I looked at the patchset fairly closely a number of times, as I was
hand-retrofitting the -rc[1-4] versions onto -rc[1-4]-mm kernels, and it looked
to *me* like it was a nice set of 20 or so step-by-step changes (bisectable
and everything - I got to do that once trying to figure out which one I botched).
Was there something in there that I missed?

To: <Valdis.Kletnieks@...>
Cc: Andi Kleen <andi@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Wednesday, July 11, 2007 - 6:12 pm

The patch-set itself actually looks fine, as far as I'm concerned.

But it does seem to have that "enable everything in one go" problem.

I'd much rather see one time source at a time being converted, and enabled
then and there, so that when people report problems and do a bisection, if
it was HPET that broke, you get the commit that changed HPET.

As it is, looking at that set, it *looks* like you'd get the "ok, now
enable it all" as the commit that breaks, which tells you hardly anything,
since the commit that _shows_ the behaviour has absolutely nothing to do
with the code that actually causes it.

But yeah, the patch series per se doesn't look bad. If it wasn't for me
being burnt by the last big switch-over for timers, I probably wouldn't
mind it at all, personally.

Linus
-

To: <Valdis.Kletnieks@...>
Cc: Andi Kleen <andi@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Arjan van de Ven <arjan@...>, Linus Torvalds <torvalds@...>, Chris Wright <chrisw@...>
Date: Wednesday, July 11, 2007 - 5:54 pm

I think Andi's referring to the existing x86_64 code, which gets
replaced by the patchset in question.
-

To: Chris Wright <chrisw@...>
Cc: Andi Kleen <andi@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Arjan van de Ven <arjan@...>, Linus Torvalds <torvalds@...>
Date: Wednesday, July 11, 2007 - 6:11 pm

<Takes a closer look at the patches> D'Oh! :) Yeah, the -rc4 version I'm
looking at is like a dozen 1-3K patches setting up and cleaning up, and then
one monster 65K patch doing the clockevents conversion, then another 6 or 8
small ones.

Yeah, that one big patch really doesn't look separable to me. But as I said,
I'm just a crash test dummy here. :)

Andrew - how do you feel about keeping this in the -mm tree until Linus,
Andi, Ingo, and Thomas get on the same page (which may be around the 2.6.24
merge window, by my guesstimate)?

To: <Valdis.Kletnieks@...>
Cc: Chris Wright <chrisw@...>, Andi Kleen <andi@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Arjan van de Ven <arjan@...>
Date: Wednesday, July 11, 2007 - 6:33 pm

I think it should be.

That big patch really does do a *lot* more than just the "clockevents
conversion". It does all the hpet clock setup changes etc that are about
the hardware, and have *nothing* to do with actually changing the
interfaces.

For example, look at the hpet.c part of that patch. Totally independent
cleanups of everything else.

Or look at the changes to __setup_APIC_LVTT(). Same thing.

All the actual hardware interface changes are *totally* independent of the
software interface changes, and a lot of them are just cleanups.

But those hardware interface changes are easily the things that can break,
where some cleanup results in register writes being done in a different
order or something, and so if there's a bug there (and it's not visible on
most setups), now you cannot tell where the bug is.

Another example: setup_APIC_timer() used to wait for a timer interrupt
trigger to happen on the i8259 timer (or HPET). That code just got
removed (or maybe it got moved so subtly that I just don't see it).

What has that got to do with switching from the old timer interface to the
new one?

NOTHING.

So those kinds of changes that change hardware access functions should
have been done separately. Maybe there's a machine where that early
synchronization was necessary for some subtle timing reason. If so,
removing it sounds like a bug, no? Wouldn't it have been nice to see that
removal as a separate patch that was independent of the interface switch-
over?

I'd be a *lot* happier with switching over interfaces if I thought that
the low-level hardware drivers didn't change at the same time. But they
*do* change, afaik.

Linus
-

To: <Valdis.Kletnieks@...>
Cc: Chris Wright <chrisw@...>, Andi Kleen <andi@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Arjan van de Ven <arjan@...>, Linus Torvalds <torvalds@...>
Date: Wednesday, July 11, 2007 - 6:20 pm

Well, that's supposed to be Andi's tree and aggregated by Andrew into -mm.
But keeping it in -mm isn't the hard part. It's getting enough testing
to convince Linus it's safe, since there's no simple way to enable
clockevents in a slow manner. IOW, keeping it in -mm just postpones the
issue.
-

To: Andi Kleen <andi@...>
Cc: Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Arjan van de Ven <arjan@...>, Linus Torvalds <torvalds@...>, Chris Wright <chrisw@...>
Date: Wednesday, July 11, 2007 - 5:46 pm

Hi Andi,

I'm going to change topic big time because your sentence above
perfectly applies to the O(1) scheduler too. It's not like process
schedulers are sacred and there shall be only one, while I/O
schedulers and packet schedulers are profane and there can be many of
them. FWIW IMHO the right way would have been to make the new
scheduler pluggable and switchable at runtime; too bad the old one was
ripped out instead. The difficulty of making the scheduler pluggable isn't
really enormous; patches to achieve it have been floating around, and I
even dealt with some of them myself once.

The only positive side of being forced to CFS that I can imagine is that
more testing will make it more stable and better tuned more quickly. But
I'm fairly certain Ingo's good enough to achieve that without it, perhaps
with a few more weeks.

Personally I very much like the unfairness of O(1). I'm afraid CFS
will overschedule under a certain number of workloads in its attempt
to provide completely fair queueing at all costs, and it won't deal
with the X server as nicely as O(1), but I may as well be wrong. The
only thing I'm more sure about is that the computational complexity is
higher, and that reason alone is a good technical reason to provide
both and let the java folks stick with O(1) if they want.
-

To: Andrea Arcangeli <andrea@...>
Cc: Andi Kleen <andi@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Wednesday, July 11, 2007 - 6:09 pm

I disagree to a large degree.

We almost never have problems with code you can "think about".

Sure, bugs happen, but code that everybody runs the same generally doesn't
break. So a CPU scheduler doesn't worry me all that much. CPU schedulers
are "easy".

What worries me is interfaces to hardware that we know looks different for
different people. That means that any testing that one person has done
doesn't necessarily translate to anything at *all* on another person's
machine.

The timer problems we had when merging the stuff in 2.6.21 just scarred
me. I'd _really_ hate to have to go through that again. And no, the
"gradual" thing where the patch that actually *enables* something isn't
very gradual at all, so that's the absolutely worst kind of thing, because
then people can "git bisect" to the point where it got enabled and tell us
that's where things broke, but that doesn't actually say anything at all
about the patch that actually implements the new behaviour.

So the "enable" kind of patch is actually the worst of the lot, when it
comes to hardware.

When it comes to pure software algorithms, and things like schedulers,
you'll still obviously have timing issues and tuning, but generally things
*work*, which makes it a lot easier to debug and describe.

Linus
-

To: Linus Torvalds <torvalds@...>
Cc: Andrea Arcangeli <andrea@...>, Andi Kleen <andi@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Thursday, July 12, 2007 - 10:23 pm

Hi,

A little more advance warning wouldn't have hurt though.
The new scheduler does _a_lot_ of heavy 64 bit calculations without any
attempt to scale that down a little...
One can blame me now for not having brought it up earlier, but discussions
with Ingo are not something I'm looking forward to. :(

bye, Roman
-

To: Roman Zippel <zippel@...>
Cc: Linus Torvalds <torvalds@...>, Andrea Arcangeli <andrea@...>, Andi Kleen <andi@...>, Ingo Molnar <mingo@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Friday, July 13, 2007 - 12:40 am

I brought that up a couple of weeks ago, got handwaved at and gave up.

It still isn't obvious to me that all that arith needs to be 64-bit
on 32-bit machines, or even on 64-bit. 4e9 is a big number.
-

To: Roman Zippel <zippel@...>
Cc: Linus Torvalds <torvalds@...>, Andrea Arcangeli <andrea@...>, Andi Kleen <andi@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Friday, July 13, 2007 - 12:47 am

See prio_to_weight[], prio_to_wmult[] and sysctl_sched_stat_granularity.
Perhaps more can be done, but "without any attempt..." isn't accurate.
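
The point of a table of inverse weights is to replace a division by a task's weight with a multiplication and a shift. The following is only a minimal sketch of that idea, assuming the inverse is stored as 2^32 / weight; the helper names are hypothetical and are not the kernel's own functions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: precompute the inverse of a weight, scaled by 2^32. */
static uint32_t weight_to_wmult(uint32_t weight)
{
	assert(weight > 1);	/* 2^32 / 1 would not fit into 32 bits */
	return (uint32_t)((UINT64_C(1) << 32) / weight);
}

/* Approximates delta / weight with one multiply and one shift, no divide. */
static uint64_t scale_by_inverse(uint32_t delta, uint32_t wmult)
{
	return ((uint64_t)delta * wmult) >> 32;
}
```

For a weight of 1024 the inverse is exactly 2^22, so the result matches an integer division. For weights that are not powers of two, the truncated inverse can make the result differ from a true division by one in the last place.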

-Mike

-

To: Mike Galbraith <efault@...>
Cc: Linus Torvalds <torvalds@...>, Andrea Arcangeli <andrea@...>, Andi Kleen <andi@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Friday, July 13, 2007 - 1:23 pm

Hi,

Calculating these values at runtime would have been completely insane, the
alternative would be a crummy approximation, so using a lookup table is
actually a good thing. That's not the problem.
BTW could someone please verify the prio_to_wmult table, especially [16]
and [21] look a little off, like a digit was cut off.

While I'm at this, the 10% scaling there looks a little much (unless there
are other changes I haven't looked at yet), the old code used more like
5%. This would mean a prio -20 task would get 98.86% cpu time compared to
a prio 0 task, that was previously about the difference between -20 and
19 (and it would have previously gotten only 88.89%), now a prio -20 task
would get 99.98% cpu time compared to a prio 19 task.
The individual levels are unfortunately not that easily comparable, but at
the overall scale the change looks IMHO a little drastic.
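
The percentages above can be reproduced with a small calculation. This sketch assumes that each nice level multiplies a task's weight by a fixed factor and that two runnable tasks share the CPU in proportion to their weights; the factor 1.25 is an assumption chosen because it reproduces the 98.86% and 99.98% figures, and the function name is hypothetical.

```c
#include <assert.h>

/*
 * CPU share of the higher-priority task when two runnable tasks are
 * `levels` nice levels apart and every level multiplies the weight
 * by `step`.
 */
static double cpu_share(int levels, double step)
{
	double ratio = 1.0;

	assert(levels >= 0 && step > 0.0);
	for (int i = 0; i < levels; i++)
		ratio *= step;
	/* weight ratio r : 1 gives a share of r / (r + 1) */
	return ratio / (ratio + 1.0);
}
```

With a step of 1.25, 20 levels give a share of about 0.9886 and 39 levels about 0.9998. The 88.89% quoted for the old code corresponds to a plain 8:1 ratio, since 8 / 9 is about 0.8889.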

bye, Roman
-

To: Roman Zippel <zippel@...>
Cc: Linus Torvalds <torvalds@...>, Andrea Arcangeli <andrea@...>, Andi Kleen <andi@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Saturday, July 14, 2007 - 1:04 am

I meant see usage.

-Mike

-

To: Mike Galbraith <efault@...>
Cc: Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Subject: CFS review
Date: Tuesday, July 31, 2007 - 11:41 pm

Hi,

I meant more serious attempts. At this point I'm not that much interested
in a few localized optimizations, what I'm interested in is how this can be
optimized at the design level (e.g. how arch information can be used to
simplify things). So I spent quite a bit of time looking through cfs and
experimenting with some ideas. I want to put the main focus on the
performance aspect, but there are a few other issues as well.

But first something else (especially for Ingo): I tried to be very careful
with any claims made in this mail, but this of course doesn't exclude the
possibility of errors, in which case I'd appreciate any corrections. Any
explanations done in this mail don't imply that anyone needs any such
explanations, they're done to keep things in context, so that interested
readers have a chance to follow even if they don't have the complete
background information. Any suggestions made don't imply that they have to
be implemented like this, they are more of an incentive for further
discussion and I'm always interested in better solutions.

A first indication that something may not be quite right is the increase
in code size:

2.6.22:
   text    data     bss     dec     hex filename
  10150      24    3344   13518    34ce kernel/sched.o

recent git:
   text    data     bss     dec     hex filename
  14724     228    2020   16972    424c kernel/sched.o

That's i386 without stats/debug. A lot of the new code is in regularly
executed regions and it's often not exactly trivial code, as cfs added
lots of heavy 64bit calculations. With the increased text comes
increased runtime memory usage, e.g. task_struct increased so that only
5 of them instead of 6 now fit into 8KB.

sched-design-CFS.txt doesn't really go into any serious detail, so
the EEVDF paper was more helpful. After playing with the ideas a
little I noticed that the whole idea of fair scheduling can be explained
somewhat more simply, and I'm a little surprised not to find it mentioned
anywhere.
...

To: Roman Zippel <zippel@...>
Cc: Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Thursday, August 2, 2007 - 11:46 am

yeah, thanks for the reminder, this is on my todo list. As i suspect you
noticed it too, much of the task_struct size increase is not fundamental
and not related to 64-bit math at all - it's simply debug and
instrumentation overhead.

Look at the following table (i386, nodebug):

               size
               ----
   pre-CFS     1328
   CFS         1472
   CFS+patch   1376

the very small patch below gets rid of 96 bytes. And that's only the
beginning.
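
The sizes in the table can be related to the earlier "5 instead of 6 in 8KB" remark with a one-line calculation. This is only an arithmetic sketch; the helper name is hypothetical, and it ignores any allocator overhead or alignment padding.

```c
#include <assert.h>
#include <stddef.h>

/* How many objects of `size` bytes fit into a region of `region` bytes. */
static size_t objects_per_region(size_t region, size_t size)
{
	assert(size > 0);
	return region / size;
}
```

With an 8192-byte region, 1328 bytes gives 6 objects, while both 1472 and 1376 bytes give 5. Under these assumptions the structure would have to shrink to 1365 bytes or less before a sixth one fits again.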

Ingo

-------------------------------------------------->
---
 include/linux/sched.h |   21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)

Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -905,23 +905,28 @@ struct sched_entity {
 	struct rb_node		run_node;
 	unsigned int		on_rq;
 
+	u64			exec_start;
+	u64			sum_exec_runtime;
 	u64			wait_start_fair;
+	u64			sleep_start_fair;
+
+#ifdef CONFIG_SCHEDSTATS
 	u64			wait_start;
-	u64			exec_start;
+	u64			wait_max;
+	s64			sum_wait_runtime;
+
 	u64			sleep_start;
-	u64			sleep_start_fair;
-	u64			block_start;
 	u64			sleep_max;
+	s64			sum_sleep_runtime;
+
+	u64			block_start;
 	u64			block_max;
 	u64			exec_max;
-	u64			wait_max;
-	u64			last_ran;
 
-	u64			sum_exec_runtime;
-	s64			sum_wait_runtime;
-	s64			sum_sleep_runtime;
 	unsigned long		wait_runtime_overruns;
 	unsigned long		wait_runtime_underruns;
+#endif
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	struct sched_entity	*parent;
 	/* rq on which this entity is (to be) queued: */
-

To: Roman Zippel <zippel@...>
Cc: Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Wednesday, August 1, 2007 - 9:20 am

A jiffies-based sched_clock should soon be very rare. It's probably
not worth optimizing for it.

-Andi
-

To: Andi Kleen <andi@...>
Cc: Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Wednesday, August 1, 2007 - 9:33 am

Hi,

I'm not so sure about that. sched_clock() has to be fast, so many archs
may want to continue to use jiffies. As soon as one does that one can also
save a lot of computational overhead by using 32bit instead of 64bit.
The question is then how easily that can be done.

bye, Roman
-

To: Roman Zippel <zippel@...>
Cc: Andi Kleen <andi@...>, Mike Galbraith <efault@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Wednesday, August 1, 2007 - 10:17 pm

I have to say, it would be interesting to try to use 32-bit arithmetic.

I also think it's likely a mistake to do a nanosecond resolution. That's
part of what forces us to 64 bits, and it's just not even an *interesting*
resolution.

It would be better, I suspect, to make the scheduler clock totally
distinct from the other clock sources (many architectures have per-cpu
cycle counters), and *not* try to even necessarily force it to be a
"time-based" one.

So I think it would be entirely appropriate to

- do something that *approximates* microseconds.

Using microseconds instead of nanoseconds would likely allow us to do
32-bit arithmetic in more areas, without any real overflow.

And quite frankly, even on fast CPU's, the scheduler is almost
certainly not going to be able to take any advantage of the nanosecond
resolution. Just about anything takes a microsecond - including IO. I
don't think nanoseconds are worth the ten extra bits they need, if we
could do microseconds in 32 bits.
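
The "ten extra bits" can be checked directly: a factor of 1000 between nanoseconds and microseconds needs ten bits, and it decides how long a 32-bit counter lasts before it wraps. The sketch below is just that arithmetic; both function names are hypothetical.

```c
#include <stdint.h>

/* Seconds until a 32-bit counter wraps at `ticks_per_sec` ticks per second. */
static double wrap_seconds_32(double ticks_per_sec)
{
	return 4294967296.0 / ticks_per_sec;	/* 2^32 ticks */
}

/* Number of bits needed to distinguish `n` different values. */
static unsigned bits_needed(uint64_t n)
{
	unsigned bits = 0;

	while (bits < 64 && (UINT64_C(1) << bits) < n)
		bits++;
	return bits;
}
```

At nanosecond resolution a 32-bit value wraps after roughly 4.3 seconds; at microsecond resolution it lasts roughly 4295 seconds, a little over 71 minutes.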

And the "approximates" thing would be about the fact that we don't
actually care about "absolute" microseconds as much as something that
is in the "roughly a microsecond" area. So if we say "it doesn't have
to be microseconds, but it should be within a factor of two of a microsecond",
we could avoid all the expensive divisions (even if they turn into
multiplications with reciprocals), and just let people *shift* the CPU
counter instead.
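
A minimal sketch of the "just shift the CPU counter" idea, assuming a free-running cycle counter whose nominal frequency is known once at boot. The function names are hypothetical; the shift is chosen so that the shifted counter ticks at between 1 MHz and 2 MHz, which keeps one unit within a factor of two of a real microsecond.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pick the largest shift for which (counter_hz >> shift) is still at
 * least 1 MHz, so the shifted counter ticks at 1 MHz or more but at
 * less than 2 MHz.
 */
static unsigned pick_shift(uint64_t counter_hz)
{
	unsigned shift = 0;

	assert(counter_hz >= 1000000);
	/* Stop once halving the rate again would drop it below 1 MHz. */
	while ((counter_hz >> (shift + 1)) >= 1000000)
		shift++;
	return shift;
}

/* Approximate microseconds: one shift, no multiply and no divide. */
static uint32_t approx_usecs(uint64_t cycles, unsigned shift)
{
	return (uint32_t)(cycles >> shift);
}
```

For a 2.4 GHz counter this picks a shift of 11, so one second of cycles reads as 1171875 units instead of 1000000. That error is the same for every task, which is all that relative comparisons need.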

In fact, we could just say that we don't even care about CPU counters
that shift frequency - so what? It gets a bit further off the "ideal
microsecond", but the scheduler just cares about _relative_ times
between tasks (and that the total latency is within some reasonable
value), it doesn't really care about absolute time.

Hmm?

It would still be true that something that is purely based on timer ticks
will always be liable to have rounding errors that will inevitably mean
that you don...

To: Linus Torvalds <torvalds@...>
Cc: Andi Kleen <andi@...>, Mike Galbraith <efault@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Thursday, August 2, 2007 - 7:23 pm

Hi,

The basic problem is that one needs a number of bits (at least 16) for
normalization, which limits the time range one can work with. This means
that 32 bit leaves only room for 1 millisecond resolution, the remainder
could maybe be saved and reused later.
So AFAICT using micro- or nanosecond resolution doesn't make much
computational difference.

bye, Roman
-

To: Linus Torvalds <torvalds@...>
Cc: Roman Zippel <zippel@...>, Andi Kleen <andi@...>, Mike Galbraith <efault@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Thursday, August 2, 2007 - 3:16 pm

Hi Linus,

On that theme, expressing the subsecond part of high precision time in
decimal instead of left-aligned binary always was an insane idea.
Applications end up with silly numbers of multiplies and divides
(likely as not incorrect) whereas they would often just need a simple
shift as you say, if the tv struct had been defined sanely from the
start. As a bonus, whenever precision gets bumped up, the new bits
appear on the right in formerly zero locations, meaning
little if any code needs to change. What we have in the incumbent libc
timeofday scheme is the moral equivalent of BCD.

Of course libc is unlikely ever to repent, but we can at least put off
converting into the awkward decimal format until the last possible
instant. In other words, I do not see why xtime is expressed as a tv
instead of simple 32.32 fixed point. Perhaps somebody can enlighten
me?
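
To make the 32.32 proposal concrete, here is a minimal sketch assuming whole seconds in the upper 32 bits and a binary fraction of a second in the lower 32 bits. The type and function names are hypothetical and are not an existing kernel or libc interface.

```c
#include <assert.h>
#include <stdint.h>

/* Upper 32 bits: whole seconds. Lower 32 bits: binary fraction of a second. */
typedef uint64_t fix32_32;

/* Convert a decimal (seconds, microseconds) pair into 32.32 fixed point. */
static fix32_32 fix_from_timeval(uint32_t sec, uint32_t usec)
{
	assert(usec < 1000000);
	return ((uint64_t)sec << 32) | (((uint64_t)usec << 32) / 1000000);
}

/* Convert the fractional part back to decimal microseconds (rounds down). */
static uint32_t fix_usec_part(fix32_32 t)
{
	return (uint32_t)(((t & 0xffffffffu) * 1000000) >> 32);
}
```

In this representation time differences are a single 64-bit subtraction, and the decimal conversion only happens at the boundary where a timeval is actually required. The conversions round down, so a round trip can lose up to one microsecond for values that are not exact binary fractions.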

Regards,

Daniel
-

To: Linus Torvalds <torvalds@...>
Cc: Roman Zippel <zippel@...>, Andi Kleen <andi@...>, Mike Galbraith <efault@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Thursday, August 2, 2007 - 12:09 pm

yeah. Note that i largely detached sched_clock() from the GTOD
clocksources already in CFS, so part of this is already implemented and
the intention is clear. For example, when the following happens:

Marking TSC unstable due to: possible TSC halt in C2.
Clocksource tsc unstable (delta = -71630388 ns)

sched_clock() does _not_ stop using the TSC. It is very careful with the
TSC value, it checks against wraps, jumping, etc. (the whole rq_clock()
wrapper around sched_clock()), but still tries to use the highest
resolution time source possible, even if that time source is not good
enough for GTOD's purposes anymore. So the scheduler clock is already

Note that there is a relatively easy way of reducing the effects of such
intentional coupling: turn on CONFIG_HIGH_RES_TIMERS. That decouples the
scheduler tick from the jiffy tick and works against such 'exploits' -
_even_ if the scheduler clock is otherwise low resolution. Also enable
CONFIG_NO_HZ and the whole thing (of when the scheduler tick kicks in)
becomes very hard to predict.

[ So while in a low-res clock situation scheduling will always be less
precise, with hres-timers and dynticks we have a natural 'random
sampler' mechanism so that no task can couple to the scheduler tick -
accidentally or even intentionally.

The only 'unavoidable coupling' scenario is when the hardware has only
a single, low-resolution time sampling method. (that is pretty rare
though, even in the ultra-embedded space. If a box has two independent
hw clocks, even if they are low resolution, the timer tick can be

yeah. We tried to do as much of that as possible, please read on below
for (many) more details. There's no short summary i'm afraid :-/

Most importantly, CFS _already_ includes a number of measures that act
against too frequent math. So even though you can see 64-bit math code
in it, it's only rarely called if your clock has a low resolution - and
that happens all automatically! (see below the de...

To: Ingo Molnar <mingo@...>
Cc: Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Mike Galbraith <efault@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Thursday, August 2, 2007 - 6:38 pm

Hi,

You're comparing apples with oranges, I explicitly said:

"At this point I'm not that much interested in a few localized
optimizations, what I'm interested in is how this can be optimized at the
design level"

IMO it's very important to keep computational and algorithmic complexity
separate. I want to concentrate on the latter, so unless you can _prove_
that a similar set of optimizations is impossible within my example, I'm
going to ignore them for now. CFS has already gone through several
versions of optimization and tuning, expecting the same from my design
prototype is a little confusing...

I want to analyze the foundation CFS is based on, in the review I
mentioned a number of other issues and design related questions. If you
need more time, that's fine, but I'd appreciate more background
information related to that and not that you only jump on the more trivial

Come on, Ingo, you can do better than that, I did mention in my review
some of the requirements for the data types.
I'm amazed how you can get to that judgement so quickly, could you please
substantiate that a little more?

I admit that the lack of source comments is an open invitation for further
questions and Peter did exactly this and his comments were great - I'm
hoping for more like that. You OTOH jump to conclusions based on a partial
understanding of what I'm actually trying to do.
Ingo, how about you provide some of the mathematical proof CFS is based
on? Can you prove that the rounding errors are irrelevant? Can you prove
that all the limit checks can have no adverse effect? I tried that and I'm
not entirely convinced of that, but maybe it's just me, so I'd love to see
someone else's attempt at this.
A major goal of my design is to be able to define the limits within which
the scheduler works correctly, so I know which information is relevant
and what can be approximated.

bye, Roman
-

To: Linus Torvalds <torvalds@...>
Cc: Roman Zippel <zippel@...>, Andi Kleen <andi@...>, Mike Galbraith <efault@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Thursday, August 2, 2007 - 12:57 am

I would add that I have been bothered by the 64-bit arithmetic when
trying to see what could be improved in the code. In fact, it's very
hard to optimize anything when you have arithmetic on integers larger
than the CPU's, and gcc is known not to emit very good code in this
situation (I remember it could not play with register renaming, etc...).

However, I understand why Ingo chose to use 64 bits. It has the advantage
that the numbers never wrap within 584 years. I'm well aware that it's
very difficult to keep tasks ordered according to a key which can wrap.

But if we consider that we don't need to be more precise than the return
value from gettimeofday() that all applications use, we see that a bunch
of microseconds is enough. 32 bits at the microsecond level wraps around
every hour. We may accept to recompute all keys every hour. It's not that
dramatic. The problem is how to detect that we will need to.

I remember a trick used by Tim Schmielau in his jiffies64 patch for 2.4.
He kept a copy of the highest bit of the lower word in the lowest bit of
the higher word, and considered that the lower one could not wrap before
we could check it. I liked this approach, which could be translated here
in something like the following :

Have all keys use 32-bit resolution, and monitor the 32nd bit. All tasks
must have the same value in this bit, otherwise we consider that their
keys have wrapped. The "current" value of this bit is copied somewhere.
When we walk the tree and find a task with a key which does not have its
32nd bit equal to the current value, it means that this key has wrapped,
so we have to use this information in our arithmetics.

When all keys have their 32nd bit different from the "current" value,
then we switch this value to reflect the new 32nd bit, and everything is
in sync again. The only requirement is that no key wraps around before
the "current" value is switched. This implies that no couple of tasks
could have their keys distant by more than 31 bits (35 minut...

To: Willy Tarreau <w@...>
Cc: Linus Torvalds <torvalds@...>, Roman Zippel <zippel@...>, Andi Kleen <andi@...>, Mike Galbraith <efault@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Thursday, August 2, 2007 - 6:43 am

This has changed in recent gccs. It doesn't force register pairs anymore.

If you define an appropriate window and use some macros for the comparisons

gettimeofday() has too strict requirements, that make it unnecessarily slow

You don't need to recompute keys; just use careful comparisons using

If you're worried about wrapping in one hour why is wrapping in two
hours not a problem?
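
The sentence about careful comparisons is cut off in this archive, so what follows is an assumption about the kind of comparison meant: the window-based trick that the kernel's time_after() macro applies to jiffies. The function name below is hypothetical. It stays correct across a wrap as long as the two keys are never more than 2^31 ticks apart.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True if key `a` lies after key `b`, assuming the two keys are less
 * than 2^31 ticks apart. The unsigned subtraction wraps modulo 2^32,
 * and the sign of the result tells which key is ahead.
 */
static bool key_after(uint32_t a, uint32_t b)
{
	return (int32_t)(b - a) < 0;
}
```

With this, a key of 5 taken just after a wrap still compares as later than 0xFFFFFFF0 taken just before it, so keys do not have to be recomputed when the counter wraps. At microsecond resolution the 2^31 window is a little under 36 minutes.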

I have one request though. If anybody adds anything complicated
for this please make it optional so that 64bit platforms are not
burdened by it.

-Andi
-

To: Roman Zippel <zippel@...>
Cc: Andi Kleen <andi@...>, Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Wednesday, August 1, 2007 - 10:36 am

i think Andi was talking about the vast majority of the systems out
there. For example, check out the arch demography of current Fedora
installs (according to the Smolt opt-in UUID based user metrics):

http://smolt.fedoraproject.org/

i686: 74743
x86_64: 18599
i386: 1208
ppc: 527
ppc64: 396
sparc64: 14
---------------
Total: 95488

even pure i386 (kernels, not systems) is only 1.2% of all installs. By
the time the CFS kernel gets into a distro (a few months at minimum,
typically a year) this percentage will go down further. And embedded
doesn't really care about task-statistics corner cases [ (it likely
doesn't have 'top' installed - likely doesn't even have /proc mounted or
even built in ;-) ].

of course CFS should not do _worse_ stats than what we had before, and
should not break or massively misbehave. Also, anything sane we can do
for low-resolution arches we should do (and we already do quite a bit -
the whole wmult stuff is to avoid expensive divisions) - and i regularly
booted CFS with a low-resolution clock to make sure it works. So i'm not
trying to duck anything, we've just got to keep our design priorities
right :-)

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: Roman Zippel <zippel@...>, Andi Kleen <andi@...>, Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Wednesday, August 1, 2007 - 12:11 pm

I meant that in many cases where the TSC is considered unreliable
today it'll be possible to use it anyways at least for sched_clock()
(and possibly even gtod())

The exception would be systems which really have none, but there
should be very few of those.

-Andi
-

To: Roman Zippel <zippel@...>
Cc: Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Wednesday, August 1, 2007 - 10:40 am

sched_yield() is being reworked at the moment. But in general we want
apps to move away to sane locking constructs ASAP. There's some movement
in the 3D space at least.

Ingo
-

To: Roman Zippel <zippel@...>
Cc: Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Wednesday, August 1, 2007 - 7:37 am

that's without CONFIG_SMP, right? :-) On SMP they are about net break
even:

   text    data     bss     dec     hex filename
  26535    4173      24   30732    780c kernel/sched.o-2.6.22
  28378    2574      16   30968    78f8 kernel/sched.o-2.6.23-git

(plus a further ~1.5K per CPU data reduction which is not visible here)

btw., here's the general change in size of a generic vmlinux from .22 to
.23-git, using the same .config:

   text    data     bss     dec     hex filename
5256628  520760 1331200 7108588  6c77ec vmlinux.22
5306918  535844 1327104 7169866  6d674a vmlinux.23-git

+50K. (this was on UP)

In any case, there's still some debugging code in the scheduler (beyond
SCHED_DEBUG), i'll work some more on reducing it.

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Wednesday, August 1, 2007 - 8:27 am

Hi,

That's still quite an increase in some rather important code paths and
it's not just the code size, but also code complexity which is important

That's why I mentioned the increased runtime memory usage...

bye, Roman
-

To: Roman Zippel <zippel@...>
Cc: Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Wednesday, August 1, 2007 - 7:22 am

Mike and I have managed to reproduce similar-looking 'top' output,
but it takes some effort: we had to deliberately run a non-TSC
sched_clock(), CONFIG_HZ=100, !CONFIG_NO_HZ and !CONFIG_HIGH_RES_TIMERS.

in that case 'top' accounting symptoms similar to the above are not due
to the scheduler starvation you suspected, but due to the effect of a
low-resolution scheduler clock and a tightly coupled timer/scheduler
tick to it. I tried the very same workload on 2.6.22 (with the same
.config) and i saw similarly anomalous 'top' output. (Not only can one
create really anomalous CPU usage, one can completely hide tasks from
'top' output.)

if your test-box has a high-resolution sched_clock() [easily possible]
then please send us the lt.c and l.c code so that we can have a look.

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: Roman Zippel <zippel@...>, Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Thursday, August 2, 2007 - 11:04 pm

..which is pretty much the state of play for lots of non-x86 hardware.

--
Mathematics is the supreme nostalgia of our time.
-

To: Matt Mackall <mpm@...>
Cc: Ingo Molnar <mingo@...>, Roman Zippel <zippel@...>, Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Thursday, August 2, 2007 - 11:57 pm

question is if it's significantly worse than before. With a 100 or
1000Hz timer, you can't expect perfect fairness just due to the
extremely rough measurement of time spent...

-

To: Arjan van de Ven <arjan@...>
Cc: Ingo Molnar <mingo@...>, Roman Zippel <zippel@...>, Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Friday, August 3, 2007 - 12:38 am

Indeed. I'm just pointing out that not having TSC, fast HZ, no-HZ
mode, or high-res timers should not be treated as an unusual
circumstance. That's a PC-centric view.

--
Mathematics is the supreme nostalgia of our time.
-

To: Matt Mackall <mpm@...>
Cc: Arjan van de Ven <arjan@...>, Ingo Molnar <mingo@...>, Roman Zippel <zippel@...>, Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Friday, August 3, 2007 - 5:29 am

The question is if it would be that hard to add TSC equivalent
sched_clock() support to more systems. At least a lot of CPUs I have
ever looked at had some kind of fast clock available. Perhaps it's
more laziness of the developers or cut'n'paste that these are not as
widely used as they should be?

-Andi
-

To: Matt Mackall <mpm@...>
Cc: Arjan van de Ven <arjan@...>, Roman Zippel <zippel@...>, Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Friday, August 3, 2007 - 4:44 am

actually, you dont need high-res or fast HZ or TSC to reduce those timer
artifacts: all you need is _two_ (low-res, slow) hw clocks.

Most platforms do have that (even the really really cheap ones), but
arches do not set up the scheduler tick on one of them and the timer tick
on the other, and skew the periodic-timer programming setup a bit (by
nature of physics they are usually already skewed a bit) so that the
scheduler tick and timer tick are not coupled. This whole thing is not a
big deal on embedded anyway. (you dont get students log in to the
toaster or to the fridge to run timer exploits, do you? :-)
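
[Editor's sketch] The effect of coupling versus skewing the two ticks can be shown with a toy model. Everything here is an assumption made for illustration, not kernel code: the task is woken by a timer every timer_period_us, starts running WAKE_LATENCY_US later and stays busy for busy_us, while a separate accounting clock samples "who is on the CPU" every sample_period_us.

```c
/* Assumed delay between the timer firing and the task actually running. */
#define WAKE_LATENCY_US 1

/*
 * Count how many of n_samples accounting samples land while the
 * timer-driven task is on the CPU. Samples are taken at
 * t = 0, sample_period_us, 2 * sample_period_us, ...
 */
long sampled_hits(long sample_period_us, long timer_period_us,
                  long busy_us, long n_samples)
{
    long hits = 0;

    for (long k = 0; k < n_samples; k++) {
        long long t = (long long)k * sample_period_us;
        long phase = (long)(t % timer_period_us);

        /* The task occupies [WAKE_LATENCY_US, WAKE_LATENCY_US + busy_us). */
        if (phase >= WAKE_LATENCY_US && phase < WAKE_LATENCY_US + busy_us)
            hits++;
    }
    return hits;
}
```

With both periods at 10000 us (the coupled case) a task that is busy for 9000 us of every period is never sampled: sampled_hits(10000, 10000, 9000, 10000) is 0. Skewing the sampling clock to 10007 us makes the samples drift across the period, and sampled_hits(10007, 10000, 9000, 10000) is 9000, i.e. the expected 90%.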

Ingo
-

To: Arjan van de Ven <arjan@...>
Cc: Matt Mackall <mpm@...>, Ingo Molnar <mingo@...>, Roman Zippel <zippel@...>, Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Friday, August 3, 2007 - 12:18 am

Well, at least we're able to *measure* that task 'l' used 3.3% and
that tasks 'lt' used 32%. If we're able to measure it, then that's
already fine enough to be able to adjust future timeslice credits.
Granted it may be rough for small periods (a few jiffies), but it
should be fair for larger periods. Or at least it should *report*
some fair distribution.

Willy

-

To: Willy Tarreau <w@...>
Cc: Matt Mackall <mpm@...>, Ingo Molnar <mingo@...>, Roman Zippel <zippel@...>, Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Friday, August 3, 2007 - 12:31 am

but the testcase here uses a LOT shorter time than jiffies... not "a few
jiffies".

--
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org

-

To: Arjan van de Ven <arjan@...>
Cc: Matt Mackall <mpm@...>, Ingo Molnar <mingo@...>, Roman Zippel <zippel@...>, Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Friday, August 3, 2007 - 12:53 am

But if we rely on the same sampling method, at least we will report
something consistent with what happens. And sampling is often the
correct method to get finer resolution on a macroscopic scale.

I mean, we're telling users that we include the "completely fair scheduler"
in 2.6.23, a scheduler which will ensure that all tasks get a fair share of
CPU time. A user starts top and sees 33%+32%+32%+3% for 4 tasks while he
would have expected to see 25%+25%+25%+25%. You can try to explain to users
that it's the fairest distribution, but they will have a hard time believing
it, especially when they measure the time spent on CPU with the "time"
command. OK this is all sampling, but we should try to avoid relying on
different sources of data for computation and reporting. Time and Top
should report something close to 4*25% for comparable tasks. And if not,
because of some sampling problem, maybe the scheduler cannot be that fair
in some situations, but either it should make use of the sampling time
and top use, or top and time should rely on the view of the scheduler.

I'll try to quickly hack up a program which makes use of rdtsc from
userspace to precisely measure user-space time, and disable TSC use
from the kernel to see how the values diverge.
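
[Editor's sketch] A user-space measurement of this kind might look roughly as follows. Note the swap: this uses the portable clock_gettime(CLOCK_MONOTONIC) rather than rdtsc, so it is a stand-in for the idea and not the program described above. The helper names now_ns() and spin_for_ns() are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Monotonic timestamp in nanoseconds. */
int64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/*
 * Busy-loop until at least want_ns have passed and return the elapsed
 * time actually measured. This is wall-clock time: if the task is
 * preempted, the gap shows up in the result.
 */
int64_t spin_for_ns(int64_t want_ns)
{
    int64_t start = now_ns();
    int64_t t;

    do {
        t = now_ns();
    } while (t - start < want_ns);

    return t - start;
}
```

Comparing the sum of such measured intervals against what 'top' and 'time' report for the same task would show how far the kernel's accounting diverges from what the task observed itself.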

Regards,
Willy

-

To: Ingo Molnar <mingo@...>
Cc: Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Wednesday, August 1, 2007 - 8:21 am

Hi,

I used my old laptop for these tests, where tsc is indeed disabled due to

Well, it magnifies the rounding problems in CFS.
I mainly wanted to test the behaviour of CFS a little and I thought I saw a
patch which enabled the use of TSC in these cases, so I didn't check
sched_clock().

Anyway, I want to point out that this wasn't the main focus of what I
wrote.

bye, Roman
-

To: Roman Zippel <zippel@...>
Cc: Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Wednesday, August 1, 2007 - 9:59 am

why do you say that? 2.6.22 behaves similarly with a low-res
sched_clock(). This has nothing to do with 'rounding problems'!

i tried your fl.c and if sched_clock() is high-resolution it's scheduled
_perfectly_ by CFS:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5906 mingo     20   0  1576  244  196 R 71.2  0.0   0:30.11 l
 5909 mingo     20   0  1844  344  260 S  9.6  0.0   0:04.02 lt
 5907 mingo     20   0  1844  508  424 S  9.5  0.0   0:04.01 lt
 5908 mingo     20   0  1844  344  260 S  9.5  0.0   0:04.02 lt

if sched_clock() is low-resolution then indeed the 'lt' tasks will
"hide":

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2366 mingo     20   0  1576  248  196 R 99.9  0.0   0:07.95 loop_silent
    1 root      20   0  2132  636  548 S  0.0  0.0   0:04.64 init

but that's nothing new. CFS cannot conjure up time measurement methods
that do not exist. If you have a low-res clock and if you create an app
that syncs precisely to the tick of that clock via timers that run off
that exact tick then there's nothing the scheduler can do about it. It
is false to characterise this as 'sleeper starvation' or 'rounding
error' like you did. No amount of rounding logic can create a
high-resolution clock out of thin air.

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Wednesday, August 1, 2007 - 11:44 am

Hi,

Please calm down. You apparently already get worked up about one of the
secondary problems. I didn't say 'sleeper starvation' or 'rounding
error', these are your words and it's your perception of what I said.

sched_clock() can have a low resolution, which can be a problem for the
scheduler. This is all this program demonstrates. If and how this problem
should be solved is a completely different issue, about which I haven't
said anything yet and since it's not that important right now I'll leave
it at that for now.

bye, Roman
-

To: Roman Zippel <zippel@...>
Cc: Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Wednesday, August 1, 2007 - 1:41 pm

Oh dear :-) It was indeed my perception that yesterday you said:

| A problem here is that this can be exploited, if a job is spread over
| ^^^^^^^^^^^^^^^^^^^^^
| a few threads, they can get more time relativ to other tasks, e.g. in
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| this example there are three tasks that run only for about 1ms every
| 3ms, but they get far more time than should have gotten fairly:
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
| 4544 roman 20 0 1796 520 432 S 32.1 0.4 0:21.08 lt
| 4545 roman 20 0 1796 344 256 R 32.1 0.3 0:21.07 lt
| 4546 roman 20 0 1796 344 256 R 31.7 0.3 0:21.07 lt
| 4547 roman 20 0 1532 272 216 R 3.3 0.2 0:01.94 l

[ http://lkml.org/lkml/2007/7/31/668 ]

( the underlined portion, in other words, is called 'starvation'.)

And again today, i clearly perceived you to say:

| > in that case 'top' accounting symptoms similar to the above are not
| > due to the scheduler starvation you suspected, but due to the effect of
| > a low-resolution scheduler clock and a tightly coupled
| > timer/scheduler tick to it.
|
| Well, it magnifies the rounding problems in CFS.

[ http://lkml.org/lkml/2007/8/1/153 ]

But you are right, that must be my perception alone, you couldnt
possibly have said any of that =B-)

Or are you perhaps one of those who claims that saying something
analogous to sleeper starvation does not equal to talking about 'sleeper
starvation' and saying something about 'rounding problems in CFS' does
in no way mean you were talking about rounding errors? :-)

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Wednesday, August 1, 2007 - 2:14 pm

Hi,

*sigh* and here you go off again nitpicking on a minor issue just to prove
your point...
When I wrote the earlier stuff I hadn't realized it was resolution
related, so things have to be put into proper context and you make it
a little easy for yourself by equating them.
Yippi, you found another small error I made, can we drop this now? Please?

bye, Roman
-

To: Ingo Molnar <mingo@...>
Cc: Roman Zippel <zippel@...>, Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Wednesday, August 1, 2007 - 10:04 am

CFS is only as fair as your clock is good.

-

To: Roman Zippel <zippel@...>
Cc: Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Wednesday, August 1, 2007 - 8:23 am

please send all the debug info and source code we asked for - thanks!

Ingo
-

To: Roman Zippel <zippel@...>
Cc: Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Wednesday, August 1, 2007 - 3:12 am

Roman,

Thanks for the testing and the feedback, it's much appreciated! :-) On
what platform did you do your tests, and what .config did you use (and
could you please send me your .config)?

Please also send me the output of this script:

http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh

(if the output is too large send it to me privately, or bzip2 -9 it.)

Could you also please send the source code for the "l.c" and "lt.c" apps
you used for your testing so i can have a look. Thanks!

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Wednesday, August 1, 2007 - 9:19 am

Hi,

l.c is a simple busy loop (well, with the option to start many of them).
This is lt.c, what it does is to run a bit less than a jiffy, so it
needs a low resolution clock to trigger the problem:

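/* lt.c: may need -lrt at link time for timer_create() on older glibc */
#include <stdio.h>
#include <stdlib.h>	/* atoi() */
#include <signal.h>
#include <time.h>
#include <unistd.h>	/* fork(), pause() */
#include <sys/time.h>

#define NSEC 1000000000
#define USEC 1000000

#define PERIOD (NSEC/1000)	/* 1 ms, in nanoseconds */

int i;

/* SIGALRM handler: busy-wait for PERIOD/1000 - 50 = 950 microseconds. */
void worker(int sig)
{
	struct timeval tv;
	long long t0, t;

	gettimeofday(&tv, 0);
	//printf("%u,%lu\n", i, tv.tv_usec);
	t0 = (long long)tv.tv_sec * USEC + tv.tv_usec + PERIOD / 1000 - 50;
	do {
		gettimeofday(&tv, 0);
		t = (long long)tv.tv_sec * USEC + tv.tv_usec;
	} while (t < t0);
}

int main(int ac, char **av)
{
	int cnt;
	timer_t timer;
	struct itimerspec its;
	struct sigaction sa;

	if (ac < 2) {
		fprintf(stderr, "usage: %s <count>\n", av[0]);
		return 1;
	}
	cnt = i = atoi(av[1]);

	sa.sa_handler = worker;
	sa.sa_flags = 0;
	sigemptyset(&sa.sa_mask);

	sigaction(SIGALRM, &sa, 0);

	/* Common start time; every process fires once per cnt * PERIOD. */
	clock_gettime(CLOCK_MONOTONIC, &its.it_value);
	its.it_interval.tv_sec = 0;
	its.it_interval.tv_nsec = PERIOD * cnt;

	/* Fork cnt - 1 children; each process leaves with its own i in 0..cnt-1. */
	while (--i > 0 && fork() > 0)
		;

	/* Stagger the first expiry of each process by i * PERIOD. */
	its.it_value.tv_nsec += i * PERIOD;
	if (its.it_value.tv_nsec >= NSEC) {
		its.it_value.tv_sec++;
		its.it_value.tv_nsec -= NSEC;
	}

	/* A null sigevent means the timer delivers SIGALRM on expiry. */
	timer_create(CLOCK_MONOTONIC, 0, &timer);
	timer_settime(timer, TIMER_ABSTIME, &its, 0);

	printf("%d,%ld\n", i, its.it_interval.tv_nsec);

	while (1)
		pause();
	return 0;
}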
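
[Editor's sketch] The timing that lt.c asks for can be worked out from its constants. The helper names below are made up for illustration; only NSEC, PERIOD and the "- 50" come from the program above.

```c
#define NSEC   1000000000L
#define PERIOD (NSEC / 1000)	/* 1 ms in nanoseconds, as in lt.c */

/* Timer interval of each process when started as "lt cnt", in ns. */
long lt_interval_ns(int cnt)
{
    return PERIOD * cnt;
}

/* Busy-wait length per timer expiry, in microseconds. */
long lt_busy_us(void)
{
    return PERIOD / 1000 - 50;
}

/* CPU share requested by one lt process, in tenths of a percent. */
long lt_share_permille(int cnt)
{
    return lt_busy_us() * 1000 * 1000 / lt_interval_ns(cnt);
}
```

For "lt 3" each process spins for 950 us out of every 3 ms, so it asks for roughly 31.6% of a CPU, and the three processes together ask for about 95%. That is close to the 32.1/32.1/31.7/3.3 split in the 'top' output quoted elsewhere in this thread.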

-

To: Roman Zippel <zippel@...>
Cc: Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Wednesday, August 1, 2007 - 11:07 am

thanks. Just to make sure, while you said that your TSC was off on that
laptop, the bootup log of yours suggests a working TSC:

Time: tsc clocksource has been installed.

and still your fl.c testcase produces the top output that you've
reported in your first mail? If so then this could be a regression. Or
did you turn off the tsc manually via notsc? (or was it with a different
.config or on a different machine)? Please help us figure this out
exactly, we dont want a real regression go unnoticed.

If you can reproduce that problem with a working TSC then please
generate a second cfs-debug-info.sh snapshot _while_ your fl+l workload
is running and send that to me (i'll reply back to it publicly). Thanks,

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: Roman Zippel <zippel@...>, Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Wednesday, August 1, 2007 - 1:10 pm

Standard kernels often disable the TSC later after running a bit
with it (e.g. on any cpufreq change without p state invariant TSC)

-Andi
-

To: Andi Kleen <andi@...>
Cc: Ingo Molnar <mingo@...>, Roman Zippel <zippel@...>, Mike Galbraith <efault@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Wednesday, August 1, 2007 - 12:27 pm

I assume that what Roman hit was that he had explicitly disabled the TSC
because of TSC instability with the "notsc" kernel command line. Which
disables it *entirely*.

That *used* to be the right thing to do, since the gettimeofday() logic
originally didn't know about TSC instability, and it just resulted in
somewhat flaky timekeeping.

These days, of course, we should notice it on our own, and just switch
away from the TSC as a reliable clock-source, but still allow it to be
used for the cases where absolute accuracy is not a big issue.

So I suspect that Roman - by virtue of being an old-timer - ends up having
a workaround for an old problem that isn't needed, and that in turn ends
up meaning that his scheduler clock also ends up using the really not very
good timer tick..

Linus
-

To: Linus Torvalds <torvalds@...>
Cc: Andi Kleen <andi@...>, Roman Zippel <zippel@...>, Mike Galbraith <efault@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Wednesday, August 1, 2007 - 1:50 pm

but that does not appear to be the case, the debug info i got from Roman
includes the following boot options:

Kernel command line: auto BOOT_IMAGE=2.6.23-rc1-git9 ro root=306

there's no "notsc" option there.

Andi's theory cannot be true either, Roman's debug info also shows this
/proc/<PID>/sched data:

clock-delta : 95

that means that sched_clock() is in high-res mode, the TSC is alive and
kicking and a sched_clock() call took 95 nanoseconds.

Roman, could you please help us with this mystery?

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Mike Galbraith <efault@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Wednesday, August 1, 2007 - 2:01 pm

Hi,

Actually, Andi is right. What I sent you was generated directly after
boot, as I had to reboot for the right kernel, so a little later this
appeared:

Aug 1 14:54:30 spit kernel: eth0: link up, 100Mbps, full-duplex, lpa 0x45E1
Aug 1 15:09:56 spit kernel: Clocksource tsc unstable (delta = 656747233 ns)
Aug 1 15:09:56 spit kernel: Time: pit clocksource has been installed.

bye, Roman
-

To: Roman Zippel <zippel@...>
Cc: Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Mike Galbraith <efault@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Wednesday, August 1, 2007 - 3:05 pm

just to make sure, what does the 'top' output of the l + "lt 3" testcase
look like now on your laptop? Yesterday it was this:

4544 roman 20 0 1796 520 432 S 32.1 0.4 0:21.08 lt
4545 roman 20 0 1796 344 256 R 32.1 0.3 0:21.07 lt
4546 roman 20 0 1796 344 256 R 31.7 0.3 0:21.07 lt
4547 roman 20 0 1532 272 216 R 3.3 0.2 0:01.94 l

and i'm still wondering how that output was possible.

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Mike Galbraith <efault@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Thursday, August 9, 2007 - 7:14 pm

Hi,

I disabled the jiffies logic and the result is still the same, so this
problem isn't related to resolution at all.
I traced it a little and what's happening is that the busy loop really only
gets little time, it only runs in between the timer tasks. When the timer
task is woken up __enqueue_sleeper() updates sleeper_bonus and a little
later when the busy loop is preempted __update_curr() is called a last
time and it's fully hit by the sleeper_bonus. So the timer tasks use less
time than they actually get and thus produce overflows, the busy loop OTOH
is punished and underflows.
So it seems my initial suspicion was right and this logic is dodgy, what
is it actually supposed to do? Why is some random task accounted with the
sleeper_bonus?

bye, Roman

PS: Can I still expect answer about all the other stuff?
-

To: Roman Zippel <zippel@...>
Cc: Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Friday, August 10, 2007 - 3:23 am

I still can't reproduce this here. Can you please send your .config, so
I can try again with a config as close to yours as possible?

-Mike

-

To: Roman Zippel <zippel@...>
Cc: Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Mike Galbraith <efault@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Friday, August 10, 2007 - 1:49 am

how did you disable the jiffies logic? Also, could you please send me
the cfs-debug-info.sh:

http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh

captured _while_ the above workload is running. This is the third time
i've asked for that :-)

to establish that the basic sched_clock() behavior is sound on that box,
could you please also run this tool:

http://people.redhat.com/mingo/cfs-scheduler/tools/tsc-dump.c

please run it both while the system is idle, and while there's a CPU hog
running:

while :; do :; done &

and send me that output too? (it's 2x 60 lines only) Thanks!

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Mike Galbraith <efault@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Friday, August 10, 2007 - 9:52 am

Hi,

Is there any reason to believe my analysis is wrong?
So far you haven't answered a single question about the CFS design...

Anyway, I give you something better - the raw trace data for 2ms:

1186747669.274790012: update_curr 0xc7fb06f0,479587,319708,21288884188,159880,7360532
1186747669.274790375: dequeue_entity 0xc7fb06f0,21280402988,159880
1186747669.274792580: sched 2848,2846,0xc7432cb0,-7520413
1186747669.274820987: update_curr 0xc7432ce0,29302,-130577,21288913490,1,-7680293
1186747669.274821269: dequeue_entity 0xc7432ce0,21296077409,1
1186747669.274821930: enqueue_entity 0xc7432ce0,21296593783,1
1186747669.274826979: update_curr 0xc7432ce0,5707,5707,21288919197,1,-7680294
1186747669.274827724: enqueue_entity 0xc7432180,21280919197,639451
1186747669.274829948: update_curr 0xc7432ce0,1553,-318172,21288920750,319726,-8000000
1186747669.274831878: sched 2846,2847,0xc7432150,8000000
1186747669.275789883: update_curr 0xc7432180,479797,319935,21289400547,159864,7360339
1186747669.275790295: dequeue_entity 0xc7432180,21280919197,159864
1186747669.275792439: sched 2847,2846,0xc7432cb0,-7520203
1186747669.275820819: update_curr 0xc7432ce0,29238,-130625,21289429785,1,-7680067
1186747669.275821109: dequeue_entity 0xc7432ce0,21296593783,1
1186747669.275821763: enqueue_entity 0xc7432ce0,21297109852,1
1186747669.275826887: update_curr 0xc7432ce0,5772,5772,21289435557,1,-7680068
1186747669.275827652: enqueue_entity 0xc7fb0ca0,21281435557,639881
1186747669.275829826: update_curr 0xc7432ce0,1549,-318391,21289437106,319941,-8000000
1186747669.275831584: sched 2846,2849,0xc7fb0c70,8000000

About the values:

update_curr: sched_entity, delta_fair, delta_mine, fair_clock, sleeper_bonus, wait_runtime
(final values at the end of __update_curr)
{en,de}queue_entity: sched_entity, fair_key, sleeper_bonus
(at the start of __enqueue_entity/__dequeue_entity)
sched: prev_pid,pid,current,wait_runtime
(at the end of scheduling, note that current has a small structure
offset to sched_entity)
...
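
[Editor's sketch] One way to read the trace above is to compare the sleeper_bonus logged by an enqueue_entity line with the value logged by the update_curr line that follows it. The check below assumes that nothing else modified sleeper_bonus between those two events and that delta_mine = delta_fair - deduction; both are assumptions made for this example, not statements about the CFS source. The struct and function names are hypothetical.

```c
#include <stdint.h>

/* Values copied from an enqueue_entity line and the next update_curr line. */
struct bonus_step {
    int64_t bonus_before;	/* sleeper_bonus in enqueue_entity */
    int64_t bonus_after;	/* sleeper_bonus in the next update_curr */
    int64_t delta_fair;		/* from that update_curr line */
    int64_t delta_mine;		/* from that update_curr line */
};

/* The two such pairs visible in the 2 ms of trace data. */
const struct bonus_step step_a = { 639451, 319726, 1553, -318172 };
const struct bonus_step step_b = { 639881, 319941, 1549, -318391 };

/* Amount removed from sleeper_bonus between the two events. */
int64_t bonus_deducted(struct bonus_step s)
{
    return s.bonus_before - s.bonus_after;
}

/* delta_mine predicted under the assumption stated above. */
int64_t predicted_delta_mine(struct bonus_step s)
{
    return s.delta_fair - bonus_deducted(s);
}
```

For both pairs the predicted value equals the logged delta_mine: the busy loop is credited with about 1.5 us of fair time but has roughly 320 us of sleeper bonus deducted in the same update, about half of the outstanding bonus. This only shows that the numbers are consistent with the description of the busy loop being "fully hit by the sleeper_bonus"; it does not by itself establish the cause.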

To: Roman Zippel <zippel@...>
Cc: Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Mike Galbraith <efault@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Friday, August 10, 2007 - 12:54 pm

Not yet, but if you give Ingo what he wants (as opposed to what you're
giving him) it'll be easier for him to answer what's going wrong, and
perhaps "fix" the problem to boot.

(The script gives info about CPU characteristics, interrupts,
modules, etc. -- you know, all those "unknown" variables.)

And perhaps a patch to show what parts you commented out, too, so one
can tell if anything got broken (unintentionally).

--
Michael Chang

Please avoid sending me Word or PowerPoint attachments. Send me ODT,
RTF, or HTML instead.
See http://www.gnu.org/philosophy/no-word-attachments.html
Thank you.
-

To: Michael Chang <thenewme91@...>
Cc: Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Mike Galbraith <efault@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Friday, August 10, 2007 - 1:25 pm

Hi,

He already has most of this information and the trace shows _exactly_
what's going on. All this information should be more than enough to allow
an initial judgement whether my analysis is correct.
Also none of this information is needed to explain the CFS logic a little
more, which I'm still waiting for...

bye, Roman
-

To: Roman Zippel <zippel@...>
Cc: Michael Chang <thenewme91@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Mike Galbraith <efault@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Friday, August 10, 2007 - 3:47 pm

Roman,

fortunately all bug reporters are not like you. It's amazing how long
you can resist sending a simple bug report to a developer! Maybe you
consider that you need to fix the bug by yourself after you understand
the code, but if you systematically refuse to return the small information
Ingo asks you, we will have to wait for some more cooperative users to be
hit by the same bug when 2.6.23 is released, which is stupid.

I thought you could at least understand that one developer who is used
to reading traces from the same tool every day will be far faster at decoding
a trace from the same tool than trying to figure out what your self-made
dump means.

It's the exact same reason I ask for pcap files when people send me
outputs of tcpdumps without the information I *need*.

If you definitely do not want to cooperate, stop asking for a personal
explanation, and go figure by yourself how the code works. BTW, in the
trace you "kindly offered" in exchange for the cfs-debug-info dump,
you show several useful variables, but nothing says where they are
captured. And as you can see, they're changing. That's a fantastic
trace for a developer, really...

Please try to be a little bit more transparent if you really want the
bugs fixed, and don't behave as if you wanted this bug to survive
till -final.

Thanks,
Willy

-

To: Willy Tarreau <w@...>
Cc: Michael Chang <thenewme91@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Mike Galbraith <efault@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Friday, August 10, 2007 - 5:15 pm

Hi,

I'm more amazed how long Ingo can resist providing some explanations (not
just about this problem).
It's not like I haven't given him anything, he already has the test
programs, he already knows the system configuration.

Could you please ask Ingo the same? I'm simply trying to get some
transparency into the CFS design. Without further information it's
difficult to tell, whether something is supposed to work this way or it's
a bug.

In this case it's quite possible that due to a recent change my testcase
doesn't work anymore. Should I consider the problem fixed or did it just
go into hiding? Without more information it's difficult to verify this
independently.

bye, Roman
-

To: Roman Zippel <zippel@...>
Cc: Michael Chang <thenewme91@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Mike Galbraith <efault@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Saturday, August 11, 2007 - 1:15 am

It's a matter of time balance. It takes a short time to send the output
of a script, and it takes a very long time to explain how things work.
I often encounter the same situation with haproxy. People ask me to
explain to them in detail how this or that would apply to their context, and
it's often easier for me to provide them with a 5-lines patch to add the
feature they need, than to spend half an hour explaining why and how it

I know that Ingo tends to reply to a question with another question. But
as I said, imagine if he has to explain the same things to each person
who asks him for it. I think that a more constructive approach would be
to point out what is missing/unclear/inexact in the doc so that he adds some
paragraphs for you and everyone else. If you need this information to debug,

generally, problems that appear only on one person's side and which suddenly
disappear are either caused by some random buggy patch left in the tree (not
your case it seems), or by an obscure bug of the feature being tested which
will resurface from time to time as long as it's not identified.

Willy

-

To: Roman Zippel <zippel@...>
Cc: Willy Tarreau <w@...>, Michael Chang <thenewme91@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Mike Galbraith <efault@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Friday, August 10, 2007 - 5:36 pm

one more small thing: could you please send your exact .config (Mike
asked for that too, and i too on two prior occasions). Sometimes
unexpected little details in the .config make a difference, we are not
asking you that because we are second-guessing you in any way, the
reason is simple: i frequently boot _the very .config that others use_,
and see surprising reproducibility of bugs that i couldnt trigger
before. It's standard procedure to just pick up the .config of others to
eliminate a whole bunch of degrees of freedom for a bug to hide behind -
and your "it's a pretty standard config" description doesnt really
achieve that. It probably wont make a real difference, but it's really
easy for you to send and it's still very useful when one tries to
eliminate possibilities and when one wants to concentrate on the
remaining possibilities alone. Thanks again,

Ingo
-

To: Roman Zippel <zippel@...>
Cc: Willy Tarreau <w@...>, Michael Chang <thenewme91@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Mike Galbraith <efault@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Friday, August 10, 2007 - 8:30 pm

everything looks good in your debug output and the TSC dump data, except
for the wait_runtime values, they are quite out of balance - and that
balance cannot be explained with jiffies granularity or with any sort of
sched_clock() artifact. So this clearly looks like a CFS regression that
should be fixed.

the only relevant thing that comes to mind at the moment is that last
week Peter noticed a buggy aspect of sleeper bonuses (in that we do not
rate-limit their output, hence we 'waste' them instead of redistributing
them), and i've got the small patch below in my queue to fix that -
could you give it a try?

this is just a blind stab into the dark - i couldnt see any real impact
from that patch in various workloads (and it's not upstream yet), so it
might not make a big difference. The trace you did (could you send the
source for that?) seems to implicate sleeper bonuses though.

if this patch doesnt help, could you check the general theory whether
it's related to sleeper-fairness, via turning it off:

echo 30 > /proc/sys/kernel/sched_features

does the bug go away if you do that? If sleeper bonuses are showing too
many artifacts then we could turn it off for final .23.

Ingo

--------------------->
Subject: sched: fix sleeper bonus
From: Ingo Molnar <mingo@elte.hu>

Peter Zijlstra noticed that the sleeper bonus deduction code was not
properly rate-limited: a task that scheduled more frequently would get a
disproportionately large deduction. So limit the deduction to delta_exec
and limit production to runtime_limit.

Not-Yet-Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
kernel/sched_fair.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)

Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -75,7 +75,7 @@ enum {

unsigned int sysctl_sched_features __read_mostly =
SCHED_FEAT_FAIR_SLEEP...

To: Ingo Molnar <mingo@...>
Cc: Willy Tarreau <w@...>, Michael Chang <thenewme91@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Mike Galbraith <efault@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Monday, August 20, 2007 - 6:19 pm

Hi,

It doesn't make much of a difference. OTOH if I disabled the sleeper code
completely in __update_curr(), I get this:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 3139 roman     20   0  1796  344  256 R 21.7  0.3   0:02.68 lt
 3138 roman     20   0  1796  344  256 R 21.7  0.3   0:02.68 lt
 3137 roman     20   0  1796  520  432 R 21.7  0.4   0:02.68 lt
 3136 roman     20   0  1532  268  216 R 34.5  0.2   0:06.82 l

Disabling this code completely via sched_features makes only a minor
difference:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 3139 roman     20   0  1796  344  256 R 20.4  0.3   0:09.94 lt
 3138 roman     20   0  1796  344  256 R 20.4  0.3   0:09.94 lt
 3137 roman     20   0  1796  520  432 R 20.4  0.4   0:09.94 lt

Can we please skip to the point, where you try to explain the intention a
little more?
If I had to guess, this is supposed to keep the runtime balance. In that
case it would be better to use wait_runtime to adjust fair_clock, from where
it would be evenly distributed to all tasks (but this would have to be done
during enqueue and dequeue). OTOH this would then also have a consequence for
the wait queue, as fair_clock is used to calculate fair_key.
IMHO the current wait_runtime should have some influence in calculating the
sleep bonus, so that wait_runtime doesn't constantly overflow for tasks
which only run occasionally.

bye, Roman
-

To: Roman Zippel <zippel@...>
Cc: Ingo Molnar <mingo@...>, Willy Tarreau <w@...>, Michael Chang <thenewme91@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Tuesday, August 21, 2007 - 3:33 am

I thought this was history. With your config, I was finally able to
reproduce the anomaly (only with your proggy though), and Ingo's patch
does indeed fix it here.

Freshly reproduced anomaly and patch verification, running 2.6.23-rc3
with your config, both with and without Ingo's patch reverted:

6561 root 20 0 1696 492 404 S 32.0 0.0 0:30.83 0 lt
6562 root 20 0 1696 336 248 R 32.0 0.0 0:30.79 0 lt
6563 root 20 0 1696 336 248 R 32.0 0.0 0:30.80 0 lt
6564 root 20 0 2888 1236 1028 R 4.6 0.1 0:05.26 0 sh

6507 root 20 0 2888 1236 1028 R 25.8 0.1 0:30.75 0 sh
6504 root 20 0 1696 492 404 R 24.4 0.0 0:29.26 0 lt
6505 root 20 0 1696 336 248 R 24.4 0.0 0:29.26 0 lt
6506 root 20 0 1696 336 248 R 24.4 0.0 0:29.25 0 lt

-Mike

-

To: Mike Galbraith <efault@...>
Cc: Ingo Molnar <mingo@...>, Willy Tarreau <w@...>, Michael Chang <thenewme91@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Tuesday, August 21, 2007 - 7:54 am

Hi,

I did update to 2.6.23-rc3-git1 first, but I ended up reverting the patch,
as I didn't notice it had been applied already. Sorry about that.
With this patch the underflows are gone, but there are still the
overflows, so the questions from the last mail still remain.

bye, Roman
-

To: Mike Galbraith <efault@...>
Cc: Roman Zippel <zippel@...>, Willy Tarreau <w@...>, Michael Chang <thenewme91@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Tuesday, August 21, 2007 - 4:35 am

oh, great! I'm glad we didn't discard this as a pure sched_clock
resolution artifact.

Roman, a quick & easy request: please send the usual cfs-debug-info.sh
output captured while your testcase is running. (Preferably try .23-rc3
or later as Mike did, which has the most recent scheduler code, it
includes the patch i sent to you already.) I'll reply to your
sleeper-fairness questions separately, but in any case we need to figure
out what's happening on your box - if you can still reproduce it with
.23-rc3. Thanks,

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: Willy Tarreau <w@...>, Michael Chang <thenewme91@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Mike Galbraith <efault@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Friday, August 10, 2007 - 6:50 pm

Hi,

The thing I'm afraid of with CFS is its possible unpredictability, which
would make it hard to reproduce problems, and we may end up with users with
unexplainable weird problems. That's the main reason I'm trying so hard to
push for a design discussion.

Just to give an idea here are two more examples of irregular behaviour,
which are hopefully easier to reproduce.

1. Two simple busy loops, one of them is reniced to 15, according to my
calculations the reniced task should get about 3.4% (1/(1.25^15+1)), but I
get this:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4433 roman 20 0 1532 300 244 R 99.2 0.2 5:05.51 l
4434 roman 35 15 1532 72 16 R 0.7 0.1 0:10.62 l

OTOH up to nice level 12 I get what I expect.
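
The expected figure quoted above can be checked with a minimal sketch. It assumes only what the mail states: one nice-0 busy loop racing one reniced busy loop, with a weight multiplier of 1.25 per nice level. The function name `expected_share` is made up for illustration.

```python
def expected_share(nice_delta, multiplier=1.25):
    """CPU share of a reniced busy loop racing a single nice-0 busy loop.

    Assumes the nice-0 task weighs multiplier**nice_delta times as much
    as the reniced one, as in the 1/(1.25^15+1) formula from the mail.
    """
    return 1.0 / (multiplier ** nice_delta + 1.0)


for nice in (0, 12, 15):
    print(f"nice +{nice}: {expected_share(nice) * 100:.1f}% expected")
```

For nice +15 this gives roughly the 3.4% mentioned in the mail, against the 0.7% that top reports.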

2. If I start 20 busy loops, initially I see in top that every task gets
5% and time increments equally (as it should):

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4492 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
4491 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
4490 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
4489 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
4488 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
4487 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
4486 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
4485 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
4484 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
4483 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
4482 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
4481 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
4480 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
4479 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
4478 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
4477 roman 20 0 1532 68 16 R 5.0 0.1 0:02.86 l
4476 roman 20 0 1532 ...

To: Roman Zippel <zippel@...>
Cc: Ingo Molnar <mingo@...>, Michael Chang <thenewme91@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Mike Galbraith <efault@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Saturday, August 11, 2007 - 1:28 am

You may be interested by looking at the very early CFS versions. The design
was much more naive and understandable. After that, a lot of tricks have
been added to take into account a lot of uses and corner cases, which may

Do you see this only at -15, or starting with -15 and below ?

Willy

-

To: Willy Tarreau <w@...>
Cc: Roman Zippel <zippel@...>, Michael Chang <thenewme91@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Mike Galbraith <efault@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Sunday, August 12, 2007 - 1:17 am

note that the typo was not in the weight table but in the inverse weight
table which didn't really affect CPU utilization (that's why we didn't
notice the typo sooner). Regarding the above problem with nice +15 being
beefier than intended i'd suggest to re-test with a doubled
/proc/sys/kernel/sched_runtime_limit value, or with:

echo 30 > /proc/sys/kernel/sched_features

i think this was scheduling jitter caused by the larger granularity of
negatively reniced tasks. This got improved recently, with latest -git i
get:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3108 root 5 -15 1576 248 196 R 5.0 0.0 0:07.26 loop_silent
3109 root 5 -15 1576 248 196 R 5.0 0.0 0:07.26 loop_silent
3110 root 5 -15 1576 248 196 R 5.0 0.0 0:07.26 loop_silent
3111 root 5 -15 1576 244 196 R 5.0 0.0 0:07.26 loop_silent
3112 root 5 -15 1576 248 196 R 5.0 0.0 0:07.26 loop_silent
3113 root 5 -15 1576 248 196 R 5.0 0.0 0:07.26 loop_silent

that's picture-perfect CPU time distribution. But, and that's fair to
say, i never ran such an artificial workload of 20x nice -15 infinite
loops (!) before, and boy does interactivity suck (as expected) ;)

Ingo
-

To: Roman Zippel <zippel@...>
Cc: Michael Chang <thenewme91@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Mike Galbraith <efault@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Friday, August 10, 2007 - 3:44 pm

i'll need the other bits of information too to have a complete picture
about what's going on while your test is running - to maximize the
chances of me being able to fix it. I'm a bit perplexed (and a bit
worried) about this - you've spent _far_ more effort to _not send_ that
script output (captured while the workload is running) than it would
have taken to do it :-/ If you'd like me to fix bugs then please just
send it (in private mail if you want) - or give me an ssh login to that
box - whichever variant you prefer. Thanks,

Ingo
-

To: Roman Zippel <zippel@...>
Cc: Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Friday, August 10, 2007 - 12:47 pm

I guess I'm going to have to give up on trying to reproduce this... my
3GHz P4 is just not getting there from here. Last attempt, compiled UP,
HZ=1000 dynticks, full preempt and highres timers fwiw.

6392 root 20 0 1696 332 248 R 25.5 0.0 3:00.14 0 lt
6393 root 20 0 1696 332 248 R 24.9 0.0 3:00.15 0 lt
6391 root 20 0 1696 488 404 R 24.7 0.0 3:00.20 0 lt
6394 root 20 0 2888 1232 1028 R 24.5 0.1 2:58.58 0 sh

-Mike

-

To: Mike Galbraith <efault@...>
Cc: Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Friday, August 10, 2007 - 1:19 pm

Hi,

Except for UP and HZ=1000, everything else is pretty much turned off.
If you use a very recent kernel, the problem may not be visible like this
anymore.
It may be a bit easier to reproduce, if you change the end time t0 in lt.c
a little. Also try to start the busy loop first.

bye, Roman
-

To: Roman Zippel <zippel@...>
Cc: Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Mike Galbraith <efault@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Friday, August 10, 2007 - 10:18 am

please first give me the debug data captured with the script above
(while the workload is running) - so that i can see the full picture
about what's happening. Thanks,

Ingo
-

To: Linus Torvalds <torvalds@...>
Cc: Andi Kleen <andi@...>, Ingo Molnar <mingo@...>, Roman Zippel <zippel@...>, Mike Galbraith <efault@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Wednesday, August 1, 2007 - 1:48 pm

It might just have been cpufreq. That nearly hits everybody with cpufreq
unless you have a pstate invariant TSC; and that's pretty much
always the case on older laptops.

It used to not be that drastic, but since i386 switched to the

The rewritten sched_clock() i still have queued does just that. I planned
to submit it for .23, but then during later in-depth testing
on my machine park I found a show stopper that I couldn't fix in time.
Hopefully for .24

-Andi
-

To: Ingo Molnar <mingo@...>
Cc: Roman Zippel <zippel@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Wednesday, August 1, 2007 - 3:26 am

I haven't been able to reproduce this with any combination of features,
and massive_intr tweaked to his work/sleep cycle. I notice he's
collecting stats though, and they look funky. Recompiling.

-Mike

-

To: Mike Galbraith <efault@...>
Cc: Roman Zippel <zippel@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Wednesday, August 1, 2007 - 3:30 am

yeah, the posted numbers look most weird, but there's a complete lack of
any identification of test environment - so we'll need some more word
from Roman. Perhaps this was run on some really old box that does not
have a high-accuracy sched_clock()? The patch below should simulate that
scenario on 32-bit x86.

Ingo

Index: linux/arch/i386/kernel/tsc.c
===================================================================
--- linux.orig/arch/i386/kernel/tsc.c
+++ linux/arch/i386/kernel/tsc.c
@@ -110,7 +110,7 @@ unsigned long long native_sched_clock(vo
* very important for it to be as fast as the platform
* can achive it. )
*/
- if (unlikely(!tsc_enabled && !tsc_unstable))
+// if (unlikely(!tsc_enabled && !tsc_unstable))
/* No locking but a rare wrong value is not a big deal: */
return (jiffies_64 - INITIAL_JIFFIES) * (1000000000 / HZ);
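
What the forced fallback does to runtime accounting can be modelled with a rough sketch. It is not kernel code: it assumes INITIAL_JIFFIES is zero and that jiffies advance exactly every 1/HZ seconds, and the helper names are invented for illustration.

```python
NSEC_PER_SEC = 1_000_000_000


def jiffies_sched_clock(now_ns, hz):
    """Model of the fallback: the clock only advances in whole ticks."""
    tick_ns = NSEC_PER_SEC // hz
    jiffies = now_ns // tick_ns          # assumed: INITIAL_JIFFIES == 0
    return jiffies * tick_ns


def measured_runtime(start_ns, stop_ns, hz):
    """Runtime a scheduler would account if it only saw the model clock."""
    return jiffies_sched_clock(stop_ns, hz) - jiffies_sched_clock(start_ns, hz)


# With HZ=1000 a 0.8 ms burst inside one tick is invisible, while a
# 0.2 ms burst that straddles a tick boundary is charged a full 1 ms.
print(measured_runtime(100_000, 900_000, 1000))
print(measured_runtime(900_000, 1_100_000, 1000))
```

Under these assumptions, tasks whose bursts are close to the tick length can be over- or under-charged depending on where the bursts fall relative to tick boundaries.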

-

To: Ingo Molnar <mingo@...>
Cc: Roman Zippel <zippel@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Wednesday, August 1, 2007 - 3:36 am

Ah, thanks. I noticed that clocksource= went away. I'll test with
stats, with and without jiffies resolution.

-Mike

-

To: Ingo Molnar <mingo@...>
Cc: Roman Zippel <zippel@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Wednesday, August 1, 2007 - 4:49 am

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
6465 root 20 0 1432 356 296 R 30 0.0 1:02.55 1 chew
6462 root 20 0 1576 216 140 R 23 0.0 0:50.29 1 massive_intr_x
6463 root 20 0 1576 216 140 R 23 0.0 0:50.23 1 massive_intr_x
6464 root 20 0 1576 216 140 R 23 0.0 0:50.28 1 massive_intr_x

Well, the jiffies resolution clock did upset fairness a bit with a burn
time right at jiffies resolution, but not nearly as badly as on Roman's box,
and not in favor of the sleepers. With the longer burn time of stock
massive_intr.c (8ms burn, 1ms sleep), the lower resolution clock didn't
upset it.

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
6511 root 20 0 1572 220 140 R 25 0.0 1:00.11 1 massive_intr
6512 root 20 0 1572 220 140 R 25 0.0 1:00.14 1 massive_intr
6514 root 20 0 1432 356 296 R 25 0.0 1:00.31 1 chew
6513 root 20 0 1572 220 140 R 24 0.0 1:00.14 1 massive_intr

-Mike

-

To: Roman Zippel <zippel@...>
Cc: Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Wednesday, August 1, 2007 - 10:49 am

Hi Roman,

Took me most of today trying to figure out WTH you did in fs2.c, more
math and fundamental explanations would have been good. So please bear
with me as I try to recap this thing. (No, your code was very much _not_
obvious, a few comments and broken out functions would have made a world
of a difference)

So, for each task we keep normalised time

normalised time := time/weight

using Bresenham's algorithm we can do this perfectly (up until a renice
- where you'd get errors)

avg_frac += weight_inv

weight_inv = X / weight

avg = avg_frac / weight0_inv

weight0_inv = X / weight0

avg = avg_frac / (X / weight0)
= (X / weight) / (X / weight0)
= X / weight * weight0 / X
= weight0 / weight

So avg ends up being in units of [weight0/weight].

Then, in order to allow sleeping, we need to have a global clock to sync
with. It's this global clock that gave me headaches to reconstruct.

We're looking for a time like this:

rq_time := sum(time)/sum(weight)

And you commented that the /sum(weight) part is where CFS obtained its
accumulating rounding error? (I'm inclined to believe the error will
statistically be 0, but I'll readily accept otherwise if you can show a
practical 'exploit')

It's not obvious how to do this using modulo logic like Bresenham because
that would involve using a gcm of all possible weights.

What you ended up with is quite interesting if correct.

sum_avg_frac += weight_inv_{i}

however by virtue of the scheduler minimising:

avg_{i} - avg_{j} | i != j

this gets a factor of:

weight_{i}/sum_{j}^{N}(weight_{j})

( seems correct, needs more analysis though, this is very much a
statistical step based on the previous constraint. this might
very well introduce some errors )

resulting in:

sum_avg_frac += sum_{i}^{N}(weight_inv_{i} *
weight_{i}/sum_{j}^{N}(weight_{j}))

...

To: Peter Zijlstra <peterz@...>
Cc: Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Thursday, August 2, 2007 - 1:36 pm

Hi,

Thanks for the effort though. :)
I know I'm not the best explaining these things, so I really appreciate

I think I've sent you off into the wrong direction somehow. Sorry. :)

Let's ignore the average for a second, normalized time is maintained as:

normalized time := time * (2^16 / weight)

The important point is that I keep the value in full resolution of 2^-16
vsec units (vsec for virtual second or sec/weight, where every task gets
weight seconds for every virtual second; to keep things simpler I also
omit the nano prefix from the units for a moment). Compared to that, CFS
maintains a global normalized value in 1 vsec units.
Since I don't round the value down I avoid the accumulating error, this
means that

time_norm += time_delta1 * (2^16 / weight)
time_norm += time_delta2 * (2^16 / weight)

is the same as

time_norm += (time_delta1 + time_delta2) * (2^16 / weight)

CFS for example does this

delta_mine = calc_delta_mine(delta_exec, curr->load.weight, lw);

in above terms this means

time = time_delta * weight * (2^16 / weight_sum) / 2^16

The last shift now rounds the value down and if one does that 1000 times
per second, the resolution of the value that is finally accounted to
wait_runtime is also reduced appropriately.
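
The effect of that per-update shift can be illustrated with a small sketch. It follows the simplified formula above (2^16 fixed point, a precomputed truncated inverse of the weight sum) and is not the kernel's actual calc_delta_mine(); the weights and the 1000 ns slice length are arbitrary example values.

```python
SHIFT = 16


def delta_mine(delta_exec, weight, weight_sum):
    """Simplified model: delta * weight * (2^16 / weight_sum) / 2^16."""
    inv_sum = (1 << SHIFT) // weight_sum      # truncated inverse
    return (delta_exec * weight * inv_sum) >> SHIFT   # truncating shift


def accumulate(deltas, weight, weight_sum):
    """Account each slice separately, rounding down every time."""
    return sum(delta_mine(d, weight, weight_sum) for d in deltas)


slices = [1000] * 1000                        # 1000 slices of 1000 ns
piecewise = accumulate(slices, 1024, 3072)
in_one_go = delta_mine(sum(slices), 1024, 3072)
print(piecewise, in_one_go, sum(slices) // 3)
```

With these example numbers the piecewise total comes out lower than the single computation, and both are below the exact one-third share, because the inverse itself is truncated as well.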

The other rounding problem stems from the fact that this term

x * prio_to_weight[i] * prio_to_wmult[i] / 2^32

doesn't produce x for most values in those tables (the same applies to the
weight sum), so if we have chains where the values are converted from one
scale to the other, a rounding error is produced. In CFS this happens now
because wait_runtime is maintained in nanoseconds and fair_clock is a
normalized value.
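
A quick way to see this is to push a value through the term for a few table entries. In the sketch below the multipliers are the ones visible in the prio_to_wmult hunk quoted elsewhere in this thread; the matching weights are an assumption, inferred as 1024 * 1.25^n so that 2^32 / weight truncates to the listed multiplier.

```python
def roundtrip(x, weight, wmult):
    """Evaluate x * weight * wmult / 2^32 in integer arithmetic."""
    return (x * weight * wmult) >> 32


# (weight, wmult) pairs; weights are inferred, not copied from the kernel.
PAIRS = [
    (1024, 4194304),   # nice 0: 2^32 / 1024 is exact
    (1280, 3355443),
    (1600, 2684354),
    (2000, 2147483),
    (2500, 1717986),
]

for weight, wmult in PAIRS:
    print(weight, wmult, roundtrip(1000, weight, wmult))
```

Only the nice-0 pair returns the input unchanged; for the others weight * wmult is slightly below 2^32, so the result is rounded down by one.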

The problem here isn't that these errors might have a statistical
relevance, as they are usually completely overshadowed by measurement
errors anyway. The problem is that these errors exist at all, this means
they have to be compensated somehow, so that they don't accumulate over
time and then bec...

To: Roman Zippel <zippel@...>
Cc: Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andrea Arcangeli <andrea@...>, Andi Kleen <andi@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Friday, July 13, 2007 - 3:43 pm

Roman Zippel noticed an inconsistency in the wmult table.

wmult[16] has a missing digit.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

diff --git a/kernel/sched.c b/kernel/sched.c
index 0559665..3332bbb 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -750,7 +750,7 @@ static const u32 prio_to_wmult[40] = {
48356, 60446, 75558, 94446, 118058, 147573,
184467, 230589, 288233, 360285, 450347,
562979, 703746, 879575, 1099582, 1374389,
- 717986, 2147483, 2684354, 3355443, 4194304,
+ 1717986, 2147483, 2684354, 3355443, 4194304,
5244160, 6557201, 8196502, 10250518, 12782640,
16025997, 19976592, 24970740, 31350126, 39045157,
49367440, 61356675, 76695844, 95443717, 119304647,

-

To: <linux-kernel@...>
Cc: Roman Zippel <zippel@...>, Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andrea Arcangeli <andrea@...>, Andi Kleen <andi@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Monday, July 16, 2007 - 2:18 am

[snip]

While we're at it, isn't the comment above the wmult table incorrect?
The multiplier is 1.25, meaning a 25% change per nice level, not 10%.

- Jim

-

To: James Bruce <bruce@...>
Cc: Thomas Gleixner <tglx@...>, Roman Zippel <zippel@...>, Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andrea Arcangeli <andrea@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Monday, July 16, 2007 - 3:06 am

yes, the weight multiplier is 1.25, but the actual difference in CPU
utilization, when running two CPU intense tasks, is ~10%:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8246 mingo 20 0 1576 244 196 R 55 0.0 0:11.96 loop
8247 mingo 21 1 1576 244 196 R 45 0.0 0:10.52 loop

so the first task 'wins' +10% CPU utilization (relative to the 50% it
had before), the second task 'loses' -10% CPU utilization (relative to
the 50% it had before).

so what the comment says is true:

* The "10% effect" is relative and cumulative: from _any_ nice level,
* if you go up 1 level, it's -10% CPU usage, if you go down 1 level
* it's +10% CPU usage.

for there to be a ~+10% change in CPU utilization for a task that races
against another CPU-intense task there needs to be a ~25% change in the
weight.

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: James Bruce <bruce@...>, Thomas Gleixner <tglx@...>, Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andrea Arcangeli <andrea@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Monday, July 16, 2007 - 6:18 am

Hi,

As soon as you add another loop the difference changes again, while it's
always correct to say it gets 25% more cpu time (which I still think is a
little too much).
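
Both views can be reproduced with a short sketch, assuming CPU time is split among busy loops strictly in proportion to their weights and that one nice level corresponds to a factor of 1.25. The helper name `cpu_shares` is hypothetical.

```python
def cpu_shares(weights):
    """Split CPU time among busy loops in proportion to their weights."""
    total = sum(weights)
    return [w / total for w in weights]


two = cpu_shares([1.25, 1.0])          # nice 0 vs. one nice +1 loop
three = cpu_shares([1.25, 1.0, 1.0])   # same, plus a second nice +1 loop

print([f"{s * 100:.1f}%" for s in two])
print([f"{s * 100:.1f}%" for s in three])
print(two[0] / two[1], three[0] / three[1])
```

The absolute percentages depend on how many loops are running, while the ratio between the nice-0 loop and a nice +1 loop stays at 1.25 in both cases.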

bye, Roman
-

To: Roman Zippel <zippel@...>
Cc: James Bruce <bruce@...>, Thomas Gleixner <tglx@...>, Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andrea Arcangeli <andrea@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Monday, July 16, 2007 - 7:20 am

yep, and i'll add the relative effect to the comment too.

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: James Bruce <bruce@...>, Thomas Gleixner <tglx@...>, Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andrea Arcangeli <andrea@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Monday, July 16, 2007 - 7:58 am

Hi,

Why did you cut off the rest of the sentence?
To illustrate the problem a little differently: a task with a nice level -20
got around 700% more cpu time (or 8 times more), now it gets 8500% more
cpu time (or 86.7 times more).
You don't think that change to the nice levels is a little drastic?

bye, Roman
-

To: Roman Zippel <zippel@...>
Cc: Ingo Molnar <mingo@...>, James Bruce <bruce@...>, Thomas Gleixner <tglx@...>, Mike Galbraith <efault@...>, Andrea Arcangeli <andrea@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Monday, July 16, 2007 - 1:47 pm

Ingo, that _does_ sound excessive.

How about trying a much less aggressive nice-level (and preferably linear,
not exponential)?

Linus
-

To: Linus Torvalds <torvalds@...>
Cc: Ingo Molnar <mingo@...>, James Bruce <bruce@...>, Thomas Gleixner <tglx@...>, Mike Galbraith <efault@...>, Andrea Arcangeli <andrea@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Monday, July 16, 2007 - 2:12 pm

Hi,

I think the exponential increase isn't the problem. The old code did
approximate something like this rather crudely with the result that there
was a big gap between level 0 and -1.

Something like this:

echo 'for (i=-20;i<=20;i++) print i, " : ", 1024*e(l(2)*(-i/20*3)), "\n";' | bc -l

would produce a range similar to the old code. Replacing the factor 3
with 4 would be IMO a more reasonable increase and would have the advantage
for the user that it's easier to understand that every 5 levels the time a
process gets is doubled.
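
For readers without bc at hand, the same curve can be sketched in Python. The function below is a direct transcription of the bc expression; the parameter name `doublings` (the factor 3 or 4) is an invented label.

```python
def weight(nice, doublings=3, base=1024):
    """Python equivalent of the bc term 1024*e(l(2)*(-i/20*doublings))."""
    return base * 2.0 ** (-nice / 20 * doublings)


for nice in range(-20, 21, 5):
    print(f"{nice:+3d} : factor 3 -> {weight(nice, 3):8.1f}"
          f"   factor 4 -> {weight(nice, 4):8.1f}")
```

With factor 3, nice -20 weighs 8 times as much as nice 0; with factor 4 the exponent reduces to -nice/5, so the weight doubles every 5 levels.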

bye, Roman
-

To: Linus Torvalds <torvalds@...>
Cc: Roman Zippel <zippel@...>, Ingo Molnar <mingo@...>, James Bruce <bruce@...>, Thomas Gleixner <tglx@...>, Mike Galbraith <efault@...>, Andrea Arcangeli <andrea@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Wednesday, July 18, 2007 - 6:27 am

I actually like the extra range, it allows for a much softer punch of
background tasks even on somewhat slower boxen.

I've been testing CFS on my 1200 MHz lappy for some time and a strongly
niced kbuild leaves a very usable system.

The old scheduler would leave the thing rather jumpy. And while CFS
fully fixes the jumpiness, I just did a nice +13 (which should be
equivalent to the old scheduler's nice +19 for my HZ) and did a nice +19
kbuild and I can definitely feel the difference between them.

Early CFS versions had a pretty aggressive nice range (0.1% for +19),
and that has been toned down based on feedback. The current levels seem
to work well, at least on my boxen.

- Peter

-

To: Peter Zijlstra <peterz@...>
Cc: Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, James Bruce <bruce@...>, Thomas Gleixner <tglx@...>, Mike Galbraith <efault@...>, Andrea Arcangeli <andrea@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Wednesday, July 18, 2007 - 8:45 am

Hi,

The extra range is not really a problem, in

http://www.ussg.iu.edu/hypermail/linux/kernel/0707.2/0850.html

I suggested how we can have both.

bye, Roman
-

To: Roman Zippel <zippel@...>
Cc: Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, James Bruce <bruce@...>, Thomas Gleixner <tglx@...>, Mike Galbraith <efault@...>, Andrea Arcangeli <andrea@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Wednesday, July 18, 2007 - 8:52 am

By breaking the UNIX model of nice levels. Not an option in my book.

-

To: Peter Zijlstra <peterz@...>
Cc: Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, James Bruce <bruce@...>, Thomas Gleixner <tglx@...>, Mike Galbraith <efault@...>, Andrea Arcangeli <andrea@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Wednesday, July 18, 2007 - 9:26 am

Hi,

BTW what is the "UNIX model of nice levels"?

SUS specifies the limit via NZERO, which is defined as "Minimum Acceptable
Value: 20", I can't find any information that it must be 20.

bye, Roman
-

To: Roman Zippel <zippel@...>
Cc: Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, James Bruce <bruce@...>, Thomas Gleixner <tglx@...>, Mike Galbraith <efault@...>, Andrea Arcangeli <andrea@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Wednesday, July 18, 2007 - 9:31 am

I have never encountered a UNIX where it is anything other than 20.
Convention (alas not specification) does dictate 20.

-

To: Peter Zijlstra <peterz@...>
Cc: Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, James Bruce <bruce@...>, Thomas Gleixner <tglx@...>, Mike Galbraith <efault@...>, Andrea Arcangeli <andrea@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Wednesday, July 18, 2007 - 9:07 am

Hi,

Breaking user expectations of nice levels is?

bye, Roman
-

To: Roman Zippel <zippel@...>
Cc: Peter Zijlstra <peterz@...>, Linus Torvalds <torvalds@...>, James Bruce <bruce@...>, Thomas Gleixner <tglx@...>, Mike Galbraith <efault@...>, Andrea Arcangeli <andrea@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Wednesday, July 18, 2007 - 9:48 am

_changing_ it is an option within reason, and we've done it a couple of
times already in the past, and even within CFS (as Peter correctly
observed) we've been through a couple of iterations already. And as i
mentioned it before, the outer edge of nice levels (+19, by far the most
commonly used nice level) was inconsistent to begin with: 3%, 5%, 9% of
nice-0, depending on HZ. So changing that to a consistent (and
user-requested) 1.5% is a much smaller change than you seem to make it
out to be. CFS itself is a far larger "change of expectations" than this
tweak to nice levels. So by your standard we could never change the
scheduler. (which your ultimate argument might be after all =B-)

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: Peter Zijlstra <peterz@...>, Linus Torvalds <torvalds@...>, James Bruce <bruce@...>, Thomas Gleixner <tglx@...>, Mike Galbraith <efault@...>, Andrea Arcangeli <andrea@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Wednesday, July 18, 2007 - 10:14 am

Hi,

Why do you constantly stress level 19? Yes, that one is special, all other

How old is CFS and how many users did it have so far? How many users has

The percentage levels are off by a factor of up to _seven_, sorry I fail

Careful, you make assertions about me for which you have absolutely no
basis; adding a smiley doesn't make this any funnier.

bye, Roman
-

To: Roman Zippel <zippel@...>
Cc: Peter Zijlstra <peterz@...>, Linus Torvalds <torvalds@...>, James Bruce <bruce@...>, Thomas Gleixner <tglx@...>, Mike Galbraith <efault@...>, Andrea Arcangeli <andrea@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Wednesday, July 18, 2007 - 12:02 pm

i constantly stress it for the reason i mentioned a good number of
times: because it's by far the most commonly used (and complained about)
nice level. =B-)

but because you are asking, i'm glad to give you some first-hand
historic background about Linux nice levels (in case you are interested)
and the motivations behind their old and new implementations:

nice levels were always so weak under Linux (just read Peter's report)
that people continuously bugged me about making nice +19 tasks use up
much less CPU time. Unfortunately that was not that easy to implement
(otherwise we'd have done it long ago) because nice level support was
historically coupled to timeslice length, and timeslice units were
driven by the HZ tick, so the smallest timeslice was 1/HZ.

In the O(1) scheduler (about 4 years ago) i changed negative nice levels
to be much stronger than they were before in 2.4 (and people were happy
about that change), and i also intentionally calibrated the linear
timeslice rule so that nice +19 level would be _exactly_ 1 jiffy. To
better understand it, the timeslice graph went like this (cheesy ASCII
art alert!):

                   A
             \     | [timeslice length]
              \    |
               \   |
                \  |
                 \ |
                  \|___100msecs
                   |^ . _
                   |      ^ . _
                   |            ^ . _
 -*----------------------------------*-----> [nice level]
 -20               |                +19
                   |
                   |

so that if someone wants to really renice tasks, +19 would give a much
bigger hit than the normal linear rule would do. (The solution of
changing the ABI to extend priorities was discarded early on.)

This approach worked to some degree for some time, but later on with
HZ=1000 it caused 1 jiffy to be 1 msec, which meant 0.1% CPU usage which
we felt to be a bit excessive. Excessive _not_ because it's too small of
a CPU utiliza...

To: Ingo Molnar <mingo@...>
Cc: Peter Zijlstra <peterz@...>, Linus Torvalds <torvalds@...>, James Bruce <bruce@...>, Thomas Gleixner <tglx@...>, Mike Galbraith <efault@...>, Andrea Arcangeli <andrea@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Friday, July 20, 2007 - 11:03 am

Hi,

I guess I should be thankful now?
I'm curious why you post this now, after I "asked" about this. Most of the
information is either rather generic or not specific enough for the
problem at hand. If you had posted this information earlier, it would have
been far more valuable, as it could have been a nice base for a discussion.
But posting it this late I can't lose the feeling you're more interested

Not completely.

For negative nice levels you mentioned audio apps, but these aren't really
interested in a fair share, they would use the higher percentage only to
guarantee they get the amount of time they need independent of the
current load. I think they would be better served with e.g. a deadline
scheduler, which guarantees them an absolute time share not a relative
one.
On the other end with positive levels I more remember requests for
something closer to idle scheduling, where a process only runs when
nothing else is running.

So assuming we had scheduling classes for the above use cases, what other
reasons are left for such extreme nice levels?

My proposed nice levels have otherwise the same properties as yours (e.g.
being consistent). There is one property you haven't commented on at all
yet. My proposed levels give the average user a far better idea of what they
actually mean, i.e. that every 5 levels the process gets double/half the
cpu time. This is IMO a considerable advantage.

bye, Roman
-

To: Roman Zippel <zippel@...>
Cc: Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, James Bruce <bruce@...>, Thomas Gleixner <tglx@...>, Mike Galbraith <efault@...>, Andrea Arcangeli <andrea@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Wednesday, July 18, 2007 - 9:27 am

http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap03.html

specifically:

"3.239 Nice Value

A number used as advice to the system to alter process scheduling.
Numerically smaller values give a process additional preference when
scheduling a process to run. Numerically larger values reduce the
preference and make a process less likely to run. Typically, a process
with a smaller nice value runs to completion more quickly than an
equivalent process with a higher nice value. The symbol {NZERO}
specifies the default nice value of the system."

The only expectation is that a process with a lower nice level gets more
time. Any other expectation is a bug.

-

To: Peter Zijlstra <peterz@...>
Cc: Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, James Bruce <bruce@...>, Thomas Gleixner <tglx@...>, Mike Galbraith <efault@...>, Andrea Arcangeli <andrea@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Wednesday, July 18, 2007 - 9:58 am

Hi,

Yes, users are buggy, they expect a lot of stupid things...
Is this really reason enough to break this?

What exactly is the damage if setpriority() accepts a few more levels?

bye, Roman
-

To: Peter Zijlstra <peterz@...>
Cc: Roman Zippel <zippel@...>, Linus Torvalds <torvalds@...>, James Bruce <bruce@...>, Thomas Gleixner <tglx@...>, Mike Galbraith <efault@...>, Andrea Arcangeli <andrea@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Wednesday, July 18, 2007 - 8:59 am

yeah, that's pretty much out of the question.

Ingo
-

To: Roman Zippel <zippel@...>
Cc: James Bruce <bruce@...>, Thomas Gleixner <tglx@...>, Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andrea Arcangeli <andrea@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Monday, July 16, 2007 - 8:12 am

(no need to become hostile, i answered to that portion of your sentence
separately, which was logically detached from the other portion of your

This was discussed on lkml in detail, see the CFS threads. It has been a
common request for nice levels to be more logical (i.e. to make them
universal and to detach them from HZ) and for them to be more effective
as well.

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: James Bruce <bruce@...>, Thomas Gleixner <tglx@...>, Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andrea Arcangeli <andrea@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Monday, July 16, 2007 - 8:42 am

Hi,

Could you please stop with these accusations?

Which are quite big, so I skipped most of it, a more precise pointer would

Huh? What has this to do with HZ? The scheduler used ticks internally, but
it's irrelevant to what the user sees via the nice levels.
So the question still stands that this change may be a little drastic, as
you changed the nice levels of _all_ users, not just of those who were
previously interested in CFS.

bye, Roman
-

To: Roman Zippel <zippel@...>
Cc: James Bruce <bruce@...>, Thomas Gleixner <tglx@...>, Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Monday, July 16, 2007 - 9:40 am

unfortunately you are wrong again - there are various HZ related
artifacts in the nice level support code of the old scheduler.

v2.6.22, CONFIG_HZ=100, nice +19 task against a nice-0 CPU-intense task:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2446 mingo 25 0 1576 244 196 R 90.9 0.0 0:32.79 loop
2448 mingo 39 19 1580 248 196 R 9.1 0.0 0:02.94 loop

v2.6.22, CONFIG_HZ=250, nice +19 task against a nice-0 CPU-intense task:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2358 mingo 25 0 1576 248 196 R 96.1 0.0 0:31.97 loop_silent
2363 mingo 39 19 1576 244 196 R 3.9 0.0 0:01.24 loop_silent

v2.6.22, CONFIG_HZ=300, nice +19 task against a nice-0 CPU-intense task:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2332 mingo 25 0 1580 248 196 R 95.1 0.0 0:11.84 loop_silent
2335 mingo 39 19 1576 244 196 R 3.1 0.0 0:00.39 loop_silent

to sum it up: a nice +19 task (the most commonly used nice level in
practice) gets 9.1%, 3.9%, 3.1% of CPU time on the old scheduler,
depending on the value of HZ. This is quite inconsistent and illogical.

this HZ dependency of nice levels existed for many years, and the new
scheduler solves that inconsistency - every nice level will get the same
amount of time, regardless of HZ.
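
As a quick cross-check of the three listings above, the %CPU columns can be turned into nice-0 : nice +19 ratios. This is only a sketch over the numbers quoted in this mail; the dictionary and function names are made up for illustration.

```python
# %CPU pairs (nice 0 task, nice +19 task) copied from the three top listings,
# keyed by the CONFIG_HZ value of the kernel they were measured on.
measurements = {
    100: (90.9, 9.1),
    250: (96.1, 3.9),
    300: (95.1, 3.1),
}


def nice0_to_nice19_ratio(nice0_pct, nice19_pct):
    """Return how many times more CPU the nice-0 task received."""
    return nice0_pct / nice19_pct


ratios = {hz: nice0_to_nice19_ratio(*pair) for hz, pair in measurements.items()}

for hz, ratio in sorted(ratios.items()):
    print(f"CONFIG_HZ={hz}: 1:{ratio:.1f}")
```

The same nice level therefore works out to roughly 1:10, 1:25 and 1:31 depending on HZ.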

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: James Bruce <bruce@...>, Thomas Gleixner <tglx@...>, Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Monday, July 16, 2007 - 10:01 am

Hi,

You're correct that you can find artifacts in the extreme cases, it's
subjective whether this is a serious problem.
It's nice that these artifacts are gone, but that still doesn't explain
why this ratio had to be increased that much from around 1:10 to 1:69.

bye, Roman
-

To: Roman Zippel <zippel@...>
Cc: Ingo Molnar <mingo@...>, James Bruce <bruce@...>, Thomas Gleixner <tglx@...>, Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Monday, July 16, 2007 - 4:31 pm

More dynamic range is better? If you actually want a task to get 20x
the CPU time of another, the older scheduler doesn't really allow it.

Getting 1/69th of a modern CPU is still a fair number of cycles.
Never mind 1/69th of a machine with > 64 cores.

--
Mathematics is the supreme nostalgia of our time.
-

To: Matt Mackall <mpm@...>
Cc: Ingo Molnar <mingo@...>, James Bruce <bruce@...>, Thomas Gleixner <tglx@...>, Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Monday, July 16, 2007 - 5:25 pm

Hi,

You can already have that, the complete range level from 19 to -20 was
about 1:80.
There is also such a thing as too much range: I tried it with top at 19 and
as soon as something runs at -20 top is practically dead, because it now
gets only 1/5900 of cpu time.

bye, Roman
-

To: Roman Zippel <zippel@...>
Cc: Matt Mackall <mpm@...>, James Bruce <bruce@...>, Thomas Gleixner <tglx@...>, Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Tuesday, July 17, 2007 - 3:53 am

But that is irrelevant: all tasks start out at nice 0, and what matters
is the dynamic range around 0.

So the dynamic range has been made uniform in the positive from
1:10...1:20...1:30 to 1:69 for nice +19, and from 1:8 to 1:69 in the
minus. (with 1:86 nice -20) If you look at the negative nice levels
alone it's a substantial increase but if you compare it with positive
nice levels you'll see that similar kinds of dynamic ranges were already present
in the old scheduler and you'll see why we've done it.
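
The 1:69 and 1:86 figures follow from a constant per-level multiplier. A minimal sketch, assuming the idealized multiplier of 1.25 per nice level that appears elsewhere in this thread (the kernel itself uses a table of rounded integer weights, so real ratios can differ slightly); `STEP` and `cpu_ratio` are illustrative names only.

```python
STEP = 1.25  # assumed idealized per-nice-level weight multiplier


def cpu_ratio(nice_low, nice_high):
    """Weight ratio between a task at nice_low and a task at nice_high.

    nice_low is the numerically smaller (more favoured) level.
    """
    return STEP ** (nice_high - nice_low)


plus19 = cpu_ratio(0, 19)       # nice 0 against nice +19
minus20 = cpu_ratio(-20, 0)     # nice -20 against nice 0
full_range = cpu_ratio(-20, 19) # nice -20 against nice +19

print(f"nice 0 vs +19:   1:{plus19:.1f}")
print(f"nice -20 vs 0:   1:{minus20:.1f}")
print(f"nice -20 vs +19: 1:{full_range:.0f}")
```

Under this assumption the full span from -20 to +19 comes out at roughly 1:6000, the same order as the 1/5900 figure mentioned earlier in the thread.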

Negative nice levels are admin-controlled, the increase in the negative
levels is not a big issue and people actually like the increased
dynamic range and the consistency. The positive range _might_ be a
bigger issue but there we were largely inconsistent anyway, and again,
people like the increased dynamic range.

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: Matt Mackall <mpm@...>, James Bruce <bruce@...>, Thomas Gleixner <tglx@...>, Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Tuesday, July 17, 2007 - 11:12 am

Hi,

So let's look at them:

for (i=0;i<20;i++) print i, " : ", (20-i)*5, " : ", 100*1.25^-i, " : ", e(l(2)*(-i/5))*100, "\n";
0 : 100 : 100 : 100.00000000000000000000
1 : 95 : 80.00000000000000000000 : 87.05505632961241391300
2 : 90 : 64.00000000000000000000 : 75.78582832551990411700
3 : 85 : 51.20000000000000000000 : 65.97539553864471296900
4 : 80 : 40.96000000000000000000 : 57.43491774985175034000
5 : 75 : 32.76800000000000000000 : 50.00000000000000000000
6 : 70 : 26.21440000000000000000 : 43.52752816480620695700
7 : 65 : 20.97152000000000000000 : 37.89291416275995205900
8 : 60 : 16.77721600000000000000 : 32.98769776932235648400
9 : 55 : 13.42177280000000000000 : 28.71745887492587517000
10 : 50 : 10.73741824000000000000 : 25.00000000000000000000
11 : 45 : 8.58993459200000000000 : 21.76376408240310347800
12 : 40 : 6.87194767360000000000 : 18.94645708137997602900
13 : 35 : 5.49755813888000000000 : 16.49384888466117824200
14 : 30 : 4.39804651110400000000 : 14.35872943746293758500
15 : 25 : 3.51843720888320000000 : 12.50000000000000000000
16 : 20 : 2.81474976710656000000 : 10.88188204120155173900
17 : 15 : 2.25179981368524800000 : 9.47322854068998801400
18 : 10 : 1.80143985094819840000 : 8.24692444233058912100
19 : 5 : 1.44115188075855872000 : 7.17936471873146879200

(nice level : old % : new % : my suggested %)

Your levels diverge very quickly from what they used to be (up to a factor
of 7), it's also not really easy to remember what the individual levels
mean.
I at least try to keep them somewhat in the range they used to be (and
the difference is limited to a factor of about 2), also every 5 levels the
amount of cpu time is halved, which is very easy to remember.
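
The two factors quoted here can be checked directly from the three columns of the table. A small sketch, using the same formulas as the bc one-liner; the function names are only illustrative.

```python
def old_pct(nice):
    """Old scheduler column of the table: linear, (20 - nice) * 5."""
    return (20 - nice) * 5


def cfs_pct(nice):
    """New (CFS) column: 100 * 1.25^-nice."""
    return 100 * 1.25 ** -nice


def proposed_pct(nice):
    """Suggested column: CPU share halves every 5 nice levels."""
    return 100 * 2 ** (-nice / 5)


levels = range(0, 20)

# Largest factor by which the old value exceeds each alternative.
max_cfs_divergence = max(old_pct(n) / cfs_pct(n) for n in levels)
max_proposed_divergence = max(old_pct(n) / proposed_pct(n) for n in levels)

print(f"old vs new:       up to a factor of {max_cfs_divergence:.2f}")
print(f"old vs suggested: up to a factor of {max_proposed_divergence:.2f}")
```

The first maximum is reached around nice 15 to 16 and the second around nice 13, which matches the "factor of 7" and "factor of about 2" statements.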

If you need more dynamic range, is there a law that prevents us from going
beyond 19? For example:

for (i=20;i<=30;i++) print i, " : ", (20-i)*5, " : ", 100*1.25^-i, " : ", e(l(2)*(-i/5))*100, "\n";
20 : 0 : 1.15292150460684697600 : 6.25000000000000000000
21 : -5 : .92233720368547758000 : 5.44094102...

To: Matt Mackall <mpm@...>
Cc: Roman Zippel <zippel@...>, James Bruce <bruce@...>, Thomas Gleixner <tglx@...>, Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Monday, July 16, 2007 - 5:18 pm

yeah. furthermore, nice -20 is only admin-selectable.

Here are the current CPU-use values for positive nice levels:

nice 0: 100.00%
nice 1: 80.00%
nice 2: 64.10%
nice 3: 51.28%
nice 4: 40.98%
nice 5: 32.78%
nice 6: 26.24%
nice 7: 21.00%
nice 8: 16.77%
nice 9: 13.42%
nice 10: 10.74%
nice 11: 8.59%
nice 12: 6.87%
nice 13: 5.50%
nice 14: 4.39%
nice 15: 3.51%
nice 16: 2.81%
nice 17: 2.25%
nice 18: 1.80%
nice 19: 1.44%

here's the CPU utilization table for negative nice levels (relative to a
nice -20 task):

nice 0: 1.15%
nice -1: 1.44%
nice -2: 1.80%
nice -3: 2.25%
nice -4: 2.81%
nice -5: 3.51%
nice -6: 4.39%
nice -7: 5.50%
nice -8: 6.87%
nice -9: 8.59%
nice -10: 10.74%
nice -11: 13.42%
nice -12: 16.77%
nice -13: 21.00%
nice -14: 26.24%
nice -15: 32.78%
nice -16: 40.98%
nice -17: 51.28%
nice -18: 64.10%
nice -19: 80.00%
nice -20: 100.00%

these are pretty sane, and symmetric across the origin. Nice -20 is the
odd one out, because there is no nice +20. But its value is still
logical, it's the mirror image of an imaginary nice +20.

and note that even on the old scheduler, nice-0 was "3200% more
powerful" than nice +19 (with CONFIG_HZ=300), and nice -19 was only 700%
more powerful than nice-0. So not only was it inconsistent (and i can
create scary numbers too ;), it gave the admin-controlled negative nice
levels less of a punch than the user-controlled nice +19. A number of
people complained about that, and CFS addresses this.

in fact i like it that nice -20 has a slightly bigger punch than it used
to have before: it might remove the need to run audio apps (and other
multimedia apps) under SCHED_FIFO. (SCHED_FIFO is unprotected against
lockups, while under CFS a nice 0 task is still starvation protected
against a nice -20 task.)

furthermore, there is a quality of implementation issue as well, look at
the definition of the nice system call:

asmlinkage long sys_nice(int increm...

To: Ingo Molnar <mingo@...>
Cc: Matt Mackall <mpm@...>, James Bruce <bruce@...>, Thomas Gleixner <tglx@...>, Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Monday, July 16, 2007 - 6:13 pm

Hi,

How did you get that value? At any HZ the ratio should be around 1:10

"Slightly bigger"??? You're joking, right?
Especially the user levels are doing something completely different now,
which may break user expectation. While the user couldn't expect anything
precise, it's still a big difference whether a process at nice 5 gets 75%
of the time or only 30%.

bye, Roman
-

To: Roman Zippel <zippel@...>
Cc: Matt Mackall <mpm@...>, James Bruce <bruce@...>, Thomas Gleixner <tglx@...>, Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Monday, July 16, 2007 - 6:29 pm

you are wrong again. I sent you the numbers earlier today already:

| PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
| 2332 mingo 25 0 1580 248 196 R 95.1 0.0 0:11.84 loop
| 2335 mingo 39 19 1576 244 196 R 3.1 0.0 0:00.39 loop

95.1% is 3067% of 3.1%, i.e. the ratio is 1:30.67. You again deny
above that this is the case, and there's nothing i can do about your
denial of facts - that is your own private problem.

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: Matt Mackall <mpm@...>, James Bruce <bruce@...>, Thomas Gleixner <tglx@...>, Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Monday, July 16, 2007 - 8:02 pm

Hi,

Ingo, how am I supposed to react to this? I'm asking a simple question
and I get this? I'm at a serious loss how to deal with you. :-(

Above is based on theoretical values, for a 300HZ kernel these two
processes should get 30 and 3 ticks. Should there be any rounding error or
off by one error so that the processes get one tick less than they should
get or one tick is accounted to the wrong process, my theoretical value is
still within the possible error range and doesn't contradict your
practical values.
Playing around with some other nice levels, confirms the theory that
something is a little off, so I'm quite correct at saying that the ratio
_should_ be 1:10.
OTOH you are the one who is wrong about me (again). :-(

bye, Roman
-

To: Ingo Molnar <mingo@...>
Cc: Matt Mackall <mpm@...>, James Bruce <bruce@...>, Thomas Gleixner <tglx@...>, Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Monday, July 16, 2007 - 11:20 pm

Hi,

Rechecking everything there was actually a small error in my test program,
so the ratio should be at 1:20. Sorry about that mistake.
Nice level 19 shows the largest artifacts, as that level only gets a
single tick, so the ratio is often 1:HZ/10 (except for 1000HZ where it's
5:100). Nevertheless it's still true that in general nice levels were
independent of HZ (that's all I wanted to say a couple of mails ago).
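
The tick argument can be illustrated with a toy model. This is a sketch under stated assumptions, not the 2.6.22 code: it assumes a nice-0 timeslice of 100 ms (HZ/10 ticks) and a nice +19 timeslice of 5 ms (the 1:20 ratio above), truncated to whole ticks with a floor of one tick. All names are hypothetical.

```python
NICE0_SLICE_MS = 100  # assumption: nice 0 gets HZ/10 ticks
NICE19_SLICE_MS = 5   # assumption: 1:20 of the nice-0 slice


def ticks_for_slice(slice_ms, hz):
    """Convert a timeslice in ms to whole timer ticks, never less than one."""
    return max(1, slice_ms * hz // 1000)


def nice19_share(hz):
    """Return (nice-0 ticks, nice-19 ticks, nice-19 CPU share in percent)."""
    t0 = ticks_for_slice(NICE0_SLICE_MS, hz)
    t19 = ticks_for_slice(NICE19_SLICE_MS, hz)
    return t0, t19, 100 * t19 / (t0 + t19)


for hz in (100, 250, 300, 1000):
    t0, t19, share = nice19_share(hz)
    print(f"HZ={hz}: {t19}:{t0} ticks, nice +19 gets {share:.1f}%")
```

In this model nice +19 is stuck at a single tick for HZ=100, 250 and 300, giving about 9.1%, 3.8% and 3.2%, close to the measured 9.1%, 3.9% and 3.1% posted earlier in the thread; only at HZ=1000 does the slice become 5:100.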

Ingo, you can start now gloating, but contrary to you I have no problems
with admitting mistakes and apologizing for them. The point is just that
I'm reacting better to factual arguments instead of flames (and I think

bye, Roman
-

To: Roman Zippel <zippel@...>
Cc: Matt Mackall <mpm@...>, James Bruce <bruce@...>, Thomas Gleixner <tglx@...>, Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Tuesday, July 17, 2007 - 4:02 am

Roman, please do me a favor, and ask me the following question:

" Ingo, you've been maintaining the scheduler for years. In fact you
wrote the old nice code we are talking about here. You changed it a
number of times since then. So you really know what's going on here.
Why does the old nice code behave like that for nice +19 levels? "

I've been waiting for that obvious question, and i _might_ be able to
answer it, but somehow it never occurred to you ;-) Thanks,

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: Matt Mackall <mpm@...>, James Bruce <bruce@...>, Thomas Gleixner <tglx@...>, Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Tuesday, July 17, 2007 - 10:06 am

Hi,

Do you have any idea how insulting and arrogant this is?
Let me translate for you, how this arrived:

"O Ingo, who art our god of the scheduler. You have blessed the paths I
walked in. You kept me from sinning numerous times. Your wisdom is
infinite. Guide me on the journey that layeth ahead of me into this world
knowledge of Your truth."

(I apologize already in advance, if I should have hurt anyone's religious
feelings.)

It's obvious that you have more experience with the scheduler code, but
does that make you infallible? Does that give you the right to act like a
jerk?
I do make mistakes, I try to learn from them and life goes on, I have no
problem with that, but what I have a problem with is if someone is abusing
this to his own advantage. I have to be extremely careful what I say to
you, because you jump on the first small mistake and I have to bear your
insults like "there's nothing i can do about your denial of facts - that
is your own private problem." I have no problems with facts, I'm only
trying very hard to ignore your arrogant behaviour...
If you have something to contribute to this discussion which might clear
things up, then just say it, but I'm not going to beg for it.

bye, Roman
-

To: Roman Zippel <zippel@...>
Cc: Matt Mackall <mpm@...>, James Bruce <bruce@...>, Thomas Gleixner <tglx@...>, Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Wednesday, July 18, 2007 - 6:40 am

Roman, it is really not about 'experience', and yes, we all make
frequent mistakes.

it's about the plain fact that i happened to write _both_ the old and
the new code you were talking about all along. In this discussion about
nice levels you were (very) aggressively asserting things that were
untrue, you were suggesting that i don't understand the code, instead of
simply asking me why the code was written in such a way and what the
motivation behind it was. I'd be glad to attempt to answer such a
friendly question, if you are interested in asking it and if you are
interested in my answer. Thanks,

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: Matt Mackall <mpm@...>, James Bruce <bruce@...>, Thomas Gleixner <tglx@...>, Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Wednesday, July 18, 2007 - 8:40 am

Hi,

Instead of simply asserting things, how about you provide some examples?
I made so far a single mistake of mixing up nice levels 18 and 19.
If you would point me to such examples, I could learn how to tone it down
a little, since the nice levels are not the only issue I have with the new
scheduler, the heavy stuff is still about to come. The problem here is
there is too much burnt ground so I can't just present raw ideas, which
get flamed by you, I have to be sufficiently confident they are valid,

Again, please point me to examples, so I at least have a chance to clear
things up, since it was never my intention to make such a suggestion, but
this gives me no chance to defend myself.

OTOH I can tell you exactly how you continuously insult me, e.g. by
suggesting I ask "stupid questions" or that I'm in "denial of facts".
Don't make such suggestions if you have no idea how insulting they are.
Especially the one deleted insult above where you have the impertinence to
quote it, such tone is more appropriate between lord and inferior, where
the latter have to make a request and the former "might" grant it.
_Never_ make me beg. :-(

bye, Roman
-

To: Roman Zippel <zippel@...>
Cc: Matt Mackall <mpm@...>, James Bruce <bruce@...>, Thomas Gleixner <tglx@...>, Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Wednesday, July 18, 2007 - 12:17 pm

uhm, [and the uninterested reader might want to skip to the next mail

the ";-)" emoticon (and its contents) clearly signals this as a
sarcastic, tongue-in-cheek remark. To make it even clearer, please

ok? (If you didn't see/read it as sarcastic straight away then my
apologies for insulting you!)

The "_might_ be able to answer" bit is of course sarcastic too, and
contrary to your (i have to say, pretty absurd) suggestion i did not
suggest that i "might be _willing_ to answer" - which would be quite
arrogant indeed and which i never said or suggested. To make it even
clearer: i'm definitely able to answer questions about code i wrote
originally and which i just changed, were you to show genuine interest
in hearing my opinion :-)

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: Matt Mackall <mpm@...>, James Bruce <bruce@...>, Thomas Gleixner <tglx@...>, Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Friday, July 20, 2007 - 9:38 am

Hi,

To take another example why is this still insulting and inappropriate,
this is a behaviour I would characterize as school bullying:
A bully attacks someone obviously weaker than himself and for example
takes something away and then continues like "If you ask nicely I'll give
it back to you.", this often accompanied by laughter to signal he's enjoying
himself and the power he has, but for the other person it's anything but
funny.

Maybe you don't know what it feels like, but I do and I can't find
anything funny, sarcastic or whatever about this, no matter how many
smileys or other tags you add there. If the communication is already that
troubled as this, such "humor" is really the worst thing you can do and I

Sorry, that is too little too late. You've apologized before and you
continued to make fun of me personally to the point of spreading wrong
information about me, which you could have very easily verified yourself,
if you only wanted.
What I want from you is that you treat me with respect and to keep your
"sarcasm" to yourself.

I told you very clearly how I think about you requoting this crap and yet
you repeat it again _twice_, so on the one hand I get this apology attempt
and on the other hand you continue to kick me in the crotch? How do you
think I am supposed to feel about this?

It's also always interesting what you don't respond to. I asked you for
examples which would prove the (rather strong) assertions you made about
me, what does it tell me now if you can't back up your statements?

bye, Roman
-

To: James Bruce <bruce@...>
Cc: Thomas Gleixner <tglx@...>, Roman Zippel <zippel@...>, Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andrea Arcangeli <andrea@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Monday, July 16, 2007 - 3:41 am

in any case more documentation is justified, so i've added some
clarification to the comments - see the patch below.

Ingo

------------------------>
Subject: sched: improve weight-array comments
From: Ingo Molnar <mingo@elte.hu>

improve the comments around the wmult array (which controls the weight
of niced tasks). Clarify that to achieve a 10% difference in CPU
utilization, a weight multiplier of 1.25 has to be used.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
kernel/sched.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -736,7 +736,9 @@ static void update_curr_load(struct rq *
*
* The "10% effect" is relative and cumulative: from _any_ nice level,
* if you go up 1 level, it's -10% CPU usage, if you go down 1 level
- * it's +10% CPU usage.
+ * it's +10% CPU usage. (to achieve that we use a multiplier of 1.25.
+ * If a task goes up by ~10% and another task goes down by ~10% then
+ * the relative distance between them is ~25%.)
*/
static const int prio_to_weight[40] = {
/* -20 */ 88818, 71054, 56843, 45475, 36380, 29104, 23283, 18626, 14901, 11921,
-
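
The relation between the 1.25 multiplier, the first row of the weight array and the "10% effect" can be sketched as follows. This is an illustration, not the kernel code: the nice-0 weight of 1024 is an assumption chosen because it reproduces the ten values visible in the patch context, and `weight` and `shares` are hypothetical helpers.

```python
NICE_0_WEIGHT = 1024  # assumption: weight of a nice-0 task
STEP = 1.25           # multiplier named in the patch comment


def weight(nice):
    """Idealized weight for a nice level: NICE_0_WEIGHT * 1.25^-nice, rounded."""
    return round(NICE_0_WEIGHT * STEP ** -nice)


def shares(nice_a, nice_b):
    """CPU shares in percent of two CPU-bound tasks competing on one CPU."""
    wa, wb = weight(nice_a), weight(nice_b)
    total = wa + wb
    return 100 * wa / total, 100 * wb / total


# Nice -20 .. -11, to compare with the first row of prio_to_weight[].
first_row = [weight(n) for n in range(-20, -10)]
print(first_row)

# Two tasks one nice level apart: one ends up ~10% above the even 50% split,
# the other ~10% below it, so the distance between them is ~25%.
a, b = shares(0, 1)
print(f"nice 0: {a:.1f}%  nice 1: {b:.1f}%  ratio {a / b:.2f}")
```

With these assumptions two tasks one level apart split the CPU at about 55.6% to 44.4%, a ratio of about 1.25.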

To: Ingo Molnar <mingo@...>
Cc: Thomas Gleixner <tglx@...>, Roman Zippel <zippel@...>, Mike Galbraith <efault@...>, Linus Torvalds <torvalds@...>, Andrea Arcangeli <andrea@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Chris Wright <chrisw@...>
Date: Monday, July 16, 2007 - 11:02 am

Ah ok so it's 10% of the original CPU usage, not relative to a task's
share from before. While I guess I still think in terms of relative CPU
share, your comments now make sense to me. Thanks for the
clarification.

-

To: Andi Kleen <andi@...>
Cc: Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Linus Torvalds <torvalds@...>, Chris Wright <chrisw@...>
Date: Wednesday, July 11, 2007 - 5:46 pm

Andi,

You promised privately to do a thorough review as well, which I'm still

There is no step by step thing. You convert an arch to clock events or

If you have technical objections, put them on the table. Point by point.

All I heard so far from you are platitudes, which are not worth the
electrons to transport them.

tglx

-

To: Thomas Gleixner <tglx@...>
Cc: Andi Kleen <andi@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Linus Torvalds <torvalds@...>, Chris Wright <chrisw@...>
Date: Wednesday, July 11, 2007 - 6:18 pm

I did some reviewing, but never the big write up and feedback. That was my
fault, sorry.

-Andi
-

To: Thomas Gleixner <tglx@...>
Cc: Andi Kleen <andi@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Linus Torvalds <torvalds@...>, Chris Wright <chrisw@...>
Date: Wednesday, July 11, 2007 - 5:52 pm

Indeed, about the only thing that can be done is to take a slower approach
to converging the arch specific implementations (hpet, pit, etc).

thanks,
-chris
-

To: Ingo Molnar <mingo@...>
Cc: Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Arjan van de Ven <arjan@...>, Linus Torvalds <torvalds@...>, Chris Wright <chrisw@...>
Date: Wednesday, July 11, 2007 - 5:02 pm

I can understand being disappointed, but not quite as upset as you
appear to be.

so have you (Ingo) reviewed the ext4 patches? or reiser4 patches?
or lumpy reclaim? or anti-fragmentation?

I certainly haven't. I can barely keep up with reading about 1/2
of lkml emails. And in my non-scientific method, I think that we
are suffering from both (a) more patch submittals and (b) fewer
qualified reviewers (per kernel KLOC) than we had 3-5 years ago.

I don't see how you can expect Andrew to review these or any other
specific patchset. Do you have some suggestions on how to clone
Andrew?

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
-

To: Randy Dunlap <rdunlap@...>
Cc: Ingo Molnar <mingo@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Linus Torvalds <torvalds@...>, Chris Wright <chrisw@...>
Date: Wednesday, July 11, 2007 - 5:39 pm

Randy,

Ingo was talking to Andi, the x86_64 maintainer, not to Andrew.

And I share his opinion that the maintainer of the subsystem, which is
affected by such a fundamental patch, could have at least shown any
public sign of interest, disgust, comment or what ever in a 3+ month
time frame.

Especially about a patch, which is a logical consequence of an almost
two years public and transparent effort to consolidate the time code in
the kernel.

I for my part have no problem maintaining the set for another round out
of tree and weed out eventual problems in -mm, but my expectation for
qualified response of the responsible maintainer is exactly zero right
now.

Thanks,

tglx

-

To: Thomas Gleixner <tglx@...>
Cc: Ingo Molnar <mingo@...>, Andi Kleen <andi@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Arjan van de Ven <arjan@...>, Linus Torvalds <torvalds@...>, Chris Wright <chrisw@...>
Date: Wednesday, July 11, 2007 - 7:21 pm

Yep, I see that when I re-read it. I apologize.

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
-

To: Andi Kleen <andi@...>
Cc: Andrew Morton <akpm@...>, <linux-kernel@...>
Date: Wednesday, July 11, 2007 - 1:33 pm

For the mtrr trim patch at least, I think the coverage we've received
in -mm is probably sufficient (the failure mode would be fairly
obvious). The only thing I'm nervous about is adding AMD support for
the quirk, since I don't have any way of testing it. We can easily add
that later though, if a tester steps forward or we see demand for it
(should just be an extra conditional in the trim code).

Thanks,
Jesse
-

To: Andrew Morton <akpm@...>
Cc: <linux-kernel@...>, Linux Memory Management <linux-mm@...>
Date: Wednesday, July 11, 2007 - 8:54 pm

The extra work may turn out to be too much for you (although it is nothing
exactly tricky that would introduce subtle bugs, it is a fair amount of churn).

However, in that case we can still merge these two:

mm-fix-fault-vs-invalidate-race-for-linear-mappings.patch
mm-fix-clear_page_dirty_for_io-vs-fault-race.patch

Which fix real bugs that need fixing (and will at least help to get some of
my patches off your hands).

--
SUSE Labs, Novell Inc.
-

To: Nick Piggin <nickpiggin@...>
Cc: Andrew Morton <akpm@...>, <linux-kernel@...>, Linux Memory Management <linux-mm@...>, linux-fsdevel <linux-fsdevel@...>, xfs-oss <xfs@...>
Date: Wednesday, July 11, 2007 - 10:31 pm

OK, so does that mean we can finally get the block_page_mkwrite
patches merged?

i.e.:

http://marc.info/?l=linux-kernel&m=117426058311032&w=2
http://marc.info/?l=linux-kernel&m=117426070111136&w=2

I've got up-to-date versions of them ready to go and they've been
consistently tested thanks to the XFSQA test I wrote for the bug
that it fixes. I've been holding them out-of-tree for months now
because ->fault was supposed to supersede this interface.....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
-

To: David Chinner <dgc@...>
Cc: Andrew Morton <akpm@...>, <linux-kernel@...>, Linux Memory Management <linux-mm@...>, linux-fsdevel <linux-fsdevel@...>, xfs-oss <xfs@...>
Date: Wednesday, July 11, 2007 - 10:42 pm

Yeah, as I've said, don't hold them back because of me. They are
relatively simple enough that I don't see why they couldn't be
merged in this window.

--
SUSE Labs, Novell Inc.
-

To: Andrew Morton <akpm@...>
Cc: <linux-kernel@...>, IDE/ATA development list <linux-ide@...>, netdev <netdev@...>, Tejun Heo <htejun@...>
Date: Tuesday, July 10, 2007 - 1:42 pm

(just to provide my indicator of status)

see Alan's comments. I've been ignoring pata_acpi for a while, because

are other pata_platform people happy with this? I don't know embedded

should be combined, really. will merge eventually. basic concept OK,

Needs a bug fix, so that the newly modified loop doesn't scan the final

Any of the above worth 2.6.23? Just wondering if they were useful

Just the general march of progress on new hardware :)

I would like to see this support merged in /some/ form. We've been
telling Intel for years they were sillyheads for not bothering with an
IOMMU. Now that they have, we should give them a cookie and support
good technology.

Jeff

-

To: Jeff Garzik <jeff@...>
Cc: Andrew Morton <akpm@...>, <linux-kernel@...>, <netdev@...>
Date: Tuesday, July 10, 2007 - 3:56 pm

Hello.

Now that the fix for CONFIG_PCI=n has been merged, what's left is to test

You should have, I was sending it to you.

WBR, Sergei
-

To: Jeff Garzik <jeff@...>
Cc: <linux-kernel@...>, IDE/ATA development list <linux-ide@...>, netdev <netdev@...>, Tejun Heo <htejun@...>, Alan Cox <alan@...>, Deepak Saxena <dsaxena@...>, Dan Faerch <dan@...>, Benjamin LaHaise <bcrl@...>
Date: Tuesday, July 10, 2007 - 2:24 pm

On Tue, 10 Jul 2007 13:42:16 -0400

This is just a silly remove-unneeded-cast-of-void* cleanup. I wrote this
as a fixup against
libata-add-irq_flags-to-struct-pata_platform_info.patch with the intention
of folding it into that base patch, but you went and merged the submitter's
original patch so this trivial fixup got stranded in -mm. Feel free to give

Oh, I thought these were the patches which affected scsi and which James

3x59x-fix-pci-resource-management.patch: you wrote it ;) I have a comment
here:

- I don't remember the story with cardbus either. Presumably once upon a
time the cardbus layer was claiming IO regions on behalf of cardbus
devices (?)

Need to think about that.

update-smc91x-driver-with-arm-versatile-board-info.patch:

See comment from rmk in changelog:
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.22-rc6...

Deepak, can we move this along a bit please?

drivers-net-ns83820c-add-paramter-to-disable-auto.patch:

See comments in changelog: ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.22-rc6...

The first few patches will a) fix up our writev performance regression and
b) reintroduce the writev() deadlock which the writev()-regression-adding
patch fixed.

OK, thanks.
-

To: Andrew Morton <akpm@...>
Cc: Jeff Garzik <jeff@...>, <linux-kernel@...>, IDE/ATA development list <linux-ide@...>, netdev <netdev@...>, Tejun Heo <htejun@...>, Alan Cox <alan@...>, Deepak Saxena <dsaxena@...>, Benjamin LaHaise <bcrl@...>
Date: Wednesday, July 11, 2007 - 12:47 pm

Mmm.. Ben had 2 comments last year:

I know very little about hardware and only own the fiber version of this
card. Even if I tried to make code for the copper version, it would
probably blow up the phy and set the switches on fire ;).

This is pretty much Russian to me.
I wouldn't know where to find the "link-autonegotiation-state-machine-for-fibre-cards" or know what to do with it anyway :).

The "disable_autoneg" is a convenient feature (for me and the other guy who made the same patch last year) and I consider it a harmless feature in every way.
It is simply an 'if'-statement that skips the "start autoneg" function upon load.
We can simply remove the feature entirely if it is deemed undesirable.

So in conclusion:
- I vote "use the patch as-is", but I'm fine with it being changed.
- If it needs support for copper, someone else has to code it.

Regards
- Dan

-

To: Andrew Morton <akpm@...>
Cc: Jeff Garzik <jeff@...>, <linux-kernel@...>, <netdev@...>
Date: Tuesday, July 10, 2007 - 4:31 pm

Hello.

WBR, Sergei
-

To: Sergei Shtylyov <sshtylyov@...>
Cc: Jeff Garzik <jeff@...>, <linux-kernel@...>, <netdev@...>
Date: Tuesday, July 10, 2007 - 4:35 pm

On Wed, 11 Jul 2007 00:31:23 +0400

yup, that's what "I have a comment" meant ;)

The comment seems rather bogus actually. Let's just merge it.

-

To: Andrew Morton <akpm@...>
Cc: <linux-kernel@...>, IDE/ATA development list <linux-ide@...>, netdev <netdev@...>, Tejun Heo <htejun@...>, Alan Cox <alan@...>, Deepak Saxena <dsaxena@...>, Dan Faerch <dan@...>, Benjamin LaHaise <bcrl@...>
Date: Tuesday, July 10, 2007 - 2:57 pm

I'm sorry, I didn't look closely enough. I was referring to the

hrm. ISTR James wanted some cleanups, Kristen did some cleanups, then
looking at the cleanups decided they were needed / appropriate at this time.

Anyway, these are in my mbox queue and the libata portions (of which the
code is the majority) seem OK. Need to give them a final review.

Jeff

-

To: Andrew Morton <akpm@...>
Cc: Jeff Garzik <jeff@...>, <linux-kernel@...>, IDE/ATA development list <linux-ide@...>, netdev <netdev@...>, Tejun Heo <htejun@...>, Alan Cox <alan@...>, Deepak Saxena <dsaxena@...>, Dan Faerch <dan@...>, Benjamin LaHaise <bcrl@...>
Date: Tuesday, July 10, 2007 - 2:55 pm

Well ... my concern was really how to make them more generic ... ahci
isn't the only controller that can do phy power management, and it also
seemed to me that the most generic entity for power management was the
transport rather than the SCSI mid-layer, but that debate is still
ongoing.

James

-

To: Andrew Morton <akpm@...>
Cc: <linux-kernel@...>, Nelson, Shannon <shannon.nelson@...>, Leech, Christopher <christopher.leech@...>
Date: Tuesday, July 10, 2007 - 12:31 pm

Chris is a moving target. Thankfully we have Shannon Nelson taking over Chris'
duties. Shannon, can you take a look at these and see what needs to happen to
them? Most likely these just need to be pushed to the right person.

Cheers,

Auke

PS: I think we should add an I/OAT / DMA engine section in the MAINTAINERS...
-

To: Kok, Auke-jan H <auke-jan.h.kok@...>, Andrew Morton <akpm@...>
Cc: <linux-kernel@...>, Leech, Christopher <christopher.leech@...>
Date: Tuesday, July 10, 2007 - 2:05 pm

Auke: Thanks for the introduction :-).

Andrew: All three of these patches are reasonable and can be pushed on
up. You can add my sign-off to all three:

I'll be posting a MAINTAINERS patch Real Soon Now with my name on
IOAT/DMA.

sln
======================================================================
Mr. Shannon Nelson LAN Access Division, Intel Corp.
Shannon.Nelson@intel.com I don't speak for Intel
(503) 712-7659 Parents can't afford to be squeamish.
-

To: Nelson, Shannon <shannon.nelson@...>
Cc: Kok, Auke-jan H <auke-jan.h.kok@...>, <linux-kernel@...>, Leech, Christopher <christopher.leech@...>
Date: Tuesday, July 10, 2007 - 2:47 pm

On Tue, 10 Jul 2007 11:05:45 -0700

OK, the way it works is that I send these patches at the git tree
maintainer, then the git tree maintainer merges them (this step is
unreliable) and then when I repull that git tree maintainer's tree I see
that they got merged so I drop them from -mm. The git tree maintainer
decides when to send them to Linus.

I am presently pulling git://lost.foo-projects.org/~cleech/linux-2.6#master
into -mm.

Will you be taking over the IOAT git tree? If so, please send me a
suitable git URL when it's ready.

The above tree has several changes in it from January (see
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.22-rc6...).
Please take a look at those, work out what we should do with it all.

-

To: Andrew Morton <akpm@...>
Cc: Kok, Auke-jan H <auke-jan.h.kok@...>, <linux-kernel@...>, Leech, Christopher <christopher.leech@...>
Date: Tuesday, July 10, 2007 - 5:18 pm

I'll be getting there Real Soon Now. The transition seems to be a

Will do. Thanks for your patience.

sln
======================================================================
Mr. Shannon Nelson LAN Access Division, Intel Corp.
Shannon.Nelson@intel.com I don't speak for Intel
(503) 712-7659 Parents can't afford to be squeamish.
-

To: Andrew Morton <akpm@...>
Cc: <linux-kernel@...>, Mel Gorman <mel@...>, Christoph Lameter <clameter@...>
Subject: clam
Date: Tuesday, July 10, 2007 - 8:37 am

Andrew Morton wrote:

The lumpy reclaim patches originally came out of work to support Mel's
anti-fragmentation work. As such I think they have become somewhat
attached to those patches. Whilst lumpy is most effective where
placement controls are in place as offered by Mel's work, we see benefit
from reduction in the "blunderbuss" effect when we reclaim at higher
orders. While placement control is pretty much required for the very
highest orders such as huge page size, lower order allocations are
benefited in terms of lower collateral damage.

There are now a few areas other than huge page allocations which can
benefit. Stacks are still order 1. Jumbo frames want higher order
contiguous pages for their incoming hardware buffers. SLUB is showing
performance benefits from moving to a higher allocation order. All of
these should benefit from more aggressive targeted reclaim, indeed I
have been surprised just how often my test workloads trigger lumpy at
order 1 to get new stacks.

Truly representative work loads are hard to generate for some of these.
Though we have heard some encouraging noises from those who can
reproduce these problems.

[...]

-apw
-

To: Andy Whitcroft <apw@...>
Cc: Andrew Morton <akpm@...>, <linux-kernel@...>, Mel Gorman <mel@...>, Christoph Lameter <clameter@...>, Peter Zijlstra <a.p.zijlstra@...>
Date: Wednesday, July 11, 2007 - 5:34 am

[Seems a PEBKAC occurred on the subject line, resending lest it become a
victim of "oh that's spam".]

-

To: Andy Whitcroft <apw@...>
Cc: <linux-kernel@...>, Mel Gorman <mel@...>, Christoph Lameter <clameter@...>, Peter Zijlstra <a.p.zijlstra@...>
Date: Wednesday, July 11, 2007 - 12:46 pm

I'd expect that the main application for lumpy-reclaim is in keeping a pool
of order-2 (say) pages in reserve for GFP_ATOMIC allocators. ie: jumbo
frames.

At present this relies upon the wakeup_kswapd(..., order) mechanism.

How effective is this at solving the jumbo frame problem?

(And do we still have a jumbo frame problem? Reports seem to have subsided)
-

To: Andrew Morton <akpm@...>
Cc: <linux-kernel@...>, Mel Gorman <mel@...>, Christoph Lameter <clameter@...>, Peter Zijlstra <a.p.zijlstra@...>
Date: Wednesday, July 11, 2007 - 2:38 pm

The tie in between allocator and kswapd is essentially unchanged,
so if allocators are dropping below the watermarks at the specified
order, reclaim will be triggered at that order. Reclaim continues
until we return above the high watermarks, at the order at which
we are reclaiming.
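
The trigger and stop conditions described above amount to a simple hysteresis. The following is only a toy user-space sketch of that idea, not the kernel's watermark code: struct toy_zone, its fields and toy_kswapd_pass() are hypothetical names invented for illustration, and "reclaim" is modelled as incrementing a counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy zone: tracks free blocks at one allocation order. */
struct toy_zone {
	unsigned long free_blocks;	/* free blocks at the order of interest */
	unsigned long low;		/* below this, reclaim is triggered */
	unsigned long high;		/* reclaim stops once this is reached */
};

/*
 * Toy kswapd pass: do nothing unless the zone is below the low watermark,
 * then "reclaim" one block at a time until the high watermark is reached.
 * Returns the number of blocks reclaimed.
 */
unsigned long toy_kswapd_pass(struct toy_zone *z)
{
	unsigned long reclaimed = 0;

	assert(z != NULL && z->low <= z->high);
	if (z->free_blocks >= z->low)
		return 0;
	while (z->free_blocks < z->high) {
		z->free_blocks++;
		reclaimed++;
	}
	return reclaimed;
}
```

With low = 4 and high = 8, a zone holding 3 free blocks is topped up to 8, while a zone holding 5 is left alone because it never dropped below the low mark.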

What lumpy brings is a greater targeting of effort to get the pages.
kswapd now uses the desired allocator order when applying reclaim.
This leads to pressure being applied to contiguous areas at the
required order, and so a higher chance of that order becoming
available. Traditional reclaim could end up applying pressure to
a number of pages, but not all pages in any area at the required
order, leading to a very low chance of success. By targeting
areas at the required order we significantly increase the chances
of success for any given amount of reclaim. As we will reclaim
until we have the desired number of free pages, we will have to
reclaim less to achieve this compared to random reclaim.
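
To make the argument above concrete, here is a toy user-space model, offered as a sketch under loud assumptions rather than as anything resembling mm/vmscan.c. Memory is an array of bytes (1 = in use, 0 = free), a block is an aligned run of 2^order pages, and both reclaim functions are hypothetical: reclaim_lumpy() spends its budget on whole blocks, reclaim_scattered() frees one page per block in turn to mimic an LRU order unrelated to physical placement.

```c
#include <assert.h>
#include <stddef.h>

/* Number of in-use pages in the block of `block` pages starting at `start`. */
size_t used_in_block(const unsigned char *pages, size_t start, size_t block)
{
	size_t used = 0;

	for (size_t i = 0; i < block; i++)
		used += pages[start + i] ? 1 : 0;
	return used;
}

/* Count aligned blocks of 2^order pages that are entirely free. */
size_t count_free_blocks(const unsigned char *pages, size_t npages,
			 unsigned int order)
{
	size_t block = (size_t)1 << order;
	size_t free_blocks = 0;

	assert(pages != NULL);
	for (size_t start = 0; start + block <= npages; start += block)
		if (used_in_block(pages, start, block) == 0)
			free_blocks++;
	return free_blocks;
}

/* Toy "lumpy" reclaim: every block it touches ends up completely free. */
size_t reclaim_lumpy(unsigned char *pages, size_t npages,
		     unsigned int order, size_t budget)
{
	size_t block = (size_t)1 << order;
	size_t reclaimed = 0;

	assert(pages != NULL);
	for (size_t start = 0; start + block <= npages; start += block) {
		size_t used = used_in_block(pages, start, block);

		/* Skip blocks that are already free or cost more than is left. */
		if (used == 0 || used > budget - reclaimed)
			continue;
		for (size_t i = 0; i < block; i++)
			pages[start + i] = 0;
		reclaimed += used;
	}
	return reclaimed;
}

/* Toy placement-blind reclaim: one in-use page per block, round robin. */
size_t reclaim_scattered(unsigned char *pages, size_t npages,
			 unsigned int order, size_t budget)
{
	size_t block = (size_t)1 << order;
	size_t reclaimed = 0;
	int progress = 1;

	assert(pages != NULL);
	while (reclaimed < budget && progress) {
		progress = 0;
		for (size_t start = 0; start + block <= npages; start += block) {
			if (reclaimed == budget)
				break;
			for (size_t i = 0; i < block; i++) {
				if (pages[start + i]) {
					pages[start + i] = 0;
					reclaimed++;
					progress = 1;
					break;
				}
			}
		}
	}
	return reclaimed;
}
```

On 16 fully used pages at order 2 with a budget of 4 pages, the lumpy variant leaves one free order-2 block and the scattered variant leaves none, although both reclaimed the same number of pages.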

This certainly is appealing intuitively, and our testing at higher
orders shows that the cost of each reclaimed page is lower and more
importantly the time to reclaim each page is reduced. So for a
'continuing' consumer like an incoming packet stream, we should
have to do much less work and thus disrupt the system as a whole
much less to get its pages.

Where demand for atomic higher order pages is not heavy we would
expect kswapd to maintain free page levels more readily and so
under higher demand. Though it should be stressed without placement
control success rates drop off significantly at higher orders as
the probability of reclaim succeeding on all pages in the area
subsided)

It is not in the least bit clear if the problem is resolved or if the
reporters have simply gone quiet.

Overall the approach taken in lumpy reclaim seems to be a logical
extension of the regular reclaim algorithm, leading to more
efficient reclaim.

-apw
-

To: Andrew Morton <akpm@...>
Cc: Andy Whitcroft <apw@...>, <linux-kernel@...>, Christoph Lameter <clameter@...>, Peter Zijlstra <a.p.zijlstra@...>
Date: Monday, July 16, 2007 - 6:37 am

The patches have an application with hugepage pool resizing.

When lumpy-reclaim is used with ZONE_MOVABLE, the hugepages pool can
be resized with greater reliability. Testing on a desktop machine with 2GB
of RAM showed that growing the hugepage pool with ZONE_MOVABLE on its own
was very slow as the success rate was quite low. Without lumpy-reclaim, each
attempt to grow the pool by 100 pages would yield 1 or 2 hugepages. With
lumpy-reclaim, getting 40 to 70 hugepages on each attempt was typical.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
-

To: <linux-kernel@...>
Cc: Christoph Hellwig <hch@...>, Al Viro <viro@...>, Miklos Szeredi <miklos@...>
Date: Tuesday, July 17, 2007 - 4:55 am

ping.
-

To: Andrew Morton <akpm@...>
Cc: <linux-kernel@...>, <rusty@...>, <linux-mm@...>
Date: Wednesday, July 11, 2007 - 8:23 am

> lguest-export-symbols-for-lguest-as-a-module.patch

__put_task_struct is one of those no way in hell should this be exported
things because we don't want modules messing with task lifetimes.

Fortunately I can't find anything actually using this in lguest, so
it looks like the issue has been solved in the meantime.

I also have a rather bad feeling about exporting access_process_vm.
This is the proverbial sledge hammer for access to user vm addresses
and I'd rather keep it away from module programmers with "if all
you have is a hammer ..." in mind.

In lguest this is used by send_dma which from my short reading of the
code seems to be the central IPC mechanism. The double copy here
doesn't look very efficient to me either. Maybe some VM folks could
look into a better way to achieve this that might be both more

Just started reading this (again) so no useful comment here, but it
would be nice if the code could follow CodingStyle and place the || and
&& at the end of the line in multiline conditionals instead of at the
beginning of the new one.
-

To: Christoph Hellwig <hch@...>
Cc: <linux-kernel@...>, <rusty@...>, <linux-mm@...>
Date: Wednesday, July 11, 2007 - 2:04 pm

On Wed, 11 Jul 2007 14:23:24 +0200

There are a couple of calls to put_task_struct() in there, and that needs

hm, well, access_process_vm() is a convenience wrapper around
-

To: Christoph Hellwig <hch@...>
Cc: Andrew Morton <akpm@...>, <linux-kernel@...>, <rusty@...>, <linux-mm@...>
Date: Wednesday, July 11, 2007 - 11:45 am

On Wed, 11 Jul 2007 14:23:24 +0200 Christoph Hellwig wrote:

I prefer them at the ends of lines also, but that's not in CodingStyle,
it's just how we do it most of the time (so "coding style", without
caps).
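
For readers unsure which two layouts are being compared, a small sketch follows. The struct and both function names are hypothetical and exist only to show operator placement; the two functions evaluate exactly the same condition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical descriptor of a byte range, used only for illustration. */
struct toy_span {
	size_t off;
	size_t len;
};

/* Operators at the end of each line, the placement preferred above. */
bool span_ok_trailing(const struct toy_span *s, size_t limit)
{
	assert(s != NULL);
	if (s->len != 0 &&
	    s->off <= limit &&
	    s->len <= limit - s->off)
		return true;
	return false;
}

/* The same condition with operators starting each continuation line. */
bool span_ok_leading(const struct toy_span *s, size_t limit)
{
	assert(s != NULL);
	if (s->len != 0
	    && s->off <= limit
	    && s->len <= limit - s->off)
		return true;
	return false;
}
```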

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
-

To: Christoph Hellwig <hch@...>
Cc: Andrew Morton <akpm@...>, <linux-kernel@...>, <linux-mm@...>
Date: Wednesday, July 11, 2007 - 9:21 pm

To do inter-guest (ie. inter-process) I/O you really have to make sure

It's not a double copy: it's a map & copy.

If KVM develops inter-guest I/O then this could all be extracted into a

Surprisingly, you have a point here. Since the key purpose of lguest is
as demonstration code, it should meticulously match kernel style.

I shall immediately prepare a patch to convert the rest of the kernel to
the correct "&& at beginning of line" style.

Rusty.

-

To: <rusty@...>
Cc: <hch@...>, <akpm@...>, <linux-kernel@...>, <linux-mm@...>
Date: Wednesday, July 11, 2007 - 10:28 pm

From: Rusty Russell <rusty@rustcorp.com.au>

You should just let it exit and when it does you receive some kind of
exit notification that resets your virtual device channel.

I think the reference counting approach is error and deadlock prone.
Be more loose and let the events reset the virtual devices when
guests go splat.
-

To: David Miller <davem@...>
Cc: <hch@...>, <akpm@...>, <linux-kernel@...>, <linux-mm@...>
Date: Wednesday, July 11, 2007 - 10:48 pm

There are two places where we grab task refcnt. One might be avoidable
(will test and get back) but the deferred wakeup isn't really:

/* We cache one process to wakeup: helps for batching & wakes outside locks. */
void set_wakeup_process(struct lguest *lg, struct task_struct *p)
{
	if (p == lg->wake)
		return;

	if (lg->wake) {
		wake_up_process(lg->wake);
		put_task_struct(lg->wake);
	}
	lg->wake = p;
	if (lg->wake)
		get_task_struct(lg->wake);
}

We drop the lock after I/O, and then do this wakeup. Meanwhile the
other task might have exited.
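
The reference counting in set_wakeup_process() can be followed in a user-space model. This is only a sketch: toy_task, toy_guest and the toy_* helpers are hypothetical stand-ins for task_struct, struct lguest, get_task_struct() and put_task_struct(), and the "wakeup" is just a counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for task_struct and struct lguest. */
struct toy_task {
	int refcount;	/* models the task's usage count */
	int wakeups;	/* how many times the task was "woken" */
};

struct toy_guest {
	struct toy_task *wake;	/* cached task to wake later, or NULL */
};

void toy_get_task(struct toy_task *t)
{
	t->refcount++;
}

void toy_put_task(struct toy_task *t)
{
	assert(t->refcount > 0);
	t->refcount--;
}

/* Same shape as set_wakeup_process(): flush the old entry, pin the new one. */
void toy_set_wakeup(struct toy_guest *g, struct toy_task *t)
{
	assert(g != NULL);
	if (t == g->wake)
		return;

	if (g->wake) {
		g->wake->wakeups++;	/* stands in for wake_up_process() */
		toy_put_task(g->wake);
	}
	g->wake = t;
	if (g->wake)
		toy_get_task(g->wake);
}
```

While a task sits in the cache its count is one higher than its owner's, which is what keeps the pointer valid if the owner exits before the deferred wakeup runs. Passing NULL flushes the cache and drops that extra reference.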

I could get rid of it, but I don't think there's anything wrong with the
code...

Cheers,
Rusty.

-

To: Rusty Russell <rusty@...>
Cc: David Miller <davem@...>, <hch@...>, <linux-kernel@...>, <linux-mm@...>
Date: Thursday, July 12, 2007 - 12:24 am

<handwaving>

We seem to be taking the reference against the wrong thing here. It should
be against the mm, not against a task_struct?
-

To: Andrew Morton <akpm@...>
Cc: David Miller <davem@...>, <hch@...>, <linux-kernel@...>, <linux-mm@...>
Date: Thursday, July 12, 2007 - 12:52 am

This is solely for the wakeup: you don't wake an mm 8)

The mm reference is held as well under the big lguest_mutex (mm gets
destroyed before files get closed, so we definitely do need to hold a
reference).

I just completed benchmarking: the cached wakeup with the current naive
drivers makes no difference (at one stage I was playing with batched
hypercalls, where it seemed to help).

Thanks Christoph, DaveM!
===
Remove export of __put_task_struct, and usage in lguest

lguest takes a reference count of tasks for two reasons. The first is
bogus: the /dev/lguest close callback will be called before the task
is destroyed anyway, so no need to take a reference on open.

The second is code to defer waking up tasks for inter-guest I/O, but
the current lguest drivers are too simplistic to benefit (only batched
hypercalls will see an effect, and it's likely that lguests' entire
I/O model will be replaced with virtio and ringbuffers anyway).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
---
drivers/lguest/hypercalls.c | 1 -
drivers/lguest/io.c | 18 +-----------------
drivers/lguest/lg.h | 1 -
drivers/lguest/lguest_user.c | 2 --
kernel/fork.c | 1 -
5 files changed, 1 insertion(+), 22 deletions(-)

===================================================================
--- a/drivers/lguest/hypercalls.c
+++ b/drivers/lguest/hypercalls.c
@@ -189,5 +189,4 @@ void do_hypercalls(struct lguest *lg)
 		do_hcall(lg, lg->regs);
 		clear_hcall(lg);
 	}
-	set_wakeup_process(lg, NULL);
 }
===================================================================
--- a/drivers/lguest/io.c
+++ b/drivers/lguest/io.c
@@ -296,7 +296,7 @@ static int dma_transfer(struct lguest *s

 	/* Do this last so dst doesn't simply sleep on lock. */
 	set_bit(dst->interrupt, dstlg->irqs_pending);
-	set_wakeup_process(srclg, dstlg->tsk);
+	wake_up_process(dstlg->tsk);
 	return i == dst->num_dmas;

fail:
@@ -333,7 +333,6 @@ ...

To: Rusty Russell <rusty@...>
Cc: Andrew Morton <akpm@...>, David Miller <davem@...>, <hch@...>, <linux-kernel@...>, <linux-mm@...>
Date: Thursday, July 12, 2007 - 7:10 am

What about

Open /dev/lguest
transfer fd using SCM_RIGHTS (or clone()?)
close fd in original task
exit()

?

My feeling is that if you want to be bound to a task, not a file, you
need to use syscalls, not ioctls.
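
For anyone unfamiliar with the SCM_RIGHTS step in the sequence above, a generic sketch of descriptor passing over a Unix-domain socket follows. Nothing here is lguest code: send_fd(), recv_fd() and hand_over_fd() are hypothetical helper names, and the descriptor could be any open file.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send descriptor fd over Unix-domain socket sock. Returns 0 or -1. */
int send_fd(int sock, int fd)
{
	char dummy = 'x';	/* one byte of ordinary data rides along */
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;	/* forces suitable alignment */
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	assert(fd >= 0);
	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a descriptor from sock. Returns the new descriptor or -1. */
int recv_fd(int sock)
{
	char dummy;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}

/* The sender's side of the sequence: pass the descriptor, then close ours. */
int hand_over_fd(int sock, int fd)
{
	if (send_fd(sock, fd) < 0)
		return -1;
	return close(fd);
}
```

The receiver's descriptor refers to the same open file, so the file stays open after the sender closes its copy and exits. That is why state tied to the opening task rather than to the file is awkward to express through a file interface.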

--
error compiling committee.c: too many arguments to function

-

To: Avi Kivity <avi@...>
Cc: Andrew Morton <akpm@...>, David Miller <davem@...>, <hch@...>, <linux-kernel@...>, carsteno <carsteno@...>
Date: Thursday, July 12, 2007 - 7:20 pm

"Don't do that". You'll lose the ability to access the operations on
the fd once you are no longer the original task (explicit check).

It's not an exact match, but a file is a remarkably convenient
abstraction for a non-ABI such as lguest. Of course, Carsten was
talking about unifying the lguest & kvm userspace interface, so this
could well change anyway.

Cheers,
Rusty.

-

To: Rusty Russell <rusty@...>
Cc: Andrew Morton <akpm@...>, David Miller <davem@...>, <hch@...>, <linux-kernel@...>, <linux-mm@...>
Date: Thursday, July 19, 2007 - 1:27 pm

The version that just got into mainline still has the __put_task_struct
export despite not needing it anymore. Care to fix this up?
-

To: Christoph Hellwig <hch@...>
Cc: Andrew Morton <akpm@...>, David Miller <davem@...>, <linux-kernel@...>, <linux-mm@...>
Date: Thursday, July 19, 2007 - 11:27 pm

No, it got patched in then immediately patched out again. Andrew
mis-mixed my patches, but there have been so many of them I find it hard
to blame him.

Rusty.

-

To: Rusty Russell <rusty@...>
Cc: Christoph Hellwig <hch@...>, Andrew Morton <akpm@...>, David Miller <davem@...>, <linux-kernel@...>, <linux-mm@...>
Date: Friday, July 20, 2007 - 3:15 am

Indeed, the export is gone in the latest mainline.
-

To: <rusty@...>
Cc: <hch@...>, <akpm@...>, <linux-kernel@...>, <linux-mm@...>
Date: Wednesday, July 11, 2007 - 10:51 pm

From: Rusty Russell <rusty@rustcorp.com.au>

I already understand what you're doing.

Is it possible to use exit notifiers to handle this case?
That's what I'm trying to suggest. :)
-

To: David Miller <davem@...>
Cc: <hch@...>, <akpm@...>, <linux-kernel@...>, <linux-mm@...>
Date: Wednesday, July 11, 2007 - 11:15 pm

Sure, the process has /dev/lguest open, so I can do something in the
close routine. Instead of keeping a reference to the tsk, I can keep a
reference to the struct lguest (currently it doesn't have or need a
refcnt). Then I need another lock, to protect lg->tsk.

This seems like a lot of dancing to avoid one export. If it's that
important I'd far rather drop the code and do a normal wakeup under the
big lguest lock for 2.6.23.

Cheers,
Rusty.

-

To: <rusty@...>
Cc: <hch@...>, <akpm@...>, <linux-kernel@...>, <linux-mm@...>
Date: Wednesday, July 11, 2007 - 11:35 pm

From: Rusty Russell <rusty@rustcorp.com.au>

I'm not against the export, so use it if it really helps.

Ref-counting just seems clumsy to me given how the hw assisted
-

To: Andrew Morton <akpm@...>
Cc: <linux-kernel@...>, <linux-fsdevel@...>
Date: Wednesday, July 11, 2007 - 8:00 am

Hopefully this will be done during the 2.6.23 merge window, but right now
it's not (yet).
-

To: Andrew Morton <akpm@...>
Cc: <linux-kernel@...>
Date: Wednesday, July 11, 2007 - 7:55 am

Umm, Andrew - mixing new userspace interface, completely rewritten
drivers and simple fixes in a simple misc category doesn't exactly
help reading this list :)
-

To: Andrew Morton <akpm@...>
Cc: <linux-kernel@...>, <linux-mm@...>
Date: Wednesday, July 11, 2007 - 7:39 am

> pagefault-in-write deadlock fixes. Will hold for 2.6.24.

Why that? This stuff has been in forever and is needed at various
levels. We need this in for anything to move forward on the buffered
write front.

-

To: Christoph Hellwig <hch@...>
Cc: <linux-kernel@...>, <linux-mm@...>, Nick Piggin <nickpiggin@...>
Date: Wednesday, July 11, 2007 - 1:23 pm

At Nick's request. More work is needed and the code hasn't had a lot of
-

To: Andrew Morton <akpm@...>
Cc: <linux-kernel@...>, <linux-scsi@...>
Date: Wednesday, July 11, 2007 - 7:37 am

Care to drop the patches James NACKed every single time?

-

To: Christoph Hellwig <hch@...>
Cc: <linux-kernel@...>, <linux-scsi@...>
Date: Wednesday, July 11, 2007 - 1:22 pm

I'm not aware of any which fit that description.

There may be a couple in there which fix real bugs in an unapproved way.
But I keep such patches as a matter of policy, so people keep on getting
pestered about their bugs.

-

To: Andrew Morton <akpm@...>
Cc: <dwmw2@...>, <linux-kernel@...>, <linux-fsdevel@...>
Date: Wednesday, July 11, 2007 - 7:35 am

NACK on this one. This bloats romfs by almost half of its previous
size to add mtd support to it. Given that romfs is a completely
trivial filesystem it's much better to have a separate filesystem
driver handling the format on mtd instead of adding all these
indirections. In addition to that argument the switch on the
underlying subsystem is done horribly. There are lots of ifdefs instead
of proper function pointers, there's one file containing both block
and mtd code instead of separate files, etc.
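
As a sketch of what "proper function pointers" could look like in place of ifdefs, the following user-space model dispatches reads through a per-backend operations table. It is not the romfs patch and not a proposal for its layout: every toy_* name is hypothetical, and both "devices" are plain memory so the example can run outside the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical backend interface: one table per kind of storage. */
struct toy_backend_ops {
	const char *name;
	int (*read)(const void *dev, size_t pos, void *buf, size_t len);
};

/* Hypothetical "block" device: a flat image. */
struct toy_block_dev {
	const unsigned char *image;
	size_t nbytes;
};

/* Hypothetical "mtd" device: a partition inside a larger flash array. */
struct toy_mtd_dev {
	const unsigned char *flash;
	size_t base;	/* partition start within flash */
	size_t nbytes;	/* partition size */
};

int toy_block_read(const void *dev, size_t pos, void *buf, size_t len)
{
	const struct toy_block_dev *b = dev;

	if (pos > b->nbytes || len > b->nbytes - pos)
		return -1;
	memcpy(buf, b->image + pos, len);
	return 0;
}

int toy_mtd_read(const void *dev, size_t pos, void *buf, size_t len)
{
	const struct toy_mtd_dev *m = dev;

	if (pos > m->nbytes || len > m->nbytes - pos)
		return -1;
	memcpy(buf, m->flash + m->base + pos, len);
	return 0;
}

const struct toy_backend_ops toy_block_ops = { "block", toy_block_read };
const struct toy_backend_ops toy_mtd_ops = { "mtd", toy_mtd_read };

/* Per-mount state: which backend is in use is decided once, at mount. */
struct toy_superblock {
	const struct toy_backend_ops *ops;
	const void *dev;
};

/* Filesystem code calls through the table; no #ifdef at the call site. */
int toy_fs_read(const struct toy_superblock *sb, size_t pos,
		void *buf, size_t len)
{
	assert(sb != NULL && sb->ops != NULL && sb->ops->read != NULL);
	return sb->ops->read(sb->dev, pos, buf, len);
}
```

Each backend could then live in its own file and be compiled in or out as a unit, with the shared code seeing only the table.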

And the get_unmapped_area method in a bare filesystem needs a _lot_
of explanation.

-

To: Christoph Hellwig <hch@...>
Cc: Andrew Morton <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>
Date: Wednesday, July 11, 2007 - 7:39 am

The rest of it is nacked anyway, until we unify the point and
get_unmapped_area methods of the MTD API.

--
dwmw2

-

To: David Woodhouse <dwmw2@...>
Cc: Christoph Hellwig <hch@...>, <linux-kernel@...>, <linux-fsdevel@...>, David Howells <dhowells@...>
Date: Wednesday, July 11, 2007 - 1:21 pm

Methinks you meant
nommu-make-it-possible-for-romfs-to-use-mtd-devices.patch, not
romfs-printk-format-warnings.patch.

I'll drop nommu-make-it-possible-for-romfs-to-use-mtd-devices.patch, thanks.
-

To: Andrew Morton <akpm@...>
Cc: David Woodhouse <dwmw2@...>, Christoph Hellwig <hch@...>, <linux-kernel@...>, <linux-fsdevel@...>, David Howells <dhowells@...>
Date: Wednesday, July 11, 2007 - 1:28 pm

Thanks. I was certainly getting confused.

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
-

To: Andrew Morton <akpm@...>, ck list <ck@...>, Ingo Molnar <mingo@...>, Paul Jackson <pj@...>, <linux-mm@...>
Cc: <linux-kernel@...>
Date: Tuesday, July 10, 2007 - 6:15 am

~swap prefetch

Nick's only remaining issue which I could remotely identify was to make it
cpuset aware:
http://marc.info/?l=linux-mm&m=117875557014098&w=2
as discussed with Paul Jackson it was cpuset aware:
http://marc.info/?l=linux-mm&m=117895463120843&w=2

I fixed all bugs I could find and improved it as much as I could last kernel
cycle.

Put me and the users out of our misery and merge it now or delete it forever
please. And if the meaningless handwaving that I 100% expect as a response
begins again, then that's fine. I'll take that as a no and you can dump it.

--
-ck
-

To: Con Kolivas <kernel@...>
Cc: Andrew Morton <akpm@...>, ck list <ck@...>, Ingo Molnar <mingo@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Monday, July 23, 2007 - 7:08 pm

For what it's worth; put me down as supporting the merger of swap
prefetch. I've found it useful in the past, Con has maintained it
nicely and cleaned up everything that people have pointed out - it's
mature, does no harm - let's just get it merged. It's too late for
2.6.23-rc1 now, but let's try and get this in by -rc2 - it's long
overdue...

--
Jesper Juhl <jesper.juhl@gmail.com>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html
-

To: Jesper Juhl <jesper.juhl@...>
Cc: Andrew Morton <akpm@...>, ck list <ck@...>, Ingo Molnar <mingo@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Monday, July 23, 2007 - 11:22 pm

Not talking about swap prefetch itself, but every time I have asked
anyone to instrument or produce some workload where swap prefetch
helps, they never do.

Fair enough if swap prefetch helps them, but I also want to look at
why that is the case and try to improve page reclaim in some of
these situations (for example standard overnight cron jobs shouldn't
need swap prefetch on a 1 or 2GB system, I would hope).

Anyway, back to swap prefetch, I don't know why I've been singled out
as the bad guy here. I'm one of the only people who has had a look at
the damn thing and tried to point out areas where it could be improved
to the point of being included, and outlining things that are needed
for it to be merged (ie. numbers). If anyone thinks that makes me the
bad guy then they have an utterly inverted understanding of what peer
review is for.

Finally, everyone who has ever hacked on these heuristicy parts of the
VM has heaps of patches that help some workload or some silly test
case or (real or perceived) shortfall but have not been merged. It
really isn't anything personal.

If something really works, then it should be possible to get real
numbers in real situations where it helps (OK, swap prefetching won't
be as easy as a straight line performance improvement, but still much
easier than trying to measure something like scheduler interactivity).

Numbers are the best way to add weight to the pro-merge argument, so
for all the people who are whining about merging this and don't want
to actually work on the code -- post some numbers for where it helps
you!!

--
SUSE Labs, Novell Inc.
-

To: Nick Piggin <nickpiggin@...>
Cc: Jesper Juhl <jesper.juhl@...>, Andrew Morton <akpm@...>, ck list <ck@...>, Ingo Molnar <mingo@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Tuesday, July 24, 2007 - 12:53 am

<Raised eyebrow> You sound frustrated. Perhaps we could be
communicating better. I'll start.

Unlike others on the cc: line, I don't get paid to hack on the kernel,
not even indirectly. So if you find that my lack of providing numbers
is giving you heartache, I can only apologize and point at my paying
work that requires my attention.

That said, I'm willing to run my day to day life through both a swap
prefetch kernel and a normal one. *However*, before I go through all
the work of instrumenting the damn thing, I'd really like Andrew (or
Linus) to lay out his acceptance criteria on the feature. Exactly what
*should* I be paying attention to? I've suggested keeping track of
process swapin delay total time, and comparing with and without. Is
that reasonable? Is it incomplete?

Without Andrew's criteria, we're back to where we've been for a long
time: lots of work, no forward motion. Perhaps it's a character flaw
of mine, but I'd really like to know what would constitute proof here
before I invest the effort. Especially given that Con has already
written a test case that shows that swap prefetch works, and that I've
given you a clear argument for why better (or even perfect) page
reclaim can't provide full coverage to all the situations that swap
prefetch helps. (Also, it's not like I've got tons of free time, y'know?
Just like all the rest of you all, I have to pick and choose my
battles if I'm going to be effective.)

Since this merge period has appeared particularly frazzling for
Andrew, I've been keeping silent and waiting for him to get to a point
where there's a breather. I didn't feel it would be polite to request
yet more work out of him while he had a mess on his hands.

But, given this has come to a head, I'm asking now.

Andrew? You've always given the impression that you want this run more
as an engineering effort than an artistic endeavour, so help us out
here. What are your concerns with swap prefetch? What sort of
comparative data would you like to see to jus...

To: Ray Lee <ray-lk@...>
Cc: Jesper Juhl <jesper.juhl@...>, Andrew Morton <akpm@...>, ck list <ck@...>, Ingo Molnar <mingo@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Tuesday, July 24, 2007 - 1:16 am

I don't feel it is so useful without more context. For example, in
most situations where pages get pushed to swap, there will *also* be
useful file backed pages being thrown out. Swap prefetch might
improve the total swapin delay time very significantly but that may
be just a tiny portion of the real problem.

Also, a random day at the desktop is quite a broad scope and
pretty well impossible to analyse. It would help if we could first try
looking at some specific problems that are easily identified.

Looking at your past email, you have a 1GB desktop system and your
overnight updatedb run is causing stuff to get swapped out such that
swap prefetch makes it significantly better. This is really
intriguing to me, and I would hope we can start by making this
particular workload "not suck" without swap prefetch (and hopefully
make it even better than it currently is with swap prefetch because
we'll try not to evict useful file backed pages as well).

After that we can look at other problems that swap prefetch helps
with, or think of some ways to measure your "whole day" scenario.

So when/if you have time, I can cook up a list of things to monitor
and possibly a patch to add some instrumentation over this updatedb
run.

Anyway, I realise swap prefetching has some situations where it will
fundamentally outperform even the page replacement oracle. This is
why I haven't asked for it to be dropped: it isn't a bad idea at all.

However, if we can improve basic page reclaim where it is obviously
lacking, that is always preferable. eg: being a highly speculative
operation, swap prefetch is not great for power efficiency -- but we
still want laptop users to have a good experience as well, right?

--
SUSE Labs, Novell Inc.
-

To: Nick Piggin <nickpiggin@...>
Cc: Jesper Juhl <jesper.juhl@...>, Andrew Morton <akpm@...>, ck list <ck@...>, Ingo Molnar <mingo@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Tuesday, July 24, 2007 - 12:15 pm

Agreed, it's important to make sure we're not being penny-wise and

It is pretty broad, but that's also what swap prefetch is targeting.
As for hard to analyze, I'm not sure I agree. One can black-box test
this stuff with only a few controls. e.g., if I use the same apps each
day (mercurial, firefox, xorg, gcc), and the total I/O wait time
consistently goes down on a swap prefetch kernel (normalized by some
control statistic, such as application CPU time or total I/O, or

Always easier, true. Let's start with "My mouse jerks around under
memory load." A Google Summer of Code student working on X.Org claims
that mlocking the mouse handling routines gives a smooth cursor under
load ([1]). It's surprising that the kernel would swap that out in the
first place.

updatedb is an annoying case, because one would hope that there would
be a better way to deal with that highly specific workload. It's also
pretty stat dominant, which puts it roughly in the same category as a
git diff. (They differ in that updatedb does a lot of open()s and
getdents on directories, git merely does a ton of lstat()s instead.)

Anyway, my point is that I worry that tuning for an unusual and
infrequent workload (which updatedb certainly is), is the wrong way to

That would be appreciated. Don't spend huge amounts of time on it,
okay? Point me the right direction, and we'll see how far I can run

Absolutely. Disk I/O is the enemy, and the best I/O is one you never
had to do in the first place.
-

To: Ray Lee <ray-lk@...>
Cc: Nick Piggin <nickpiggin@...>, Jesper Juhl <jesper.juhl@...>, Andrew Morton <akpm@...>, ck list <ck@...>, Ingo Molnar <mingo@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Wednesday, July 25, 2007 - 12:46 am

updatedb pushing out program data may be able to be improved on with drop
behind or similar.

however another scenario that causes a similar problem is when a user is
busy using one of the big memory hogs and then switches to another (think

you could make a synthetic test by writing a memory hog that allocates 3/4
of your ram then pauses waiting for input and then randomly accesses the
memory for a while (say randomly accessing 2x # of pages allocated) and
then pausing again before repeating

run two of these, alternating which one is running at any one time. time
how long it takes to do the random accesses.

the difference in this time should be a fair example of how much it would
impact the user.
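A minimal sketch of the memory hog described above, assuming a 4 KiB page size and a Python `bytearray` as the allocation. The names `run_hog` and `touch_randomly` are made up for illustration and this is not the program Rene posted. Each round waits, then times 2x as many random page touches as there are pages.

```python
import random
import time

PAGE_SIZE = 4096  # assumption: 4 KiB pages; check os.sysconf("SC_PAGE_SIZE")


def touch_randomly(buf, touches, rng):
    """Write one byte to `touches` randomly chosen pages; return seconds taken."""
    pages = len(buf) // PAGE_SIZE
    start = time.perf_counter()
    for _ in range(touches):
        buf[rng.randrange(pages) * PAGE_SIZE] = 1
    return time.perf_counter() - start


def run_hog(size_bytes, rounds=1, wait=lambda: None, seed=0):
    """Allocate size_bytes, then repeatedly wait and time a random-access pass."""
    rng = random.Random(seed)
    buf = bytearray(size_bytes)
    pages = size_bytes // PAGE_SIZE
    for offset in range(0, size_bytes, PAGE_SIZE):
        buf[offset] = 1  # fault every page in once so it is really resident
    timings = []
    for _ in range(rounds):
        wait()  # e.g. input, so two hogs can be alternated by hand
        timings.append(touch_randomly(buf, 2 * pages, rng))
    return timings
```

Running two copies with `wait=input` and a size of roughly 3/4 of RAM, and pressing enter in each alternately, would give the per-round timings the test asks for.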

by the way, I've also seen comments on the Postgres performance mailing
list about how slow linux is compared to other OS's in pulling data back
in that's been pushed out to swap (not a factor on dedicated database

almost always true, however there is some amount of I/O that is free with
today's drives (remember, they read the entire track into ram and then
give you the sectors on the track that you asked for). and if you have a
raid array this is even more true.

if you read one sector in from a raid5 array you have done all the same
I/O that you would have to do to read in the entire stripe, but I don't
believe that the current system will keep it all around if it exceeds the
readahead limit.

so in many cases readahead may end up being significantly cheaper than you
expect.

David Lang
-

To: <david@...>
Cc: Ray Lee <ray-lk@...>, Nick Piggin <nickpiggin@...>, Jesper Juhl <jesper.juhl@...>, Andrew Morton <akpm@...>, ck list <ck@...>, Ingo Molnar <mingo@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Wednesday, July 25, 2007 - 4:00 am

Notenotenote, not sure what you're going to show with it (times are simply
as horrendous as I'd expect) but thought I'd try to inject something other
than steaming cups of 4-letter beverages.

Rene.

To: Rene Herman <rene.herman@...>
Cc: Ray Lee <ray-lk@...>, Nick Piggin <nickpiggin@...>, Jesper Juhl <jesper.juhl@...>, Andrew Morton <akpm@...>, ck list <ck@...>, Ingo Molnar <mingo@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Wednesday, July 25, 2007 - 4:07 am

when the swap readahead is enabled does it make a significant difference
in the time to do the random access?

if it does that should show a direct benefit of the patch in a simulation
of a relatively common workflow (startup a memory hog like openoffice then
try and go back to your prior work)

David Lang

-

To: <david@...>
Cc: Ray Lee <ray-lk@...>, Nick Piggin <nickpiggin@...>, Jesper Juhl <jesper.juhl@...>, Andrew Morton <akpm@...>, ck list <ck@...>, Ingo Molnar <mingo@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Wednesday, July 25, 2007 - 4:29 am

I don't use swap prefetch (nor -ck or -mm). If someone who has the patch
applied waits to hit enter until swap prefetch has prefetched it all back in
again, it certainly will.

Swap prefetch's potential to do larger reads back from swapspace than a
random segfaulting app could well be very significant. Reads are dwarfed by
seeks. If this program does what you wanted, please use it to show us.

Rene.
-

To: Rene Herman <rene.herman@...>
Cc: Ray Lee <ray-lk@...>, Nick Piggin <nickpiggin@...>, Jesper Juhl <jesper.juhl@...>, Andrew Morton <akpm@...>, ck list <ck@...>, Ingo Molnar <mingo@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Wednesday, July 25, 2007 - 4:31 am

I haven't used swap prefetch either, the call was put out for what could
be used to test the performance, and I was suggesting a test.

if nobody else follows up on this I'll try to get some time to test it
myself in a day or two.

David Lang
-

To: Rene Herman <rene.herman@...>
Cc: Ray Lee <ray-lk@...>, Nick Piggin <nickpiggin@...>, Jesper Juhl <jesper.juhl@...>, Andrew Morton <akpm@...>, ck list <ck@...>, Ingo Molnar <mingo@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Wednesday, July 25, 2007 - 4:33 am

this assumes that this isn't ruled an invalid test in the meantime.

in any case thanks for coding this up so quickly.

David Lang
-

To: <david@...>
Cc: Ray Lee <ray-lk@...>, Nick Piggin <nickpiggin@...>, Jesper Juhl <jesper.juhl@...>, Andrew Morton <akpm@...>, ck list <ck@...>, Ingo Molnar <mingo@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Wednesday, July 25, 2007 - 6:58 am

Let's save a little time and guess. While two instances of the hog are
running no physical memory is free (as together they take up 1.5x physical)
meaning that swap-prefetch wouldn't get a chance to do anything and wouldn't
make a difference. As such, the two instances test as you suggested would in
fact not be testing anything it seems.

However, if you quit one, and idle long enough to continue with the other
one until swap-prefetch prefetched all its memory back in, it should be a
difference on the order of minutes, even total if swap prefetch fetched it
back in without seeking all over swap-space, and "total" isn't applicable if
the idle time really is free.

A program randomly touching single pages all over memory is a contrived
worst case scenario and not a real-world issue. It is a boundary condition
though, and it's simply quite impossible to think of any example where
swap-prefetch would _not_ give you a snappier feeling machine after you've
been idling.

So really the only question would seem to be -- does it hurt any if you have
_not_ been?

Rene.

-

To: david@lang.hm <david@...>, Al Boldi <a1426z@...>
Cc: Nick Piggin <nickpiggin@...>, Jesper Juhl <jesper.juhl@...>, Andrew Morton <akpm@...>, ck list <ck@...>, Ingo Molnar <mingo@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Wednesday, July 25, 2007 - 11:55 am

Hoo boy, lots of messages this morning.

(Al? I've added you to the CC: because of your swap-in vs swap-out
speed report from January. See below -- half-way down or so -- for
more details.)

Yes, and that was the core of my original report months ago. I'm
working for a while on one task, go to openoffice to view a report, or
gimp to tweak the colors on a photo before uploading it, and then go
back to my email and... and... and... there we go. The faults that

Con wrote a benchmark much like that. It showed measurable improvement

Yeah, akpm and... one of the usual suspects, had mentioned something
such as 2.6 is half the speed of 2.4 for swapin. (Let's see if I can
find a reference for that, it's been a year or more...) Okay,
misremembered. Swap in is half the speed of swap out (
http://lkml.org/lkml/2007/1/22/173 ). Al Boldi (added to the CC:, poor
sod), is the one who knows how to measure that, I'm guessing.

Al? How are you coming up with those figures? I'm interested in
reproducing it. It could be due to something stupid, such as the VM

Yeah, I knew I'd get called on that one :-). It's the seeks that'll
really kill you, and as you say once you're on the track the rest is
practically free (which is why the VM should prefer to evict larger
chunks at a time rather than lots of small things, see
http://lkml.org/lkml/2007/7/23/214 for something that's heading the

Fengguang Wu is doing lots of active work on making the readahead suck
less. Ping him and he'll likely take an active interest in the RAID
stuff.

Ray
-

To: Ray Lee <ray-lk@...>, david@lang.hm <david@...>
Cc: Nick Piggin <nickpiggin@...>, Jesper Juhl <jesper.juhl@...>, Andrew Morton <akpm@...>, ck list <ck@...>, Ingo Molnar <mingo@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Wednesday, July 25, 2007 - 4:16 pm

Thanks for asking. I'm rather surprised why nobody's noticing any of this
slowdown. To be fair, it's not really a regression, on the contrary, 2.4 is a
lot worse wrt swapin and swapout, and Rik van Riel even considers a 50%
swapin slowdown wrt swapout something like better than expected (see thread
'[RFC] kswapd: Kernel Swapper performance'). He probably meant random
swapin, which seems to offer a 4x slowdown.

There are two ways to reproduce this:

1. swsusp to disk reports ~44mb/s swapout, and ~25mb/s swapin during resume

2. tmpfs swapout is superfast, whereas swapin is really slow
(see thread '[PATCH] free swap space when (re)activating page')

Here is an excerpt from that thread (note machine config in first line):

============================================
RAM 512mb , SWAP 1G
#mount -t tmpfs -o size=1G none /dev/shm
#time cat /dev/full > /dev/shm/x.dmp
15sec
#time cat /dev/shm/x.dmp > /dev/null
58sec
#time cat /dev/shm/x.dmp > /dev/null
72sec
#time cat /dev/shm/x.dmp > /dev/null
85sec
#time cat /dev/shm/x.dmp > /dev/null
93sec
#time cat /dev/shm/x.dmp > /dev/null
99sec
============================================

As you can see, swapout is running full wirespeed, whereas swapin not only is
4x slower, but increasingly gets the VM tangled up to end at a ~6x slowdown.
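The ratios quoted above follow directly from the timings in the excerpt. A small sketch of the arithmetic, assuming the file filled the whole 1G tmpfs (1024 MB) and treating the 15 s write as the swapout figure, as the message does:

```python
SIZE_MB = 1024  # assumption: x.dmp filled the 1G tmpfs completely
WRITE_S = 15
READ_S = [58, 72, 85, 93, 99]


def slowdowns(write_s, read_s):
    """Return how many times slower each read pass was than the write."""
    return [secs / write_s for secs in read_s]


if __name__ == "__main__":
    print(f"write: {SIZE_MB / WRITE_S:.1f} MB/s")
    for n, (secs, ratio) in enumerate(zip(READ_S, slowdowns(WRITE_S, READ_S)), 1):
        print(f"read pass {n}: {SIZE_MB / secs:.1f} MB/s, {ratio:.1f}x slower")
```

The first read pass comes out just under 4x slower than the write, and the fifth about 6.6x.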

So again, I'm really surprised people haven't noticed.

Thanks!

--
Al

-

To: Al Boldi <a1426z@...>
Cc: Ray Lee <ray-lk@...>, david@lang.hm <david@...>, Nick Piggin <nickpiggin@...>, Jesper Juhl <jesper.juhl@...>, Andrew Morton <akpm@...>, ck list <ck@...>, Ingo Molnar <mingo@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Thursday, July 26, 2007 - 8:28 pm

Sorry for the late reply.
Well I think I reported this or another swap/tmpfs performance issue earlier ( http://marc.info/?t=116542915700004&r=1&w=2 ), we got the suggestion to increase /proc/sys/vm/page-cluster to 5, but we never got around to trying it.
Maybe this was the reason for my report to be almost entirely ignored, sorry for that.
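For scale: page-cluster is a base-2 logarithm, so the swap readahead window is 2^page-cluster pages. A sketch of what the suggested value means, assuming 4 KiB pages (the helper name is illustrative):

```python
PAGE_SIZE_KB = 4  # assumption: 4 KiB pages


def swap_readahead_kb(page_cluster):
    """Size in KiB of one swap readahead window for a given page-cluster."""
    return (1 << page_cluster) * PAGE_SIZE_KB


if __name__ == "__main__":
    for value in (3, 5):
        print(f"page-cluster={value}: {swap_readahead_kb(value)} KiB per read")
```

Going from 3 to 5 grows each swapin read from 8 pages to 32 pages.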

Regards,
Magnus
-

To: Ray Lee <ray-lk@...>
Cc: Jesper Juhl <jesper.juhl@...>, Andrew Morton <akpm@...>, ck list <ck@...>, Ingo Molnar <mingo@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Wednesday, July 25, 2007 - 12:06 am

I'm not saying that we can't try to tackle that problem, but first of
all you have a really nice narrow problem where updatedb seems to be
causing the kernel to completely do the wrong thing. So we start on

OK, I'm not sure what the point is though. Under heavy memory load,
things are going to get swapped out... and swap prefetch isn't going
to help there (at least, not during the memory load).

There are also other issues like whether the CPU scheduler is at fault,
etc. Interactive workloads are always the hardest to work out. updatedb

Yeah, and I suspect we might be able to do better use-once of
inode and dentry caches. It isn't really highly specific: lots
of things tend to just scan over a few files once -- updatedb

Well it runs every day or so for every desktop Linux user, and
it has similarities with other workloads. We don't want to optimise
it at the expense of other things, but it _really_ should not be

I guess /proc/meminfo, /proc/zoneinfo, /proc/vmstat, /proc/slabinfo
before and after the updatedb run with the latest kernel would be a
first step. top and vmstat output during the run wouldn't hurt either.
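One way to collect those files is a small snapshot helper run once before and once after updatedb. This is only a sketch; the function name, the output directory and the `proc_root` parameter are assumptions made for illustration.

```python
from pathlib import Path

PROC_FILES = ["meminfo", "zoneinfo", "vmstat", "slabinfo"]


def snapshot(label, out_dir="updatedb-stats", proc_root="/proc"):
    """Copy the /proc files into out_dir/<label>/ and return the saved paths."""
    dest = Path(out_dir) / label
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for name in PROC_FILES:
        try:
            text = (Path(proc_root) / name).read_text()
        except OSError:
            continue  # skip files that are missing or not readable by this user
        (dest / name).write_text(text)
        saved.append(dest / name)
    return saved
```

Calling `snapshot("before")`, running updatedb, then calling `snapshot("after")` leaves two directories that can be compared with diff.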

Thanks,
Nick

--
SUSE Labs, Novell Inc.
-

To: Nick Piggin <nickpiggin@...>
Cc: Ray Lee <ray-lk@...>, Jesper Juhl <jesper.juhl@...>, Andrew Morton <akpm@...>, ck list <ck@...>, Ingo Molnar <mingo@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Wednesday, July 25, 2007 - 4:46 pm

One simple way to fix this would be to implement a fadvise() flag
that puts the dentry/inode on a "soon to be expired" list if there
are no other references. Then if a dentry allocation needs more
memory try to reuse dentries from that list (or better queue) first. Any other
access will remove the dentry from the list.
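A toy userspace model of that idea, to make the intended behaviour concrete. Nothing here is kernel code; the class, its `use_once` flag and the capacity limit are invented for illustration.

```python
from collections import OrderedDict


class DentryCache:
    """Toy model: entries flagged use-once are reclaimed before anything else."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.active = OrderedDict()    # ordinary entries, oldest first
        self.expiring = OrderedDict()  # the "soon to be expired" queue

    def lookup(self, name, use_once=False):
        if name in self.active:
            # already referenced normally, so it stays off the queue
            self.active.move_to_end(name)
            return
        if name in self.expiring:
            if not use_once:
                # any other access removes the entry from the queue
                del self.expiring[name]
                self.active[name] = True
            return
        self._make_room()
        target = self.expiring if use_once else self.active
        target[name] = True

    def _make_room(self):
        if len(self.active) + len(self.expiring) < self.capacity:
            return
        victims = self.expiring or self.active  # prefer the expiring queue
        victims.popitem(last=False)
```

In this model a scan that flags everything use-once only ever recycles its own entries once the cache is full, leaving the ordinary entries alone.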

Disadvantage would be that the userland would need to be patched,
but I guess it's better than adding very dubious heuristics to the
kernel.

Similar thing could be done for directory buffers although they
are probably less of a problem.

I expect that C.Lameter's directed dentry/inode freeing in slub will also
make a big difference. People who have problems with updatedb should
definitely try mm which has it I believe and enable SLUB.

-Andi (who always thought swap prefetch was just a workaround, not
a real solution)
-

To: <linux-kernel@...>
Cc: <ck@...>, <linux-mm@...>
Date: Thursday, July 26, 2007 - 4:38 am

Are you going to change every single large memory application in the
world? As I wrote before, it is *not* about updatedb, but about all
applications that use a lot of memory, and then terminate.

Frank

-

To: Frank Kingswood <frank@...>
Cc: Andi Kleen <andi@...>, Nick Piggin <nickpiggin@...>, Ray Lee <ray-lk@...>, Jesper Juhl <jesper.juhl@...>, Andrew Morton <akpm@...>, ck list <ck@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Thursday, July 26, 2007 - 5:20 am

it is about multiple problems, _one_ problem is updatedb. The _second_
problem is large memory applications.

note that updatedb is not a "large memory application". It simply scans
through the filesystem and has pretty minimal memory footprint.

the _kernel_ ends up blowing up the dentry cache to a rather large size
(because it has no idea that updatedb uses every dentry only once).

Once we give the kernel the knowledge that the dentry won't be used again
by this app, the kernel can make a lot more intelligent decisions and not
balloon the dentry cache.

( we _do_ want to balloon the dentry cache otherwise - for things like
"find" - having a fast VFS is important. But known-use-once things
like the daily updatedb job can clearly be annotated properly. )

the 'large memory apps' are a second category of problems. And those are
where swap-prefetch could indeed help. (as long as it only 'fills up'
the free memory that a large-memory-exit left behind it.)

the 'morning after' phenomenon that the majority of testers complained
about will likely be resolved by the updatedb change. The second
category is likely an improvement too, for swap-happy desktop (and
server) workloads.

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: Frank Kingswood <frank@...>, Andi Kleen <andi@...>, Nick Piggin <nickpiggin@...>, Ray Lee <ray-lk@...>, Jesper Juhl <jesper.juhl@...>, ck list <ck@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Thursday, July 26, 2007 - 5:34 am

Mutter. /proc/sys/vm/vfs_cache_pressure has been there for what, three
years? Are any distros raising it during the updatedb run yet?
-

To: Andrew Morton <akpm@...>
Cc: Frank Kingswood <frank@...>, Andi Kleen <andi@...>, Nick Piggin <nickpiggin@...>, Ray Lee <ray-lk@...>, Jesper Juhl <jesper.juhl@...>, ck list <ck@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Thursday, July 26, 2007 - 5:40 am

but ... that's system-wide, and the 'don't balloon the dcache' is only a
property of updatedb. Still, it's useful to debug this thing.

below is an updatedb hack that sets vfs_cache_pressure down to 0 during
an updatedb run. Could someone who is affected by the 'morning after'
problem give it a try? If this works then we can think about any other
measures ...

Ingo

--- /etc/cron.daily/mlocate.cron.orig
+++ /etc/cron.daily/mlocate.cron
@@ -1,4 +1,7 @@
 #!/bin/sh
 nodevs=$(< /proc/filesystems awk '$1 == "nodev" { print $2 }')
 renice +19 -p $$ >/dev/null 2>&1
+PREV=`cat /proc/sys/vm/vfs_cache_pressure 2>/dev/null`
+echo 0 > /proc/sys/vm/vfs_cache_pressure 2>/dev/null
 /usr/bin/updatedb -f "$nodevs"
+[ "$PREV" != "" ] && echo $PREV > /proc/sys/vm/vfs_cache_pressure 2>/dev/null
-

To: Ingo Molnar <mingo@...>
Cc: Andrew Morton <akpm@...>, Frank Kingswood <frank@...>, Andi Kleen <andi@...>, Nick Piggin <nickpiggin@...>, Ray Lee <ray-lk@...>, Jesper Juhl <jesper.juhl@...>, ck list <ck@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Thursday, July 26, 2007 - 6:20 am

BTW, I really wonder how much pain could be avoided if updatedb recorded
mtime of directories and checked it. I.e. instead of just doing blind
find(1), walk the stored directory tree comparing timestamps with those
in filesystem. If directory mtime has not changed, don't bother rereading
it and just go for (stored) subdirectories. If it has changed - reread the
sucker. If we have a match for stored subdirectory of changed directory,
check inumber; if it doesn't match, consider the entire subtree as new
one. AFAICS, that could eliminate quite a bit of IO...
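A rough sketch of the walk described above, under simplifying assumptions: the stored index only records each directory's mtime, inode number and subdirectory names (a real locate database would also keep the file names), and the function name `refresh` is made up. A directory is re-read only when its mtime or inode differs from the stored one.

```python
import os


def refresh(path, stored):
    """Walk `path`, re-reading only directories that changed since `stored`.

    `stored` maps dir path -> (mtime_ns, inode, [subdir names]) from the
    previous run. Returns (new_index, directories_actually_reread).
    """
    new_index, reread = {}, []
    stack = [path]
    while stack:
        d = stack.pop()
        st = os.stat(d)
        old = stored.get(d)
        if old and old[0] == st.st_mtime_ns and old[1] == st.st_ino:
            subdirs = old[2]  # unchanged: reuse the stored listing
        else:
            reread.append(d)  # new, modified, or a different inode: read it
            with os.scandir(d) as entries:
                subdirs = sorted(
                    e.name for e in entries if e.is_dir(follow_symlinks=False)
                )
        new_index[d] = (st.st_mtime_ns, st.st_ino, subdirs)
        stack.extend(os.path.join(d, s) for s in subdirs)
    return new_index, reread
```

Note that every directory is still stat()ed in this sketch; only the directory reads are saved.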
-

To: Al Viro <viro@...>
Cc: <mingo@...>, <akpm@...>, <frank@...>, <andi@...>, <nickpiggin@...>, <ray-lk@...>, <jesper.juhl@...>, <ck@...>, <linux-mm@...>, <linux-kernel@...>
Date: Friday, July 27, 2007 - 3:19 pm

Someone mentioned a variant of slocate above that they called mlocate,
and that Red Hat ships, that seems to do this (if I understand you and
what mlocate does correctly.)

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-

To: Al Viro <viro@...>
Cc: Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Frank Kingswood <frank@...>, Andi Kleen <andi@...>, Nick Piggin <nickpiggin@...>, Ray Lee <ray-lk@...>, Jesper Juhl <jesper.juhl@...>, ck list <ck@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Thursday, July 26, 2007 - 8:23 am

That would just save reading the directories. Not sure
it helps that much. Much better would be actually if it didn't stat the
individual files (and force their dentries/inodes in). I bet it does that to
find out if they are directories or not. But in a modern system it could just
check the type in the dirent on file systems that support
that and not do a stat. Then you would get far fewer dentries/inodes.
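As an illustration of using the dirent type instead of a stat, here is a sketch with Python's `os.scandir`, whose `DirEntry.is_dir()` is answered from the directory entry's type field where the file system fills it in. The function name `list_tree` is illustrative.

```python
import os


def list_tree(root):
    """Yield every path under root without an explicit stat() per entry."""
    stack = [root]
    while stack:
        d = stack.pop()
        with os.scandir(d) as entries:
            for e in entries:
                yield e.path
                # Answered from d_type when available; a stat is only
                # needed when the file system reports an unknown type.
                if e.is_dir(follow_symlinks=False):
                    stack.append(e.path)
```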

Also I expect in general the new slub dcache freeing that is pending
will improve things a lot.

But even if updatedb was fixed to be more efficient we probably
still need a general solution for other tree walking programs
that cannot be optimized this way.

-Andi

-

To: Andi Kleen <andi@...>
Cc: Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Frank Kingswood <frank@...>, Nick Piggin <nickpiggin@...>, Ray Lee <ray-lk@...>, Jesper Juhl <jesper.juhl@...>, ck list <ck@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Thursday, July 26, 2007 - 10:59 am

FWIW, find(1) does *not* stat non-directories (and neither would this
approach). So it's just dentries for directories and you can't realistically
skip those. OK, you could - if you had banned cross-directory rename
for directories and propagated "dirty since last look" towards root (note
that it would be a boolean, not a timestamp). Then we could skip unchanged
subtrees completely...
-

To: Al Viro <viro@...>
Cc: Andi Kleen <andi@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Frank Kingswood <frank@...>, Nick Piggin <nickpiggin@...>, Ray Lee <ray-lk@...>, Jesper Juhl <jesper.juhl@...>, ck list <ck@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Wednesday, July 11, 2007 - 4:41 pm

Could we help it a little from kernel and set 'dirty since last look'
on directory renames?

I mean, this is not only updatedb. KDE startup is limited by this,
too. It would be nice to have effective 'what change in tree'
operation.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-

To: Ingo Molnar <mingo@...>
Cc: Frank Kingswood <frank@...>, Andi Kleen <andi@...>, Nick Piggin <nickpiggin@...>, Ray Lee <ray-lk@...>, Jesper Juhl <jesper.juhl@...>, ck list <ck@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Thursday, July 26, 2007 - 6:09 am

Sure, but it's practical, isn't it? Who runs (and cares about)
vfs-intensive workloads during their wee-small-hours updatedb run?

Setting it to zero will maximise the preservation of the vfs caches. You
wanted 10000 there.

<bets that nobody will test this>
-

To: Andrew Morton <akpm@...>
Cc: Frank Kingswood <frank@...>, Andi Kleen <andi@...>, Nick Piggin <nickpiggin@...>, Ray Lee <ray-lk@...>, Jesper Juhl <jesper.juhl@...>, ck list <ck@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Thursday, July 26, 2007 - 6:27 am

there's another side-effect: it likely results in the zapping of
thousands of dentries that were cached nicely before. So we might
exchange 'all my apps are swapped out' experience with 'all file access
is slow'. The latter is _probably_ still an improvement over the
ballooning, but i'm not sure. What we _really_ want is an updatedb that
does not disturb the dcache.

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: Frank Kingswood <frank@...>, Andi Kleen <andi@...>, Nick Piggin <nickpiggin@...>, Ray Lee <ray-lk@...>, Jesper Juhl <jesper.juhl@...>, ck list <ck@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Thursday, July 26, 2007 - 6:38 am

Yup. Nobody has begun to think about preserving dcache/icache across load

Well. Hopefully this time next year you can prep a 16MB container and toss
your updatedb inside that. Maybe set its peak disk bandwidth utilisation
too. However that won't work ;) because I don't think anyone is looking
at containerisation of vfs cache memory yet. Perhaps full-on openvz has it,
dunno.

But updatedb is a special case, because it is so vfs-intensive. For lots
of other workloads (those which use heaps of pagecache), resource
management via containerisation will work nicely.
-

To: Andrew Morton <akpm@...>
Cc: Frank Kingswood <frank@...>, Andi Kleen <andi@...>, Nick Piggin <nickpiggin@...>, Ray Lee <ray-lk@...>, Jesper Juhl <jesper.juhl@...>, ck list <ck@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Thursday, July 26, 2007 - 6:24 am

wrong, it's active on three of my boxes already :) But then again, i
never had these hangover problems. (not really expected with gigs of RAM
anyway)

Ingo

--- /etc/cron.daily/mlocate.cron.orig
+++ /etc/cron.daily/mlocate.cron
@@ -1,4 +1,7 @@
 #!/bin/sh
 nodevs=$(< /proc/filesystems awk '$1 == "nodev" { print $2 }')
 renice +19 -p $$ >/dev/null 2>&1
+PREV=`cat /proc/sys/vm/vfs_cache_pressure 2>/dev/null`
+echo 10000 > /proc/sys/vm/vfs_cache_pressure 2>/dev/null
 /usr/bin/updatedb -f "$nodevs"
+[ "$PREV" != "" ] && echo $PREV > /proc/sys/vm/vfs_cache_pressure 2>/dev/null
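The same save, set and restore pattern can be written as a context manager, which also puts the old value back if the wrapped command raises. A sketch only; the `cache_pressure` name and the `path` parameter are assumptions for illustration, and writing the real sysctl needs root.

```python
from contextlib import contextmanager

SYSCTL = "/proc/sys/vm/vfs_cache_pressure"


@contextmanager
def cache_pressure(value, path=SYSCTL):
    """Temporarily set vfs_cache_pressure, restoring the old value on exit."""
    with open(path) as f:
        previous = f.read().strip()
    with open(path, "w") as f:
        f.write(f"{value}\n")
    try:
        yield previous
    finally:
        with open(path, "w") as f:
            f.write(f"{previous}\n")
```

Typical use would be `with cache_pressure(10000): subprocess.run(["/usr/bin/updatedb"])`.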
-

To: Ingo Molnar <mingo@...>
Cc: Andrew Morton <akpm@...>, Nick Piggin <nickpiggin@...>, Ray Lee <ray-lk@...>, Jesper Juhl <jesper.juhl@...>, <linux-kernel@...>, ck list <ck@...>, <linux-mm@...>, Paul Jackson <pj@...>, Andi Kleen <andi@...>, Frank Kingswood <frank@...>
Date: Thursday, July 26, 2007 - 8:33 pm

[...]

mlocate by design doesn't thrash the cache as much. People using
slocate (distros other than Redhat ;) are going to be hit worse. See
http://carolina.mff.cuni.cz/~trmac/blog/mlocate/

updatedb by itself doesn't really bug me, it's just that on occasion
it's still running at 7am which then doesn't assist my single spindle
come swapin of the other apps! I'm considering getting one of the old
ide drives out in the garage and shifting swap onto it. The swap
prefetch patch has mainly assisted me in the "state A -> B -> A"
scenario. A lot.

--
Matt
-

To: Matthew Hawkins <darthmdh@...>
Cc: Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Nick Piggin <nickpiggin@...>, Ray Lee <ray-lk@...>, Jesper Juhl <jesper.juhl@...>, <linux-kernel@...>, ck list <ck@...>, <linux-mm@...>, Paul Jackson <pj@...>, Andi Kleen <andi@...>, Frank Kingswood <frank@...>
Date: Monday, July 30, 2007 - 5:33 am

You should start it earlier then - assuming it doesn't
already start at the earliest opportunity?

Helge Hafting
-

To: Andrew Morton <akpm@...>
Cc: Ingo Molnar <mingo@...>, Frank Kingswood <frank@...>, Andi Kleen <andi@...>, Nick Piggin <nickpiggin@...>, Ray Lee <ray-lk@...>, Jesper Juhl <jesper.juhl@...>, ck list <ck@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Thursday, July 26, 2007 - 8:46 am

drops caches prior to both updatedb runs.

root@Homer: df -i
Filesystem       Inodes   IUsed    IFree IUse% Mounted on
/dev/hdc3      12500992 1043544 11457448    9% /
udev             129162    1567   127595    2% /dev
/dev/hdc1         26104      87    26017    1% /boot
/dev/hda1        108144   90676    17468   84% /windows/C
/dev/hda5         11136    3389     7747   31% /windows/D
/dev/hda6             0       0        0     - /windows/E

vfs_cache_pressure=10000, updatedb freshly completed:
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  0     48  76348 420356 104748    0    0     0     0 1137  912  3  1 97  0

ext3_inode_cache 315153 316274 524 7 1 : tunables 54 27 8 : slabdata 45182 45182 0
dentry_cache 224829 281358 136 29 1 : tunables 120 60 8 : slabdata 9702 9702 0
buffer_head 156624 159728 56 67 1 : tunables 120 60 8 : slabdata 2384 2384 0

vfs_cache_pressure=100 (stock), updatedb freshly completed:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  0    148  83824 270088 116340    0    0     0     0 1095  330  2  1 97  0

ext3_inode_cache 467257 502495 524 7 1 : tunables 54 27 8 : slabdata 71785 71785 0
dentry_cache 292695 408958 136 29 1 : tunables 120 60 8 : slabdata 14102 14102 0
buffer_head 118329 184384 56 67 1 : tunables 120 60 8 : slabdata 2752 2752 1
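To put the slabinfo lines above in terms of memory, the slab count can be multiplied by pages per slab and the page size. A sketch assuming 4 KiB pages and the slabinfo column layout shown in the output (the function name is illustrative):

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, as on i386


def slab_bytes(line):
    """Return (cache name, bytes held by its slabs) for one slabinfo line."""
    head, _tunables, slabdata = line.split(":")
    name, _active, _total, _objsize, _per_slab, pages_per_slab = head.split()
    # slabdata fields: "slabdata <active_slabs> <num_slabs> <sharedavail>"
    num_slabs = int(slabdata.split()[2])
    return name, num_slabs * int(pages_per_slab) * PAGE_SIZE


if __name__ == "__main__":
    runs = {
        "10000": "ext3_inode_cache 315153 316274 524 7 1 : tunables 54 27 8 : slabdata 45182 45182 0",
        "100": "ext3_inode_cache 467257 502495 524 7 1 : tunables 54 27 8 : slabdata 71785 71785 0",
    }
    for pressure, line in runs.items():
        name, size = slab_bytes(line)
        print(f"vfs_cache_pressure={pressure}: {name} {size / 2**20:.0f} MiB")
```

For ext3_inode_cache this works out to about 176 MiB at 10000 versus about 280 MiB at the stock 100.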

Note: updatedb doesn't bother my box, not running enough leaky apps I
guess.

-Mike

-

To: Mike Galbraith <efault@...>
Cc: Ingo Molnar <mingo@...>, Frank Kingswood <frank@...>, Andi Kleen <andi@...>, Nick Piggin <nickpiggin@...>, Ray Lee <ray-lk@...>, Jesper Juhl <jesper.juhl@...>, ck list <ck@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Thursday, July 26, 2007 - 2:05 pm

I think that was the wrong thing to do. That will leave gobs of free
memory for updatedb to populate with dentries and inodes.

Instead, fill all of memory up with pagecache, then do the updatedb. See

So you ended up with a couple hundred MB of pagecache preserved.

Capturing before-and-after /proc/meminfo would be nice - it's a useful
summary.

-

To: Andrew Morton <akpm@...>
Cc: Ingo Molnar <mingo@...>, Frank Kingswood <frank@...>, Andi Kleen <andi@...>, Nick Piggin <nickpiggin@...>, Ray Lee <ray-lk@...>, Jesper Juhl <jesper.juhl@...>, ck list <ck@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Friday, July 27, 2007 - 1:12 am

Yeah. Before these two runs just to see what difference there was in
caches with those two settings, I tried running with a heavier than
normal (for me) desktop application mix, to see if it would start
swapping, but it didn't. Seems that 1GB ram is enough space for
everything I do, and everything updatedb does as well. You need a
larger working set to feel the pain I guess.

-Mike

-

To: Andrew Morton <akpm@...>
Cc: Ingo Molnar <mingo@...>, Frank Kingswood <frank@...>, Andi Kleen <andi@...>, Nick Piggin <nickpiggin@...>, Ray Lee <ray-lk@...>, Jesper Juhl <jesper.juhl@...>, ck list <ck@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Friday, July 27, 2007 - 3:23 am

I didn't _fill_ memory, but loaded it up a bit with some real workload
data...

I tried time sh -c 'git diff v2.6.11 HEAD > /dev/null' to populate the
cache, and tried different values for vfs_cache_pressure. Nothing
prevented git's data from being trashed by updatedb. Turning the knob
downward rapidly became very unpleasant due to swap, (with 0 not
surprisingly being a true horror) but turning it up didn't help git one
bit. The amount of data that had to be re-read with stock 100 or 10000
was the same, or at least so close that you couldn't see a difference in
vmstat and wall-clock. Cache sizes varied, but the bottom line didn't.
(wasn't surprised, seems quite reasonable that git's data looks old and
useless to the reclaim logic when updatedb runs in between git runs)

-Mike

-

To: Mike Galbraith <efault@...>
Cc: Ingo Molnar <mingo@...>, Frank Kingswood <frank@...>, Andi Kleen <andi@...>, Nick Piggin <nickpiggin@...>, Ray Lee <ray-lk@...>, Jesper Juhl <jesper.juhl@...>, ck list <ck@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Friday, July 27, 2007 - 4:47 am

Did a bit of playing with this with 128MB of memory.

- drop caches
- read a 1MB file
- run slocate.cron

With vfs_cache_pressure=100:

MemTotal: 116316 kB
MemFree: 3196 kB
Buffers: 54408 kB
Cached: 5128 kB
SwapCached: 0 kB
Active: 41728 kB
Inactive: 27540 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 116316 kB
LowFree: 3196 kB
SwapTotal: 1020116 kB
SwapFree: 1019496 kB
Dirty: 0 kB
Writeback: 0 kB
AnonPages: 9760 kB
Mapped: 3808 kB
Slab: 40468 kB
SReclaimable: 34824 kB
SUnreclaim: 5644 kB
PageTables: 720 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
CommitLimit: 1078272 kB
Committed_AS: 25988 kB
VmallocTotal: 901112 kB
VmallocUsed: 656 kB
VmallocChunk: 900412 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
Hugepagesize: 4096 kB

With vfs_cache_pressure=10000:

MemTotal: 116316 kB
MemFree: 3060 kB
Buffers: 80792 kB
Cached: 5052 kB
SwapCached: 0 kB
Active: 59432 kB
Inactive: 36140 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 116316 kB
LowFree: 3060 kB
SwapTotal: 1020116 kB
SwapFree: 1019512 kB
Dirty: 0 kB
Writeback: 0 kB
AnonPages: 9756 kB
Mapped: 3832 kB
Slab: 14304 kB
SReclaimable: 7992 kB
SUnreclaim: 6312 kB
PageTables: 732 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
CommitLimit: 1078272 kB
Committed_AS: 26000 kB
VmallocTotal: 901112 kB
VmallocUsed: 656 kB
VmallocChunk: 900412 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
Hugepagesize: 4096 kB

so we reaped quite a lot more slab with the higher vfs_cache_pressure.
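A quick way to see which fields moved between the two /proc/meminfo dumps above is to parse both and keep the large differences. A sketch with illustrative function names and an arbitrary 1 MB threshold, using a few fields copied from the two dumps:

```python
def parse_meminfo(text):
    """Map field name -> integer value (kB for most fields)."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if rest.strip():
            fields[name.strip()] = int(rest.split()[0])
    return fields


def changed(before, after, threshold_kb=1024):
    """Return {field: after - before} for fields that moved by >= threshold_kb."""
    return {
        k: after[k] - before[k]
        for k in before
        if k in after and abs(after[k] - before[k]) >= threshold_kb
    }


if __name__ == "__main__":
    stock = parse_meminfo("Buffers: 54408 kB\nCached: 5128 kB\nSlab: 40468 kB\nSReclaimable: 34824 kB")
    high = parse_meminfo("Buffers: 80792 kB\nCached: 5052 kB\nSlab: 14304 kB\nSReclaimable: 7992 kB")
    for field, delta in changed(stock, high).items():
        print(f"{field}: {delta:+d} kB")
```

On these numbers Slab shrinks by about 26 MB at the higher setting while Buffers grows by about the same amount.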

What I think is killing us here is the blockdev pagecache: the pagecache
which ...

To: Andrew Morton <akpm@...>
Cc: Mike Galbraith <efault@...>, Ingo Molnar <mingo@...>, Frank Kingswood <frank@...>, Andi Kleen <andi@...>, Nick Piggin <nickpiggin@...>, Ray Lee <ray-lk@...>, Jesper Juhl <jesper.juhl@...>, ck list <ck@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Saturday, July 28, 2007 - 9:33 pm

Good idea for updatedb.

However, it may be a bad idea for files that are often
written to. Turning an inode write into a read plus a
write does not sound like such a hot idea, we really
want to keep those in the cache.

I think what you need is to ignore multiple references
to the same page when they all happen in one time
interval, counting them only if they happen in multiple
time intervals.
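
A toy sketch of that interval idea, not kernel code: a page's reference count goes up at most once per time interval, however many touches land inside it. The class name, the page label and the 10-second interval are all assumptions made for illustration.

```python
class IntervalRefCounter:
    """Toy model: count at most one reference per page per time interval."""

    def __init__(self, interval_seconds):
        self.interval = interval_seconds
        self.last_slot = {}  # page id -> index of the last interval counted
        self.refs = {}       # page id -> number of distinct intervals seen

    def touch(self, page, now):
        slot = int(now // self.interval)
        if self.last_slot.get(page) != slot:
            # First touch in this interval: count it.
            self.last_slot[page] = slot
            self.refs[page] = self.refs.get(page, 0) + 1
        return self.refs[page]


counter = IntervalRefCounter(interval_seconds=10.0)

# A burst of accesses to one inode block inside one interval counts once.
for t in (0.1, 0.2, 0.3, 0.4):
    counter.touch("inode-block-7", now=t)
burst_refs = counter.refs["inode-block-7"]

# A later access in a different interval counts as a second reference.
counter.touch("inode-block-7", now=25.0)
spread_refs = counter.refs["inode-block-7"]

print(burst_refs, spread_refs)
```

Under this model an updatedb-style sweep over adjacent inodes looks like a single use, while a block that is revisited over time still accumulates references.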

The use-once cleanup (which takes a page flag for PG_new,
I know...) would solve that problem.

However, it would introduce the problem of having to scan
all the pages on the list before a page becomes freeable.
We would have to add some background scanning (or a separate
list for PG_new pages) to make the initial pageout run use
an acceptable amount of CPU time.

Not sure that complexity will be worth it...

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
-

To: Rik van Riel <riel@...>
Cc: Mike Galbraith <efault@...>, Ingo Molnar <mingo@...>, Frank Kingswood <frank@...>, Andi Kleen <andi@...>, Nick Piggin <nickpiggin@...>, Ray Lee <ray-lk@...>, Jesper Juhl <jesper.juhl@...>, ck list <ck@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Saturday, July 28, 2007 - 11:39 pm

Remember that this problem applies to both inode blocks and to directory
blocks. Yes, it might be useful to hold onto an inode block for a future

Yes, the sudden burst of accesses for adjacent inode/dirents will be a
common pattern, and it'd make heaps of sense to treat that as a single
touch. It'd have to be done in the fs I guess, and it might be a bit hard
to do. And it turns out that embedding the touch_buffer() all the way down
in __find_get_block() was convenient, but it's going to be tricky to
change.

For now I'm fairly inclined to just nuke the touch_buffer() on the read side
and maybe add one on the modification codepaths and see what happens.

I suspect that the situation we have now is so bad that pretty much
anything we do will be an improvement. I've always wondered "ytf is there
so much blockdev pagecache?"

This machine I'm typing at:

MemTotal: 3975080 kB
MemFree: 750400 kB
Buffers: 547736 kB
Cached: 1299532 kB
SwapCached: 12772 kB
Active: 1789864 kB
Inactive: 861420 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 3975080 kB
LowFree: 750400 kB
SwapTotal: 4875716 kB
SwapFree: 4715660 kB
Dirty: 76 kB
Writeback: 0 kB
Mapped: 638036 kB
Slab: 522724 kB
CommitLimit: 6863256 kB
Committed_AS: 1115632 kB
PageTables: 14452 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 36432 kB
VmallocChunk: 34359696379 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
Hugepagesize: 2048 kB

More than a quarter of my RAM in fs metadata! Most of it I'll bet is on the
active list. And the fs on which I do most of the work is mounted
noatime..
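
The "quarter of my RAM" figure can be checked roughly from the dump above. This sketch assumes that "fs metadata" means Buffers plus Slab; the message does not say exactly which fields were meant.

```python
# Figures (kB) copied from the /proc/meminfo dump above.
mem_total = 3975080
buffers = 547736   # blockdev pagecache
slab = 522724      # includes the inode and dentry caches

# Assumption: "fs metadata" is approximated as Buffers + Slab.
metadata_kb = buffers + slab
fraction = metadata_kb / mem_total
print(f"{metadata_kb} kB of {mem_total} kB = {fraction:.1%}")
```

With that reading the total comes to a little under 27% of MemTotal.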

-

To: Andrew Morton <akpm@...>
Cc: Mike Galbraith <efault@...>, Ingo Molnar <mingo@...>, Frank Kingswood <frank@...>, Andi Kleen <andi@...>, Nick Piggin <nickpiggin@...>, Ray Lee <ray-lk@...>, Jesper Juhl <jesper.juhl@...>, ck list <ck@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Friday, July 27, 2007 - 4:54 am

I wonder what happens if you try that on ext2. There we'd get directory
contents in per-directory page cache, so the picture might change...
-

To: Al Viro <viro@...>
Cc: Mike Galbraith <efault@...>, Ingo Molnar <mingo@...>, Frank Kingswood <frank@...>, Andi Kleen <andi@...>, Nick Piggin <nickpiggin@...>, Ray Lee <ray-lk@...>, Jesper Juhl <jesper.juhl@...>, ck list <ck@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Friday, July 27, 2007 - 5:02 am

afaict ext2 just forgets to run mark_page_accessed for directory pages
altogether, so it'll be equivalent to ext3 with that one-liner, I expect.

The directory pagecache on ext2 might get reclaimed faster because those
pages are eligible for reclaiming via the reclaim of their inodes, whereas
ext3's directories are in blockdev pagecache, for which the reclaim-via-inode
mechanism cannot happen.

I should do some testing with mmapped files.
-

To: Mike Galbraith <efault@...>, Ingo Molnar <mingo@...>, Frank Kingswood <frank@...>, Andi Kleen <andi@...>, Nick Piggin <nickpiggin@...>, Ray Lee <ray-lk@...>, Jesper Juhl <jesper.juhl@...>, ck list <ck@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Friday, July 27, 2007 - 6:00 am

So much for that theory. afaict mmapped, active pagecache is immune to
updatedb activity. It just sits there while updatedb continues munching
away at the slab and blockdev pagecache which it instantiated. I assume
we're never getting the VM into enough trouble to tip it over the
start-reclaiming-mapped-pages threshold (ie: /proc/sys/vm/swappiness).

Start the updatedb on this 128MB machine with 80MB of mapped pagecache, it
falls to 55MB fairly soon and then never changes.

So hrm. Are we sure that updatedb is the problem? There are quite a few
heavyweight things which happen in the wee small hours.

-

To: Andrew Morton <akpm@...>
Cc: Ingo Molnar <mingo@...>, Frank Kingswood <frank@...>, Andi Kleen <andi@...>, Nick Piggin <nickpiggin@...>, Ray Lee <ray-lk@...>, Jesper Juhl <jesper.juhl@...>, ck list <ck@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Friday, July 27, 2007 - 6:25 am

The balance in _my_ world seems just fine. I don't let any of those
system maintenance things run while I'm using the system, and it doesn't
bother me if my working set has to be reconstructed after heavy-weight
maintenance things are allowed to run. I'm not seeing anything I
wouldn't expect to see when running a job the size of updatedb.

-Mike

-

To: Mike Galbraith <efault@...>
Cc: Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>, Frank Kingswood <frank@...>, Andi Kleen <andi@...>, Nick Piggin <nickpiggin@...>, Ray Lee <ray-lk@...>, Jesper Juhl <jesper.juhl@...>, ck list <ck@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Friday, July 27, 2007 - 1:45 pm

Do you realize you've totally missed the point?

It isn't about what is fine in the Kernel Developers world, but what is fine
in the *USERS* world.

There are dozens of big businesses pushing Linux for Enterprise performance.
Rather than discussing the merit of those patches - some of which just
improve the performance of a specific application by 1 or 2 percent - they
get a nod and go into the kernel. But when a group of users that don't
represent one of those businesses says "Hey, this helps with problems I see
on my system" there is a big discussion and ultimately those patches get
rejected. Why? Because they'll give an example using a program that they see
causing part of the problem and be told "Use program X - it does things
differently and shouldn't cause the problem" or "But what causes the problem
to happen? The patch treats a symptom of a larger problem".

The fucked up part of that is that the (mass of) kernel developers will see a
similar report saying "mySQL has a performance problem because of X, this
fixes it" and not blink twice - even if it is "treating the symptom and not
the cause". It's this attitude more than anything that caused Con
to "retire" - at least that is the impression I got from the interviews he's
given. (The exact impression was "I'm sick of the kernel developers doing
everything they can to help enterprise users and ignoring the home users")

So...
The problem:
Updatedb or another process that uses the FS heavily runs on a user's 256MB
P3-800 (when it is idle) and the VFS caches grow, causing memory pressure
that causes other applications to be swapped to disk. In the morning the user
has to wait for the system to swap those applications back in.

Questions about it:
Q) Does swap-prefetch help with this?
A) [From all reports I've seen (*)] Yes, it does.

Q) Why does it help?
A) Because it pro-actively swaps stuff back-in when the memory pressure that
caused it to be swapped out is gone.

Q) What causes the problem...

To: Daniel Hazelton <dhazelton@...>
Cc: Mike Galbraith <efault@...>, Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>, Frank Kingswood <frank@...>, Andi Kleen <andi@...>, Nick Piggin <nickpiggin@...>, Ray Lee <ray-lk@...>, Jesper Juhl <jesper.juhl@...>, ck list <ck@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Friday, July 27, 2007 - 2:16 pm

No it does not. If updatedb filled memory to the point of causing swapping
(which no one is reproducing anyway) it HAS FILLED MEMORY and swap-prefetch
hasn't any memory to prefetch into -- updatedb itself doesn't use any
significant memory.

Here's swap-prefetch's author saying the same:

http://lkml.org/lkml/2007/2/9/112

| It can't help the updatedb scenario. Updatedb leaves the ram full and
| swap prefetch wants to cost as little as possible so it will never
| move anything out of ram in preference for the pages it wants to swap
| back in.

Now please finally either understand this, or tell us how we're wrong.

Rene.

-

To: Rene Herman <rene.herman@...>
Cc: Daniel Hazelton <dhazelton@...>, Mike Galbraith <efault@...>, Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>, Frank Kingswood <frank@...>, Andi Kleen <andi@...>, Nick Piggin <nickpiggin@...>, Ray Lee <ray-lk@...>, Jesper Juhl <jesper.juhl@...>, ck list <ck@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Friday, July 27, 2007 - 3:43 pm

however there are other programs which are known to take up significant
amounts of memory and will cause the issue being described (openoffice for
example)

please don't get hung up on the text 'updatedb' and accept that there are
programs that do run intermittently and do use a significant amount of ram
and then free it.

-

To: <david@...>
Cc: Daniel Hazelton <dhazelton@...>, Mike Galbraith <efault@...>, Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>, Frank Kingswood <frank@...>, Andi Kleen <andi@...>, Nick Piggin <nickpiggin@...>, Ray Lee <ray-lk@...>, Jesper Juhl <jesper.juhl@...>, ck list <ck@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Saturday, July 28, 2007 - 3:19 am

Different issue. One that's worth pursuing perhaps, but a different issue
from the VFS caches issue that people have been trying to track down.

Rene.
-

To: Rene Herman <rene.herman@...>
Cc: Daniel Hazelton <dhazelton@...>, Mike Galbraith <efault@...>, Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>, Frank Kingswood <frank@...>, Andi Kleen <andi@...>, Nick Piggin <nickpiggin@...>, Ray Lee <ray-lk@...>, Jesper Juhl <jesper.juhl@...>, ck list <ck@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Saturday, July 28, 2007 - 4:55 am

people are trying to track down the problem of their machine being slow
until enough data is swapped back in to operate normally.

in some situations swap prefetch can help because something that used
memory freed it so there is free memory that could be filled with data
(which is something that Linux does aggressively in most other situations)

in some other situations swap prefetch cannot help because useless data is
getting cached at the expense of useful data.

nobody is arguing that swap prefetch helps in the second case.

what people are arguing is that there are situations where it helps for
the first case. on some machines and versions of updatedb the nightly run of
updatedb can cause both sets of problems. but the nightly updatedb run is
not the only thing that can cause problems

but let's talk about the concept here for a little bit

the design is to use CPU and I/O capacity that's otherwise idle to fill
free memory with data from swap.

pro:
more ram has potentially useful data in it

con:
it takes a little extra effort to give this memory to another app (the
page must be removed from the list and zeroed at the time it's needed, I
assume that the data is left in swap so that it doesn't have to be written
out again)

it adds some complexity to the kernel (~500 lines IIRC from this thread)

by undoing recent swapouts it can potentially mask problems with swapout

it looks to me like unless the code was really bad (and after 23 months in
-mm it doesn't sound like it is) that the only significant con left is the
potential to mask other problems.
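
The two situations described above can be put into a toy model. This is a sketch of the concept only, not the swap-prefetch patch: the function name, the page labels and the page counts are made up for illustration.

```python
def prefetch_when_idle(free_pages, swapped_out, system_idle):
    """Toy model of the idea discussed above, not the actual patch.

    Returns (pages_prefetched, free_pages_left). Prefetch only runs when
    the system is idle and never evicts anything: it can only use memory
    that is already free.
    """
    if not system_idle or free_pages <= 0:
        return [], free_pages
    take = min(free_pages, len(swapped_out))
    return swapped_out[:take], free_pages - take


swapped = ["firefox-1", "firefox-2", "firefox-3"]

# Case 1: a memory hog exited, leaving free RAM and an idle system.
helped, left = prefetch_when_idle(free_pages=8, swapped_out=swapped,
                                  system_idle=True)

# Case 2: RAM is still full of cache, so there is nothing to prefetch into.
not_helped, _ = prefetch_when_idle(free_pages=0, swapped_out=swapped,
                                   system_idle=True)

print(helped, left, not_helped)
```

In the first case everything comes back before the user asks for it; in the second nothing is prefetched, which matches the point that prefetch cannot help while useless data fills the RAM.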

however there are many legitimate cases where it is definitely doing the
right thing (swapout was correct in pushing out the pages, but now the
cause of that pressure is gone). the amount of benefit from this will vary
from situation to situation, but it's not reasonable to claim that this
provides no benefit (you have benchmark numbers that show it in synthetic
benchmarks, and you have u...

To: <david@...>
Cc: Daniel Hazelton <dhazelton@...>, Mike Galbraith <efault@...>, Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>, Frank Kingswood <frank@...>, Andi Kleen <andi@...>, Nick Piggin <nickpiggin@...>, Ray Lee <ray-lk@...>, Jesper Juhl <jesper.juhl@...>, ck list <ck@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Saturday, July 28, 2007 - 6:11 am

Oh yes they are. Daniel for example did twice, telling me to turn my brain
on in between (if you read it, you may have noticed I got a little annoyed

Not to sound pretentious or anything but I assume that Andrew has a fairly
good overview of exactly how broken -mm can be at times. How many -mm users
use it anyway? He himself said he's not convinced of usefulness having not
seen it help for him (and notice that most developers are also users),
turned it off due to it annoying him at some point and hasn't seen a serious

Which is not a madeup issue, mind you. As an example, I just now tried GNU
locate and saw it's a complete pig and specifically unsuitable for the low
memory boxes under discussion. Upon completion, it actually frees enough
memory that swap-prefetch _could_ help on some boxes, while the real issue

I certainly would not want to argue anything of the sort no. As said a few
times, I agree that swap-prefetch makes sense and has at least the potential
to help some situations that you really wouldn't even want to try and fix any

Well, _that_ is what the kernel is already going to great lengths at doing,
and it decided that those pages us poor overnight OO.o users want in in the
morning weren't reasonable guesses. The kernel also won't any time soon be
reading our minds, so any solution would need either user intervention (we
could devise a way to tell the kernel "hey ho, I consider these pages to be
very important -- try not to swap them out" possibly even with a "and if you
do, please pull them back in when possible") or we can let swap-prefetch do
the "just in case" thing it is doing.

While swap-prefetch may not be the be all end all of solutions I agree that
having a machine sit around with free memory and applications in swap seems
not too useful if (as is the case) fetched pages can be dropped immediately
when it turns out swap-prefetch made the wrong decision.

So that's for the concept. As to implementation, if I try and look at the
c...

To: Rene Herman <rene.herman@...>
Cc: Daniel Hazelton <dhazelton@...>, Mike Galbraith <efault@...>, Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>, Frank Kingswood <frank@...>, Andi Kleen <andi@...>, Nick Piggin <nickpiggin@...>, Ray Lee <ray-lk@...>, Jesper Juhl <jesper.juhl@...>, ck list <ck@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Saturday, July 28, 2007 - 5:00 pm

if that was the case then people should be responding to the request to
get it merged with 'but it caused problems for me when I tried it'

I see the conclusion as being exactly the opposite.

here is a workload with some badly designed userspace software that the
kernel can make much more pleasant for users.

arguing that users should never use badly designed software in userspace
doesn't seem like an argument that will gain much traction. I'm not saying
the kernel needs to fix the software itself (ala the sched_yield issues),
but the kernel should try and keep such software from hurting the rest of
the system where it can.

in this case it can't help it while the bad software is running, but it

so there is a legitimate situation where swap-prefetch will help
significantly, what is the downside that prevents it from being included?
(reading this thread it sometimes seems like the downside is that updatedb
shouldn't cause this problem and so if you fixed updatedb there would be no
legitimate benefit, or alternately this patch doesn't help updatedb so

it's not that they shouldn't have been swapped out (they should have

I've seen it mentioned that there is still a maintainer but I missed who
it is, but I haven't seen any concerns that can be addressed, they all
seem to be 'this is a core concept, people need to think about it' or 'but
someone may find a better answer in the future' type of things. it's
impossible to address these concerns directly.

David Lang
-

To: <david@...>
Cc: Daniel Hazelton <dhazelton@...>, Mike Galbraith <efault@...>, Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>, Frank Kingswood <frank@...>, Andi Kleen <andi@...>, Nick Piggin <nickpiggin@...>, Ray Lee <ray-lk@...>, Jesper Juhl <jesper.juhl@...>, ck list <ck@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Sunday, July 29, 2007 - 6:09 am

So you're saying Andrew did not say that? You're jumping to the conclusion

And now you do it again :-) There is no conclusion -- just the inescapable
observation that swap-prefetch was (or may have been) masking the problem of

People being unconvinced it helps all that much, no serious investigation
into possible downsides and no consideration of alternatives are three I've
personally heard.

You don't want to merge a conceptually core VM feature if you're not really
convinced. It's not a part of the kernel you can throw a feature into like
you could some driver saying "ah, heck, if it makes someone happy" since
everything in the VM ends up interacting -- that in fact is actually the
hard part of VM as far as I've seen it.

And in this situation the proposed feature is something that "papers over a
problem" by design -- where it could certainly be that the problem is not
solvable in another way simply due to the kernel not growing the possibility
to read user's minds anytime soon (which some might even like to rephrase as
"due to no problem existing") but that this gets people a bit anxious is not

So do it indirectly. But please don't just say "it helps some people (not me
mind you!) so merge it and if you don't it's all just politics and we can't
do anything about it anyway". Because that's mostly what I've been hearing.

And no, I'm not subscribed to any ck mailinglists nor do I hang around its
IRC community which can account for part of that. I expect though that
the same holds for the people that actually matter in this, such as Andrew
Morton and Nick Piggin.

-- 1: people being unconvinced it helps all that much

At least partly caused by the updatedb i/dcache red herring that infected
this issue. Also, at the point VM pressure has mounted high enough to cause
enough to be swapped out to give you a bad experience, a lot of other things
have been dropped already as well.

It's unsurprising though that it would for example help the issue of
...

To: Rene Herman <rene.herman@...>
Cc: Daniel Hazelton <dhazelton@...>, Mike Galbraith <efault@...>, Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>, Frank Kingswood <frank@...>, Andi Kleen <andi@...>, Nick Piggin <nickpiggin@...>, Ray Lee <ray-lk@...>, Jesper Juhl <jesper.juhl@...>, ck list <ck@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Sunday, July 29, 2007 - 7:41 am

I don't remember anyone saying that it actually caused problems (including
both you and andrew). I (and others) have been trying to learn what
problems people believe it has in the hope that they can be addressed one

isn't your conclusion then that if people just stopped using that version
of updatedb the problem would be solved and there would be no need for the
swap prefetch patch? that seemed to be what you were strongly implying (if

people who have lots of memory and so don't use swap will never see the
benefit of this patch. over the years many people have investigated the
problem and tried to address it in other ways (the better version of
updatedb is an attempt to fix it for that program as an example), but
there is still a problem.

I agree that tinkering with the core VM code should not be done lightly,
but this has been put through the proper process and is stalled with no

forget the nightly cron jobs for the moment. think of this scenario. you
have your memory fairly full with apps that you have open (including
firefox with many tabs), you receive a spreadsheet you need to look at, so
you fire up openoffice to look at it. then you exit openoffice and try to
go back to firefox (after a pause while you walk to the printer to get
the printout of the spreadsheet), only to find that it's going to be
sluggish because it got swapped out due to the pressure from openoffice.

no nightly cron job needed, just enough of a memory hog or a small enough

larger swap granularity may help, but waiting for the user to need the ram
and have to wait for it to be read back in is always going to be worse for
the user than pre-populating the free memory (for the case where the
pre-population is right, for other cases it's the same). so I see this as

there are fully legitimate situations where this is useful, the 'papering
over' effect is not referring to these, it's referring to other possible
problems in the future. I see this argument as being in the same c...

To: <david@...>
Cc: Daniel Hazelton <dhazelton@...>, Mike Galbraith <efault@...>, Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>, Frank Kingswood <frank@...>, Andi Kleen <andi@...>, Nick Piggin <nickpiggin@...>, Ray Lee <ray-lk@...>, Jesper Juhl <jesper.juhl@...>, ck list <ck@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Sunday, July 29, 2007 - 10:01 am

No. What I said outright, every single time, is that swap-prefetch in itself
seems to make sense. And specifically that even if the _direct_ problem is a
crummy program, it _still_ makes sense generally. Every single time.

But see -- you failed to notice this because you guys are stuck in this dumb
adversary "us against them" thing so inherent of (online) communities, where
you sit around your own habitats patting each other on the back for extended
periods of time and then every once in a while go out clinging on to each other
vigorously and going "boo! hiss!" at the big bad outside world.

I already got overly violent at one point in this thread so I'll leave out
any further references to sense-deprived fanboy-culture but please, I said
every single time that I'm not against swap-prefetch. I cannot communicate

It has not. Concerns that were raised (by specifically Nick Piggin) weren't

And swinging a dead rat from its tail facing east-wards while reciting
Documentation/CodingStyle.

Okay, very very sorry, that was particularly childish, but that "walking to
the printer" is of course completely constructed and this _is_ something to
take into account. Swap-prefetch wants to be free, which (also again) it is
doing a good job at it seems, but this also means that it waits for the VM
to be _very_ idle before it does anything and as such, we cannot just forget
the "nightly" scenario and pretend it's about something else entirely. As
long as the machine's being used, swap-prefetch doesn't kick in.

Which is a good feature for swap-prefetch, but also something that needs to
be weighed alongside its other features in a discussion of alternatives, where
for example something like a larger swap granularity would not have anything
of the sort to take into account. If it were about walks to the printer, we

Arjan van de Ven made another point here about seeking away due to
swap-prefetch (just) before the next request comes in, but that's probably a

I saw Chris Snook m...

To: Rene Herman <rene.herman@...>
Cc: Daniel Hazelton <dhazelton@...>, Mike Galbraith <efault@...>, Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>, Frank Kingswood <frank@...>, Andi Kleen <andi@...>, Nick Piggin <nickpiggin@...>, Ray Lee <ray-lk@...>, Jesper Juhl <jesper.juhl@...>, ck list <ck@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Sunday, July 29, 2007 - 5:19 pm

I may have missed them, but what I saw from him weren't specific issues,

yes it was contrived for simplicity.

the same effect would happen if instead of going back to firefox the user
instead went to their e-mail software and read some mail. doing so should

how long does the machine need to be idle? if someone spends 30 seconds
reading an e-mail that's an incredibly long time for the system and I

swapin will always require disk access, and avoiding doing disk access
while the user is waiting for it by doing it when the system isn't using
the disk will always be a win (possibly not as large of a win, but still a
win) on slow laptop drives where you may only get 20MB/second of reads
under optimal situations it doesn't take much reading to be noticed by the

and these things do not conflict with prefetch, they complement it.

improved use-once will avoid pushing things out to swap in the first
place. this will help during normal workloads so is valuable in any case.

better swapin (I assume you are talking about things like larger swap
granularity) will also help during normal workloads when you are thrashing
into swap.

prefetch will help when you have pushed things out to swap and now have
free memory and a momentarily idle system.

David Lang
-

To: <david@...>
Cc: Rene Herman <rene.herman@...>, Daniel Hazelton <dhazelton@...>, Mike Galbraith <efault@...>, Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>, Frank Kingswood <frank@...>, Andi Kleen <andi@...>, Ray Lee <ray-lk@...>, Jesper Juhl <jesper.juhl@...>, ck list <ck@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Sunday, August 5, 2007 - 10:14 pm

Something better, ie. the problems with page reclaim being fixed.
Why is that nebulous?

--
SUSE Labs, Novell Inc.
-

To: Nick Piggin <nickpiggin@...>
Cc: Rene Herman <rene.herman@...>, Daniel Hazelton <dhazelton@...>, Mike Galbraith <efault@...>, Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>, Frank Kingswood <frank@...>, Andi Kleen <andi@...>, Ray Lee <ray-lk@...>, Jesper Juhl <jesper.juhl@...>, ck list <ck@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Sunday, August 5, 2007 - 10:22 pm

because that doesn't begin to address all the benefits.

the approach of fixing page reclaim and updatedb is pretending that if you
only do everything right pages won't get pushed to swap in the first
place, and therefore swap prefetch won't be needed.

this completely ignores the use case where the swapping was exactly the
right thing to do, but memory has been freed up from a program exiting so
that you could now fill that empty ram with data that was swapped out.

David Lang
-

To: <david@...>
Cc: Rene Herman <rene.herman@...>, Daniel Hazelton <dhazelton@...>, Mike Galbraith <efault@...>, Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>, Frank Kingswood <frank@...>, Andi Kleen <andi@...>, Ray Lee <ray-lk@...>, Jesper Juhl <jesper.juhl@...>, ck list <ck@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Monday, August 6, 2007 - 5:21 am

What do you mean "address the benefits"? What I want

You should read what I wrote.

Anyway, the fact of the matter is that there are still
fairly significant problems with page reclaim in this
workload which I would like to see fixed.

I personally still think some of the low hanging fruit
*might* be better fixed before swap prefetch gets
merged, but I've repeatedly said I'm sick of getting
dragged back into the whole debate so I'm happy with
whatever Andrew decides to do with it.

I think it is sad to turn it off for laptops, if it
really makes the "desktop" experience so much better.
Surely for _most_ workloads we should be able to

Yeah. However, merging patches (especially when
changing heuristics, especially in page reclaim) is
not about just thinking up a use-case that it works
well for and telling people that they're putting their
heads in the sand if they say anything against it.
Read this thread and you'll find other examples of
patches that have been around for as long or longer
and also have some good use-cases and also have not
been merged.

-

To: Nick Piggin <nickpiggin@...>, Andrew Morton <akpm@...>
Cc: <david@...>, Rene Herman <rene.herman@...>, Daniel Hazelton <dhazelton@...>, Mike Galbraith <efault@...>, Ingo Molnar <mingo@...>, Frank Kingswood <frank@...>, Andi Kleen <andi@...>, Ray Lee <ray-lk@...>, Jesper Juhl <jesper.juhl@...>, ck list <ck@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Monday, August 6, 2007 - 5:55 am

On 8/6/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote:

What do you think Andrew?
Swap prefetch is not the panacea, it's not going to solve all the
problems but it seems to improve the "desktop experience" and it has
been discussed and reviewed a lot (it has even been discussed more
than it should have been).

Are you going to push the patch upstream?

Ciao,
--
Paolo
http://paolo.ciarrocchi.googlepages.com/
-

To: Rene Herman <rene.herman@...>
Cc: <david@...>, Daniel Hazelton <dhazelton@...>, Mike Galbraith <efault@...>, Andrew Morton <akpm@...>, Ingo Molnar <mingo@...>, Frank Kingswood <frank@...>, Andi Kleen <andi@...>, Nick Piggin <nickpiggin@...>, Ray Lee <ray-lk@...>, Jesper Juhl <jesper.juhl@...>, ck list <ck@...>, Paul Jackson <pj@...>, <linux-mm@...>, <linux-kernel@...>
Date: Saturday, July 28, 2007 - 7:21 am

> It is. Prefetched pages can be dropped on the floor without additional I/O.

Which is essentially free for most cases. In addition your disk access
may well have been in idle time (and should be for this sort of stuff)
and if it was in the same chunk as something nearby was effectively free
anyway.

Actual physical disk ops are a precious resource and anything that mostly
reduces the number will be a win - not to say swap prefetch is the right
answer but accidentally or otherwise there are good reasons it may happen
to help.

Bigger more linear chunks of writeout/readin is much more important I

I've been using it for months with no noticed problem. I turn it on
because it might as well get tested. I've not done comparison tests so I
can't comment on if it's worth it.

Lots of -mm testers turn *everything* on because it's a test kernel.

-