Linux: 2.5.39+ VM to go "non blocking"

Submitted by Anonymous
on September 23, 2002 - 6:13am

Linus has started integrating some of Andrew Morton's mm patches into his BK repository as usual - what is interesting about this round of merges is the inclusion of Andrew's most radical changes - a non-blocking page writeback system. The changelogs are very detailed and interesting so I won't explain further.

The goal of this system is mainly to improve multithreaded I/O performance, and it is very exciting, as it looks like akpm & co will push 2.6 (close) to the top of the VM heap!
From the BK changelog:

*******************************************************
Example: with `mem=512m', running 4 instances of
`dbench 100', 2.5.34 took 35 minutes to compile a kernel.
With this patch, it took three minutes, 45 seconds.
*******************************************************
*******************************************************
This code can keep sixty spindles saturated - we've never
been able to do that before.
*******************************************************

Another significant akpm mm patch which should get merged soon: read-latency.patch

*******************************************************
On IDE it provides a 100x improvement in read throughput
when there is heavy writeback happening.  40x on SCSI.
*******************************************************

... which is important because ...

*******************************************************
This problem has become much more severe lately because the
VM is now capable of keeping the write queue full all the
time, without stumbling over its own feet.  (In fact it can
keep 10 queues saturated, and most likely 100).
*******************************************************

Obvious, and perhaps stupid a

Anonymous
on
September 23, 2002 - 8:27pm

Obvious, and perhaps stupid and naive question: So what happens when the VM decides it wants one of those pages back that's in the write queue? Is it able to just pull it back off the queue then?

I'm used to dealing with hardware caches, and read misses are often made more expensive by deep write queues. Either the read miss consults the write buffer directly, or it blindly waits for it to drain. (The system I work on does the latter.)

I suppose a kernel write queue, being software, is a bit more flexible?

Well

Anonymous
on
September 24, 2002 - 12:02am

(AFAIK)

Remember, pages in memory are always valid (most up to date), notwithstanding CPU caches, which are handled by coherency...

Pages will be put on the write queue only if they are dirty (obviously) and for a number of reasons: the VM needs memory and the replacement algo chooses this page, it has been dirty for a specified time, or a sync is called.

In the situation of either of the two latter cases, the page is always in memory and can be used any time.

In the former case, the page should be accessible while it is on the write queue (AFAIK the VM will pick up this access and no longer evict the page after it is written). After the page is evicted, however, it needs to be read in from disk upon use - this is an unavoidable situation, though a good replacement policy should minimise the frequency of this happening.
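To make the distinction concrete, here is a minimal toy model of the behaviour described above. It is only a sketch: the `Page` class and the `write`, `writeback`, `evict` and `access` helpers are hypothetical names invented for illustration and do not correspond to kernel code. The point it shows is that writing a page out only clears its dirty flag, while eviction is a separate step.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Toy page: 'resident' and 'dirty' are independent flags."""
    resident: bool = True
    dirty: bool = False


def write(page):
    # A store to the page only marks it dirty; nothing reaches the disk yet.
    page.dirty = True


def writeback(page):
    # Writing the page out makes it clean. It stays resident and usable.
    page.dirty = False


def evict(page):
    # A dirty page must be written out before it can be dropped.
    if page.dirty:
        writeback(page)
    page.resident = False


def access(page):
    """Return True if the access had to read the page back from disk."""
    if page.resident:
        return False
    page.resident = True  # simulated read from disk
    return True


p = Page()
write(p)
writeback(p)      # e.g. triggered by age or by a sync
print(access(p))  # page is clean but still in memory
evict(p)
print(access(p))  # page was dropped, so it has to be read back in
```

In this model only `evict` ever makes a later access expensive; `writeback` on its own never does.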

ahh, ok: writes don't imply invalidate.

Anonymous
on
September 24, 2002 - 9:54am


> Pages will be put on the write queue only if they are dirty (obviously) and for a number of reasons: the VM needs memory and the replacement algo chooses this page, it has been dirty for a specified time, or a sync is called.


I guess the main difference, then, is that pages are put on the write queue solely because they're dirty, not because they're being removed from memory. That is, writing out a page only makes it clean, it doesn't necessarily imply it's being victimized. In modern hardware caches, writeouts generally occur due to victim writebacks (data is being evicted from the cache to make room for other data), or due to write misses in a cache that does not write-allocate. (That assumes a writeback cache. Write-thru hardware caches are closer to what the VM's doing, but still not quite.) Thus, I was thinking in terms of "writeback implies invalidate", which is what a hardware cache often does, but not really applicable to the VM.

What led to my confusion is that writeouts generally happen when things are getting "swapped out", so the process of discarding a page includes the task of writing it out to disk if it's dirty. What you're saying is that the two aren't really tied at the hip, even if one process requires the other.

Thanks.

--Joe

I think you are correct.

Anonymous
on
September 26, 2002 - 2:00am

So is the kernel constantly doing page walks to see if pages are dirty and then writing them to disk just for fun? I could see the benefit of doing that, but you are always going to have a lot of dirty pages. Buffers on hard drives are only so big. Most L2 cache lines are only 64 bytes wide. That is a lot of transactions on the FSB just for the sake of writing dirty pages. Granted, you would eventually have to write those pages anyhow, but does it save you that much time if you are doing it all the time rather than only when you are swapping pages in and out? It is interesting.

RE: I think you are correct

Anonymous
on
September 26, 2002 - 2:29am

Well, there is a tradeoff here. At first thought you'd think that keeping dirty pages in memory for as long as possible (until page reclaim due to low memory) would give you the best throughput, as multiple (thousands of) writes can go to the page and it is only physically written once. It turns out that this is not necessarily so, because if the disk is idle then it is usually better to give it something to chew on.
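A rough way to see both sides of this tradeoff is to count physical writes against the backlog of dirty pages left over. The sketch below is a deliberately simplified model, not kernel behaviour: `simulate` and its `flush_every` parameter are made-up names, each store takes one "tick", and a flush writes every dirty page at once.

```python
def simulate(stores, flush_every=None):
    """Count physical writes for a stream of stores.

    stores: sequence of page numbers, one store per tick.
    flush_every: write back all dirty pages every N ticks
                 (None means never flush during the run).
    Returns (physical_writes, dirty_pages_left_at_end).
    """
    dirty = set()
    physical_writes = 0
    for tick, page in enumerate(stores, start=1):
        dirty.add(page)  # repeated stores to one page coalesce
        if flush_every and tick % flush_every == 0:
            physical_writes += len(dirty)
            dirty.clear()
    return physical_writes, len(dirty)


hot = [0, 1, 2, 3] * 250        # 1000 stores to the same 4 pages
streaming = list(range(1000))   # 1000 stores to 1000 different pages

print(simulate(hot))                  # lazy: nothing written yet, 4 pages pending
print(simulate(hot, 100))             # eager: more writes, no backlog
print(simulate(streaming))            # lazy: 1000 dirty pages pile up
print(simulate(streaming, 100))       # eager: same total writes, no backlog
```

For the hot working set, holding pages back saves physical writes; for the streaming case it saves nothing and only builds up a backlog that has to be written when memory runs low.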

While the concept of a VMM is fairly simple - a real system (eg Linux) is incredibly difficult to predict / model even for those with quite a bit of knowledge - a lot of the development and probably 100% of the "tuning" is simply the result of a lot of testing.

And I'm not really sure why you bring CPU caches into it - at this level of the kernel, the CPU caches are transparent - this is solely main memory <--> disk stuff.

CPU caches are relevant in two ways

Anonymous
on
September 26, 2002 - 9:57pm

On one level, comparing to a CPU cache is relevant as the principle is similar: The memory available to a program is constructed of a hierarchy of storage devices. At the top of the hierarchy are CPU registers, followed by the level 1 cache, level 2 cache, (up to level N hardware cache), system memor(ies), and at the bottom, the hard drive. The VMM in the OS manages the hard-drive <--> system memory aspect of things.

On the other level, thrashing the CPU's caches by preemptively cycling through kernel internal data structures isn't a great way to speed up a system. I believe that's the other thing the poster was talking about when he/she mentioned cache line sizes. Cache pollution is a great performance degrader.

My (perhaps naive) understanding of why it's a good idea to push dirty pages out regularly is that (a) syncing things regularly cuts your losses if you ever have an equipment failure, and (b) it bounds the amount of data you'll have to deal with in the event you need to get rid of some of it. Keeping idle disks busy is nice and all, but if something suddenly needs that disk, it'll take a latency hit waiting behind your preemptive writes. It's all a balancing act.

--Joe

Hmm

Anonymous
on
September 26, 2002 - 11:13pm

> On one level, comparing to a CPU cache is relevant as the principle
> is similar: The memory available to a program is constructed of a
> hierarchy of storage devices. At the top of the hierarchy are CPU
> registers, followed by the level 1 cache, level 2 cache, (up to
> level N hardware cache), system memor(ies), and at the bottom, the
> hard drive. The VMM in the OS manages the hard-drive <--> system
> memory aspect of things.

No doubt about that

> On the other level, thrashing the CPU's caches by preemptively
> cycling through kernel internal data structures isn't a great way to
> speed up a system. I believe that's the other thing the poster was
> talking about when he/she mentioned cache line sizes. Cache
> pollution is a great performance degrader.

Actually, when memory gets low and writeout is needed, the few thousand cycles you gained from a few more cache hits are offset by the many-million-cycle stall waiting on IO. If you write it out before necessary, the CPU can continue doing stuff while the disk does IO; this is the point of async (non-blocking) writeout.
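The idea of letting the CPU carry on while the disk works can be sketched in user space with a queue and a background thread. This is only an analogy under stated assumptions: `background_writer` and `run` are hypothetical names, the "disk" is a `time.sleep`, and none of this reflects how the kernel actually implements writeback.

```python
import queue
import threading
import time


def background_writer(q, written, io_delay_s):
    # Stand-in for the disk: each page takes io_delay_s to "write".
    while True:
        page = q.get()
        if page is None:  # sentinel: no more work
            break
        time.sleep(io_delay_s)
        written.append(page)


def run(pages, io_delay_s=0.001):
    q = queue.Queue()
    written = []
    writer = threading.Thread(
        target=background_writer, args=(q, written, io_delay_s)
    )
    writer.start()
    computed = 0
    for page in pages:
        q.put(page)    # queue the write and return at once
        computed += 1  # the "CPU" keeps doing useful work
    q.put(None)
    writer.join()      # only here do we wait for the I/O to finish
    return computed, written


done, on_disk = run(range(5))
print(done, on_disk)
```

The submitting loop never waits on an individual write; the only blocking point is the final `join`, which stands in for the moment the data really must be on disk.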

> My (perhaps naive) understanding of why it's a good idea to push
> dirty pages out regularly is that (a) syncing things regularly cuts
> your losses if you ever have an equipment failure, and (b) it bounds
> the amount of data you'll have to deal with in the event you need to
> get rid of some of it. Keeping idle disks busy is nice and all, but
> if something suddenly needs that disk, it'll take a latency hit
> waiting behind your preemptive writes. It's all a balancing act.

This latency problem is why the latency "hack" mentioned in the article was needed (Linus's tree now has a new elevator).

The problem is when there's nothing dirty yet.

Anonymous
on
September 27, 2002 - 5:16am

> Actually, when memory gets low and writeout is needed, the few thousand cycles you gained from a few more cache hits are offset by the many-million-cycle stall waiting on IO. If you write it out before necessary, the CPU can continue doing stuff while the disk does IO; this is the point of async (non-blocking) writeout.

I agree. The main thing is scanning to find dirty data to write out. Scanning continuously but not finding anything new to write out only thrashes the CPU cache and hurts overall system performance. That's why pdflush has thresholds for how much dirty data and how old. Don't even bother trying to push out anything dirty until you know you'll be productive about it.
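As a sketch of that kind of threshold check: the function below starts writeback only when enough data is dirty or the oldest dirty data is old enough. The function name, the parameter names and the default values (10% of memory, 30 seconds) are assumptions chosen for illustration; they are not taken from pdflush itself.

```python
def should_start_writeback(dirty_pages, total_pages, oldest_dirty_age_s,
                           background_ratio=0.10, expire_s=30.0):
    """Hypothetical policy check; thresholds are illustrative only."""
    # Condition 1: too large a share of memory is dirty.
    too_much = dirty_pages > background_ratio * total_pages
    # Condition 2: something has been dirty for too long.
    too_old = dirty_pages > 0 and oldest_dirty_age_s >= expire_s
    return too_much or too_old


print(should_start_writeback(0, 1000, 0.0))     # nothing dirty
print(should_start_writeback(50, 1000, 5.0))    # little dirty data, all recent
print(should_start_writeback(150, 1000, 5.0))   # over the ratio threshold
print(should_start_writeback(1, 1000, 45.0))    # one page, but it is old
```

Both checks only need a couple of counters, so deciding not to write anything is cheap and does not require scanning pages.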

Contest results for 2.5.38-mm2

nimrod
on
September 24, 2002 - 2:27am

Con Kolivas posted updated contest results for 2.5.38-mm2, and the results look good:

From: Con Kolivas
To: linux-kernel
Subject: [BENCHMARK] 2.5.38-mm2 contest results
Date: Mon, 23 Sep 2002 17:24:14 +1000

Here follow the contest benchmarks for 2.5.38-mm2

NoLoad:
Kernel                  Time            CPU
2.4.19                  66.56           99%
2.5.38                  68.25           99%
2.5.38-mm1              67.17           99%
2.5.38-mm2              67.48           99%

Process Load:
Kernel                  Time            CPU
2.4.19                  81.29           80%
2.5.38                  71.60           95%
2.5.38-mm1              70.49           95%
2.5.38-mm2              70.82           95%

IO Half Load:
Kernel                  Time            CPU
2.4.19                  101.39          69%
2.5.38                  81.26           90%
2.5.38-mm1              82.52           87%
2.5.38-mm2              78.46           91%

IO Full Load:
Kernel                  Time            CPU
2.4.19                  170.70          41%
2.5.38                  170.21          42%
2.5.38-mm1              434.41          16%
2.5.38-mm2              108.15          66%

Mem Load:
Kernel                  Time            CPU
2.4.19                  93.33           77%
2.5.38                  104.22          70%
2.5.38-mm1              92.97           77%
2.5.38-mm2              90.89           80%

As akpm has said, mm2 should fix the write-starves-read problem in mm1, and this
is clearly shown in the IO full load results being substantially better.
This is on an IDE system.

Other results removed for clarity. All tests are done with gcc2.95.3 :\

Con.
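For a quick sense of scale, the IO Full Load times from the table above can be turned into ratios. This is just arithmetic on the posted figures; the `speedup` helper is a made-up name.

```python
# IO Full Load times copied from the table above (lower is better).
io_full_load = {
    "2.4.19": 170.70,
    "2.5.38": 170.21,
    "2.5.38-mm1": 434.41,
    "2.5.38-mm2": 108.15,
}


def speedup(baseline, candidate):
    """How many times faster `candidate` finished than `baseline`."""
    return io_full_load[baseline] / io_full_load[candidate]


print(round(speedup("2.5.38-mm1", "2.5.38-mm2"), 2))  # mm2 vs mm1
print(round(speedup("2.5.38", "2.5.38-mm2"), 2))      # mm2 vs plain 2.5.38
```

On these numbers mm2 finishes roughly four times faster than mm1 and a bit over one and a half times faster than plain 2.5.38 under full IO load.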

Silly Question

Anonymous
on
September 24, 2002 - 5:10am

But is this a patch towards his own AA VM or the RMAP one? I'd hope the latter since it seems rmap will be 2.6's kernel.

rmap already went in.

Anonymous
on
September 24, 2002 - 6:31am

As I understand it, most parts of rmap have already gone in.

If I look at the status page (http://kernelnewbies.org/status/latest.html), one of the points there says:

o in 2.5.27+ New VM with reverse mappings (Rik van Riel)

*his* own AA VM?

Anonymous
on
September 24, 2002 - 9:58am

AA = Andrea Arcangeli
AKPM = Andrew Morton

Hmmm

Anonymous
on
September 24, 2002 - 9:06am

This reminds me of how MS used to claim that FAT32 gave you 25% more disk space than plain old FAT. In other words: "Now we piss away less of your system's resources. We suck less than we used to!" If these kinds of improvements are still possible with a relatively minor software change, what light does that shine on claims that Linux I/O didn't suck before? Everybody who knew storage knew that Linux was terrible at keeping disks busy, but they were shouted down by the zealots.

Ummm

Anonymous
on
September 24, 2002 - 10:54pm

This wasn't a relatively minor change really - Andrew has been working toward this for most of 2.5. And that's what you get for preaching to the zealots, isn't it? Everyone on lkml (ie people who know what they're talking about) knew that the VM and IO subsystems were pretty bad when 2.5 forked - that's why so much work has been going into them. And terrible it may have been, but relative to things like Windows it must have been better because you don't see any MS comparisons of disk system performance, do you?

Amdahl's Law at work

Anonymous
on
September 26, 2002 - 10:06pm

You also have got to remember that "bad" is all relative. We're up against Amdahl's Law here. If we take the worst thing in the kernel and make it perfect, it makes everything else that was even slightly bad look worse. SMP and related locking issues were the big thing everyone focused on for awhile. Now everyone's focusing on getting the VM and I/O up to speed since most of the other major subsystems are fine.
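For anyone who wants the formula: Amdahl's Law says that if a fraction p of the runtime is sped up by a factor s, the overall speedup is 1 / ((1 - p) + p / s). A minimal sketch follows; the 50% figure in the example is an arbitrary illustration, not a measurement of any kernel.

```python
def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of runtime is sped up by `factor`."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


# Suppose one subsystem accounts for half of the total time.
print(amdahl_speedup(0.5, 2))    # doubling its speed
print(amdahl_speedup(0.5, 1e9))  # making it essentially "perfect"
```

Even a near-perfect fix to that half cannot quite double overall performance, which is why attention then moves to whatever makes up the remaining time.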

Big whoop-te-do.

Also keep in mind that the weighting function changes with time. Systems are getting bigger, software is getting bigger, and many of these changes are aimed at making Linux scale to those big boxes. How many desktop machines run a workload at all similar to 'contest'? How many user workstations will have 60 disks attached that need to be saturated? Linux is becoming more popular on these systems. Before, it wasn't as much of a concern.

I've been using Linux since 0.99.14, and sure, there has been no shortage of warts in the last 9 years, but for typical usage (and even occasionally atypical usage), it hasn't been too bad. (Heck, like back in the 1.2 days, when I used to run huge Spice simulations on my poor 52MB RAM 486 machine and go about 350MB into swap.... Ahh, college days...)

Workload similar to contest?

Con Kolivas
on
September 29, 2002 - 2:50am

Err if I'm not mistaken the point of contest was to recreate situations that cause real slow downs on desktop systems. These are real desktop workload situations, sustained for the duration of the benchmark to increase the signal to noise ratio of the benchmark.
