On Fri, 5 Oct 2007 20:30:28 +0800 Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:

Sure, but we don't have one disk queue per disk per zone! The queue is shared by all the zones. So if writeback from one zone has filled the queue up, the kernel can't write back data from another zone. (Well, it can, by blocking in get_request_wait(), but that causes long and uncontrollable latencies).

Or someone ran fsync(), or pdflush is writing back data because it exceeded dirty_writeback_centisecs, etc.

Yeah. In 2.4 and early 2.5, page-reclaim (both direct reclaim and kswapd, iirc) would throttle by waiting on writeout of a particular page. This was a poor design, because writeback against a *particular* page can take anywhere from one millisecond to thirty seconds to complete, depending upon where the disk head is and all that stuff.

The critical change I made was to switch the throttling algorithm from "wait for one page to get written" to "wait for _any_ page to get written". Because reclaim really doesn't care _which_ page got written: we want to wake up and start scanning again when _any_ page got written.

That's what congestion_wait() does.

It is pretty crude. It could be that writeback completed against pages which aren't in the correct zone, or it could be that some other task went and allocated the just-cleaned pages before this task can get running and reclaim them, or it could be that the just-written-back pages weren't reclaimable after all, etc.

It would take a mind-boggling amount of logic and locking to make all this 100% accurate and the need has never been demonstrated. So page reclaim presently should be viewed as a polling algorithm, where the rate of polling is paced by the rate at which the IO system can retire writes.

Something like that. The critical numbers to watch are /proc/vmstat's *scan* and *steal*.
Look:

akpm:/usr/src/25> uptime
 10:08:14 up 10 days, 16:46, 15 users,  load average: 0.02, 0.05, 0.04
akpm:/usr/src/25> grep steal /proc/vmstat
pgsteal_dma 0
pgsteal_dma32 0
pgsteal_normal 0
pgsteal_high 0
pginodesteal 0
kswapd_steal 1218698
kswapd_inodesteal 266847
akpm:/usr/src/25> grep scan /proc/vmstat
pgscan_kswapd_dma 0
pgscan_kswapd_dma32 1246816
pgscan_kswapd_normal 0
pgscan_kswapd_high 0
pgscan_direct_dma 0
pgscan_direct_dma32 448
pgscan_direct_normal 0
pgscan_direct_high 0
slabs_scanned 2881664

Ignore kswapd_inodesteal and slabs_scanned. We see that this machine has scanned 1246816+448 pages and has reclaimed (stolen) 1218698 pages. That's a reclaim success rate of 97.7%, which is pretty damn good - this machine is just a lightly-loaded 3GB desktop.

When testing reclaim, it is critical that this ratio be monitored (vmmon.c from ext3-tools is a vmstat-like interface to /proc/vmstat). If the reclaim efficiency falls below, umm, 25% then things are getting into some trouble.

Actually, 25% is still pretty good. We scan 4 pages for each reclaimed page, but the amount of wall time which that takes is vastly less than the time to write one page, bearing in mind that these things tend to be seeky as hell. But still, keeping an eye on the reclaim efficiency is just your basic starting point for working on page reclaim.
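
For anyone who wants to redo the arithmetic above without vmmon.c, here is a minimal sketch in Python. It is not part of ext3-tools; the function names are made up for illustration, and the choice of which counters to sum is an assumption taken from the output shown above (pgscan_kswapd_* plus pgscan_direct_* for pages scanned, kswapd_steal plus pgsteal_* for pages stolen). Counter names and meanings differ between kernel versions, so check them against your own /proc/vmstat before trusting the result.

```python
# Sample counters copied from the /proc/vmstat output quoted above.
SAMPLE = """\
pgsteal_dma 0
pgsteal_dma32 0
pgsteal_normal 0
pgsteal_high 0
pginodesteal 0
kswapd_steal 1218698
kswapd_inodesteal 266847
pgscan_kswapd_dma 0
pgscan_kswapd_dma32 1246816
pgscan_kswapd_normal 0
pgscan_kswapd_high 0
pgscan_direct_dma 0
pgscan_direct_dma32 448
pgscan_direct_normal 0
pgscan_direct_high 0
slabs_scanned 2881664
"""


def parse_vmstat(text):
    """Turn 'name value' lines into a dict of integer counters."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def reclaim_efficiency(counters):
    """Return (scanned, stolen, ratio); ratio is None if nothing was scanned.

    Assumption: the counter selection mirrors the figures in the message
    above. kswapd_inodesteal, pginodesteal and slabs_scanned are ignored.
    """
    scanned = sum(
        value
        for name, value in counters.items()
        if name.startswith(("pgscan_kswapd_", "pgscan_direct_"))
    )
    stolen = counters.get("kswapd_steal", 0) + sum(
        value for name, value in counters.items() if name.startswith("pgsteal_")
    )
    ratio = stolen / scanned if scanned else None
    return scanned, stolen, ratio


scanned, stolen, ratio = reclaim_efficiency(parse_vmstat(SAMPLE))
print(f"scanned={scanned} stolen={stolen} efficiency={ratio:.1%}")
```

To try it on a live machine, pass the contents of /proc/vmstat to parse_vmstat() instead of SAMPLE. The counters are cumulative since boot, so when testing a particular workload it is probably more useful to take the difference between two samples than to look at the totals.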
