Re: [PATCH] mm/filemap.c: unconditionally call mark_page_accessed

Previous thread: Re: [PATCH] drivers: PMC MSP71xx LED driver by Marc St-Jean on Friday, March 9, 2007 - 3:20 pm. (2 messages)

Next thread: [PATCH] dio: invalidate clean pages before dio write by Zach Brown on Friday, March 9, 2007 - 3:35 pm. (2 messages)
From: Ashif Harji
Date: Friday, March 9, 2007 - 3:03 pm

Hi, I am encountering a performance problem, which I have tracked into the 
Linux kernel. The problem occurs with my experimental web server that uses 
sendfile to repeatedly transmit files.  The files are based on the static 
portion of the SPECweb99 fileset and range in size to model a reasonable 
workload.  With this workload, a significant number of the requests are 
for files of size 4 KB or less.

I have determined that the performance problems occurs in the function
do_generic_mapping_read in file mm/filemap.c for kernel version 2.6.20.1.
Here is the specific code fragment:

         /*
          * When (part of) the same page is read multiple times
          * in succession, only mark it as accessed the first time.
          */
         if (prev_index != index)
                 mark_page_accessed(page);


The implication of this code is that for files of size less than or equal 
to a single page, the page associated with such a file is likely to get 
evicted from the cache regardless of how frequently it is accessed.  The 
reason is that after the first access, prev_index is always zero and index 
can only be zero. Hence, mark_page_accessed is never called after the 
first time the file is requested.  As a result, the page is evicted from 
the cache no matter how frequently it is used.  By changing the kernel to 
always call mark_page_accessed for these files, the server throughput is 
increased by as much as 20%.

I was wondering if anyone could explain why the call to mark_page_accessed 
is conditional? That is, what problem it is trying to solve. It would seem 
that in many scenarios, if the same page is accessed repeatedly, then it 
would be appropriate to keep that page cached.

Please personally CC me on any responses.

thanks,
ashif.
-

From: Jan Kara
Date: Monday, March 12, 2007 - 7:20 am

Your analysis seems to be right. But to observe this behaviour you have
to have the file open and just always reread it using the same file
  I also don't know why the condition is there but it's there at least
for two years so I'm not sure anybody remembers ;). Nick, do you have
an idea?

									Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs
-

From: Nick Piggin
Date: Monday, March 12, 2007 - 7:39 am

Did it always use ra->prev_page? ISTR it using pos%PAGE_SIZE == 0 at some

Yeah it is there because that is basically how our "use once" detection
handles the case where an app does not read in chunks that are PAGE_SIZE
multiples and PAGE_SIZE aligned.

-

From: Jan Kara
Date: Monday, March 12, 2007 - 8:13 am

OK, I see. Then I'm not sure the check does more good than bad. Because
if we happen to reread the same chunk several times, then the check does a
wrong thing...

									Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs
-

From: Ashif Harji
Date: Monday, March 12, 2007 - 10:05 am

Thanks for providing me with additional information.

I would like to submit a patch to fix the performance problem.  The 
simplest solution is to remove the check.  Even in the situation where an 
application does not read in PAGE_SIZE multiples as described above, if 
the page is accessed frequently it should remain in the cache.  However, I 
am open to suggestions for a more sophisticated scheme.

ashif.
-

From: Jan Kara
Date: Monday, March 12, 2007 - 10:35 am

Well, yes, that's an obvious solution. I just wanted to make sure that
such change won't break some other load. But so far it seems it won't...

									Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs
-

From: Ashif Harji
Date: Tuesday, March 13, 2007 - 11:43 am

I would like to find out if the change I am suggesting will cause any 
other problems.  Is there someone else I should contact before submitting 
a patch?  Or should I just wait another day or two for comments?

ashif.
-

From: Jan Kara
Date: Tuesday, March 13, 2007 - 11:55 am

I don't think you'll get more feedback now. Just write the patch,
benchmark it with your workload and also some others (basically, there should be
no difference) and then submit the patch together with the performance
numbers and some rationale (if it's your first submission, see
Documentation/SubmittingPatches). Give also CC to Andrew <akpm@linux-foundation.org>
and ask him to put it into -mm kernels for testing.

									Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs
-

From: Ashif Harji
Date: Wednesday, March 14, 2007 - 12:58 pm

This patch unconditionally calls mark_page_accessed to prevent pages, 
especially for small files, from being evicted from the page cache despite 
frequent access.

Signed-off-by: Ashif Harji <asharji@beta.uwaterloo.ca>

---

If the same page of a file is repeatedly accessed (without accessing other 
pages of that file) via the same file descriptor, mark_page_accessed is 
never called after the first time the page is accessed.

The implication of this code is that for files of size less than or equal 
to a single page, the page associated with such a file is likely to get 
evicted from the cache regardless of how frequently it is accessed. 
However, this behaviour also occurs with files of any size if the same 
page is repeatedly accessed.

As a benchmark, I have an experimental web server that uses sendfile to 
repeatedly transmit files.  The files are based on the static portion of 
the SPECweb99 fileset and range in size to model a reasonable workload. 
With this workload, a significant number of the requests are for files of 
size 4 KB or less.

By changing the kernel to always call mark_page_accessed, the server 
throughput is increased by as much as 20%.  With one test, for example, 
without the change I get throughput of around 868 Mbps.  After making the 
change, performance increases to 1111 Mbps.

Using a configuration that should be unaffected by the change, performance 
was around 855 Mbps without the change and around 851 Mbps with the 
change.  As expected the change had no appreciable effect.

See thread http://lkml.org/lkml/2007/3/9/403 for additional discussion on 
this change.

This patch is for kernel version 2.6.20.1.

Andrew, can you also put this change into the -mm kernels for testing?


--- linux-2.6.20.1/mm/filemap.c.orig	2007-03-14 10:31:58.000000000 -0500
+++ linux-2.6.20.1/mm/filemap.c	2007-03-13 16:11:54.000000000 -0500
@@ -943,12 +943,7 @@ page_ok:
  		if (mapping_writably_mapped(mapping))
  			flush_dcache_page(page);

-		/*
-		 ...
From: Dave Kleikamp
Date: Wednesday, March 14, 2007 - 1:55 pm

I guess the downside to this is if a reader is reading a large file, or
several files, sequentially with a small read size (smaller than
PAGE_SIZE), the pages will be marked active after just one read pass.
My gut says the benefits of this patch outweigh the cost.  I would
expect real-world backup apps, etc. to read at least PAGE_SIZE.

Shaggy
-- 
David Kleikamp
IBM Linux Technology Center

-

From: Andreas Mohr
Date: Wednesday, March 14, 2007 - 2:33 pm

Hi,


I also think that the patch is somewhat problematic, since the original
intention seems to have been a reduction of the number of (expensive?)
mark_page_accessed() calls, but this of course falls flat on its face in case
of permanent single-page accesses or accesses with progressing but very small
read size (single-byte reads or so), since the cached page content will expire
eventually due to lack of mark_page_accessed() updates; thus this patch
decided to call mark_page_accessed() unconditionally which may be a large
performance penalty for subsequent tiny-sized reads.

I've been thinking hard how to avoid the mark_page_accessed() starvation in
case of a fixed, (almost) non-changing access state, but this seems hard since
it'd seem we need some kind of state management here to figure out good
intervals of when to call mark_page_accessed() *again* for this page. E.g.
despite non-changing access patterns you could still call mark_page_accessed()
every 32 calls or so to avoid expiry, but this would need extra helper
variables.

A rather ugly way to do it may be to abuse ra.cache_hit or ra.mmap_hit content
with a
	if ((prev_index != index) || (ra.cache_hit % 32 == 0))
		mark_page_accessed(page);
This assumes that ra.cache_hit gets incremented for every access (haven't
checked whether this is the case).
That way (combined with an enhanced comment properly explaining the dilemma)
you would avoid most mark_page_accessed() invocations of subsequent same-page reads
but still do page status updates from time to time to avoid page deprecation.

Does anyone think this would be acceptable? Any better idea?

Andreas Mohr

P.S.: since I'm not too familiar with this area I could be rather wrong after all...
-

From: Dave Kleikamp
Date: Wednesday, March 14, 2007 - 3:08 pm

mark_page_accessed() isn't expensive.  If called repeatedly, starting
with the third call, it will check two page flags and return.  The only
real expense is that the page appears busier than it may be and will be

Any application doing many tiny-sized reads isn't exactly asking for

I wouldn't go looking for anything more complicated than Ashif's patch,

I could be missing something as well.  :-)

Shaggy
-- 
David Kleikamp
IBM Linux Technology Center

-

From: Xiaoning Ding
Date: Wednesday, March 14, 2007 - 6:36 pm

If we allow mark_page_accessed() called multiple times for a single page,
a scan of large file with small-size reads would flush the buffer cache.
mark_page_accessed() also requests lru_lock when moving page from

-

From: Ashif Harji
Date: Wednesday, March 14, 2007 - 10:22 pm

The problem with the existing logic is that it is too coarse.  In trying 
to deal with one usage pattern it is negatively impacting performance for 
other reasonable access patterns.

Further, consider the extreme case of scanning a file 1 byte at a time. 
In this case, you are going to access a page over 4000 times, but that 
page is not going to be marked as active and hence that page is likely to 
be evicted from the cache.  Clearly, there are cases when scanning a file 
that you would like the pages to be kept in the cache.

Finally, the existing code is problematic as there is no reasonable way to 
circumvent the negative impact for small files.

Hence, I think a change is necessary.  The question is whether the 
intent of conditionally calling mark_page_accessed() is still reasonable 
and whether the amount of bookkeeping required to detect that usage 
pattern but not create a problem for other usage patterns is reasonable.

I would tend to agree with David that:  "Any application doing many 
tiny-sized reads isn't exactly asking for great performance."  As well, 
applications concerned with performance and caching problems can read in a 
file in PAGE_SIZE chunks.  I still think the simple fix of removing the 
condition is the best approach, but I'm certainly open to alternatives.

ashif.
-

From: Dave Kleikamp
Date: Thursday, March 15, 2007 - 5:46 am

A possible alternative might be to store the offset within the page in
the readahead state, and call mark_page_accessed() when the read offset
is less than or equal to the previous offset.

-- 
David Kleikamp
IBM Linux Technology Center

-

From: Nick Piggin
Date: Thursday, March 15, 2007 - 5:50 am

That could be a good idea.

We definitely want to look at ways to solve with within the existing
approach before any large scale change in behaviour.

-

From: Andrew Morton
Date: Thursday, March 15, 2007 - 12:07 pm

Yes, the problem of falsely activating pages when the file is read in small
hunks is worse than the problem which your patch fixes.

We could change it so that if the current read() includes the zeroeth byte
of the page, we run mark_page_accessed() even if this_page==prev_page?
-

From: Andrea Arcangeli
Date: Thursday, March 15, 2007 - 2:49 pm

Really? I would have expected all performance sensitive apps to read
in >=PAGE_SIZE chunks. And if they don't because they split their
dataset in blocks (like some database), it may not be so wrong to
activate those pages that have two "hot" blocks more aggressively than
those pages with a single hot block.

So I've an hard time to advocate to prefer the current behavior, but
certainly this can be "fixed" by caching the last_offset like others
pointed out ;)
-

From: Andrew Morton
Date: Thursday, March 15, 2007 - 3:06 pm

On Thu, 15 Mar 2007 22:49:23 +0100

But the problem which is being fixed here is really obscure: an application
repeatedly reading the first page and only the first page of a file, always
via the same fd.

I'd expect that the sub-page-size read scenarion happens heaps more often
than that, especially when dealing with larger PAGE_SIZEs.


-

From: Andrea Arcangeli
Date: Thursday, March 15, 2007 - 4:15 pm

Whatever that app is doing, clearly we have to keep those 4k in cache!
Like obviously the specweb demonstrated that as long as you are
_repeating_ the same read, it's correct to activate the page even if
it was reading from the same page as before.

What is wrong is to activate the page more aggressively if it's
_different_ parts of the page that are being read in a contiguous
way. I thought that the whole point of the ra.prev_page was to detect
_contiguous_ (not random) I/O made with a small buffer, anything else
doesn't make much sense to me.

In short I think taking a ra.prev_offset into account as suggested by
Dave Kleikamp is the best, it may actually benefit the obscure app too ;)
-

From: Rik van Riel
Date: Thursday, March 15, 2007 - 8:00 am

Like this? :)

http://surriel.com/patches/clockpro/2.6.12/useonce-cleanup

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.
-

From: Valdis.Kletnieks
Date: Thursday, March 15, 2007 - 10:37 am

What if you did something like

	if (jiffies%32) {...

(Possibly scaling it so the low-order bits change).  No need to lock it, as
"right most of the time" is close enough.

From: Rik van Riel
Date: Thursday, March 15, 2007 - 11:35 am

Bad idea.  That way you would only count page accesses if the
phase of the moon^Wjiffie is just right.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.
-

From: Valdis.Kletnieks
Date: Thursday, March 15, 2007 - 8:51 pm

On the other hand, Andreas suggested only marking it once every 32 calls,
but that required a helper variable.  Statistically, jiffies%32 should
end up about the same as a helper variable %32.

This of course, if just calling mark_page_accessed() is actually expensive
enough that we don't want to do it unconditionally.

From: Rik van Riel
Date: Thursday, March 15, 2007 - 9:09 pm

Not caching a needed page and having to wait for a disk seek
to complete will be *way* more expensive than any call to
mark_page_accessed().

A modern CPU can do somewhere on the order of 50 million
instructions in the time it takes to bring one page in from
disk.

However, this does not mean we should unconditionally call
mark_page_accessed(), since that could cause use to push
wanted data out of the cache because of one program that
does its streaming accesses in a strange way...

This is a situation where getting it right almost certainly
matters.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.
-

From: Anton Blanchard
Date: Friday, March 16, 2007 - 7:20 am

PAGE_SIZE being 8k on sparc, 16-64k on ia64 and potentially 64kb on
powerpc :)

Id expect a large percentage of files to be below that size.

Anton
-

From: Peter Zijlstra
Date: Thursday, March 15, 2007 - 3:39 am

Since we're hackling over the use-once stuff again...

/me brings up: http://marc.info/?l=linux-mm&m=115316894804385&w=2 and
ducks.

-

From: Nick Piggin
Date: Thursday, March 15, 2007 - 5:38 am

Join the club ;) 

http://groups.google.com.au/group/linux.kernel/msg/7b3237b8e715475b?hl=en&

I can't find the patch where I actually did combine it with a PG_usedonce
bit, but the end result is pretty similar to your patch. And I think one
or two others have also independently invented the same thing.

So it *has* to be good, doesn't it? ;)
-

From: Rik van Riel
Date: Thursday, March 15, 2007 - 8:06 am

Except for the bit where vmscan throws away page cache pages
with PageReferenced set, unless they are mapped :)

Also, pages that only got accessed once will have the referenced
bit set too, so your patch gives no way to distinguish between
pages that were accessed once and pages that were accessed multiple

No argument there :)

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.
-

From: Chuck Ebbert
Date: Thursday, March 15, 2007 - 8:56 am

I like mine better -- it leaves the comment:


From: Nick Piggin
Date: Thursday, March 15, 2007 - 9:29 am

How about this? It also doesn't break the use-once heuristic.

--
A change to make database style random read() workloads perform better, by
calling mark_page_accessed for some non-page-aligned reads broke the case of
< PAGE_CACHE_SIZE files, which will not get their prev_index moved past the
first page.

Combine both heuristics for marking the page accessed.

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -929,7 +929,7 @@ page_ok:
 		 * When (part of) the same page is read multiple times
 		 * in succession, only mark it as accessed the first time.
 		 */
-		if (prev_index != index)
+		if (prev_index != index || !offset)
 			mark_page_accessed(page);
 		prev_index = index;
 
-

From: Rik van Riel
Date: Thursday, March 15, 2007 - 10:04 am

Acked-by: Rik van Riel <riel@redhat.com>

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.
-

From: Hugh Dickins
Date: Thursday, March 15, 2007 - 10:44 am

Yeah, yeah, I'm not a real mman, I don't have my own patch and
website for this ;) but I'm old, let me mumble some history...

Ashif's patch would take us back to 2.4.10 when mark_page_accessed
was introduced: in 2.4.11 someone (probably Andrea) immediately
added a !offset || !filp->f_reada condition on it there, which
remains in 2.4 to this day.  That _probably_ means that Ashif's
patch is suboptimal, and that your !offset patch is good.

f_reada went away in 2.5.8, and the !offset condition remained
until 2.6.11, when Miquel (CC'ed) replaced it by today's prev_index
condition.  His changelog entry appended below.  Since it's Miquel
who removed the !offset condition, he should be consulted on its
reintroduction.


Miquel's patch comment from ChangeLog-2.6.11:

[PATCH] mark_page_accessed() for read()s on non-page boundaries

When reading a (partial) page from disk using read(), the kernel only marks
the page as "accessed" if the read started at a page boundary.  This means
that files that are accessed randomly at non-page boundaries (usually
database style files) will not be cached properly.

The patch below uses the readahead state instead.  If a page is read(), it
is marked as "accessed" if the previous read() was for a different page,
whatever the offset in the page.

Testing results:


- Boot kernel with mem=128M

- create a testfile of size 8 MB on a partition. Unmount/mount.

- then generate about 10 MB/sec streaming writes

	for i in `seq 1 1000`
	do
		dd if=/dev/zero of=junkfile.$i bs=1M count=10
		sync
		cat junkfile.$i > /dev/null
		sleep 1
	done

- use an application that reads 128 bytes 64000 times from a
  random offset in the 64 MB testfile.

1. Linux 2.6.10-rc3 vanilla, no streaming writes:

# time ~/rr testfile
Read 128 bytes 64000 times
~/rr testfile  0.03s user 0.22s system 5% cpu 4.456 total

2. Linux 2.6.10-rc3 vanilla, streaming writes:

# time ~/rr testfile
Read 128 bytes 64000 times
~/rr testfile  0.03s user 0.16s system 2% ...
From: Nick Piggin
Date: Thursday, March 15, 2007 - 1:01 pm

Yeah I did go back and check up on that changelog, because I knew
we had a !offset check there at one stage, which is immune to this
problem (or at least can handle it a little better).

I suspect that Miquel was probably more interested in _increasing_
mark_page_accessed coverage with his new condition than restricting
it from the !offset cases.

Thanks for digging it up and posting here, though.
-

From: Andrea Arcangeli
Date: Thursday, March 15, 2007 - 3:59 pm

the !offset check looks a pretty broken heuristic indeed, it would
break random I/O. The real fix is to add a ra.prev_offset along with
ra.prev_page, and if who implements it wants to be stylish he can as
well use a ra.last_contiguous_read structure that has a page and
offset fields (and then of course remove ra.prev_page).
-

From: Dave Kleikamp
Date: Thursday, March 15, 2007 - 4:15 pm

I wouldn't call it broken.  At worst, I'd say it's imperfect.  But
that's the nature of a heuristic.  It most likely works in a huge

I suggested something along these lines, but I wonder if it's overkill.
The !offset check is simple and appears to be a decent improvement over
the current code.
-- 
David Kleikamp
IBM Linux Technology Center

-

From: Andrea Arcangeli
Date: Thursday, March 15, 2007 - 4:28 pm

well, IMHO in the huge majority of cases the prev_page check isn't
necessary in the first place (and IMHO it hurts a lot more than it can
help, as demonstrated by specweb, since we'll bite on the good guys to
help the bad guys).

The only case where I can imagine the prev_page to make sense is to
handle contiguous I/O made with a small buffer, so clearly an
inefficient code in the first place. But if this guy is reading with
<PAGE_SIZE buffer there's no guarantee that he's reading f_pos aligned
either, hence the need of taking last_offset into account too so at
least it's a "perfect" heuristic that will reliably detect contiguous
I/O no matter in what shape or form you execute it, as long as it is
contiguous I/O.

Any other variation of behavior besides the autodetection of
contiguous I/O run in whatever buffer/aligned form, should be mandated
by userland through fadvise/madvise IMHO or we run into the toes of
the good guys.
-

From: Ashif Harji
Date: Thursday, March 15, 2007 - 12:55 pm

It sounds like people are happy with the fix suggested by Nick.  That fix 
is okay with me as it fixes the problem I am having.

I suspect, however, that by not directly detecting the problematic access 
pattern, where the file is accessed sequentially in small hunks, other 
applications may experience performance problems related to caching. For 
example, if an application frequently and non-sequentially reads from the 
same page.  This is especially true for files of size < PAGE_CACHE_SIZE.
But, I'm not sure if such an access pattern likely.

ashif.
-

From: Nick Piggin
Date: Thursday, March 15, 2007 - 1:07 pm

Well in general we like to help applications that help themselves. It
is actually a good heuristic, surprisingly. If an application randomly
accesses the same page (and there is no write activity going on), then
it would be better off to cache it in userspace, and if it doesn't care
to do that then it won't mind having to read it off disk now and again :)

-

From: Andreas Mohr
Date: Thursday, March 15, 2007 - 1:31 pm

Hi,


Sounds like a good plan since this probably is a nice way to make stupid apps
doing stupid things sit up and take notice, and maybe the authors will then go
so far as fixing up a few more things that are hurting them once they actually
recognize that something is weird due to overly bad performance.

Why go to great lengths to support stupid apps when there are still so many things
which could be done to help well-behaving ones? ;)

Andreas Mohr
-

From: Ashif Harji
Date: Monday, March 12, 2007 - 9:46 am

Yes, this problem requires the file to be accessed via the same file 
descriptor.  I suspect this behaviour is common for a certain class of 
applications, e.g., web servers.  However, any application that read from 
the same page repeatedly would also experience this problem.  It is more 
likely to occur with small files, but not exculsively.

ashif.
-

Previous thread: Re: [PATCH] drivers: PMC MSP71xx LED driver by Marc St-Jean on Friday, March 9, 2007 - 3:20 pm. (2 messages)

Next thread: [PATCH] dio: invalidate clean pages before dio write by Zach Brown on Friday, March 9, 2007 - 3:35 pm. (2 messages)