Hi, I am encountering a performance problem, which I have tracked into the
Linux kernel. The problem occurs with my experimental web server that uses
sendfile to repeatedly transmit files. The files are based on the static
portion of the SPECweb99 fileset and range in size to model a reasonable
workload. With this workload, a significant number of the requests are
for files of size 4 KB or less.
I have determined that the performance problems occurs in the function
do_generic_mapping_read in file mm/filemap.c for kernel version 2.6.20.1.
Here is the specific code fragment:
/*
* When (part of) the same page is read multiple times
* in succession, only mark it as accessed the first time.
*/
if (prev_index != index)
mark_page_accessed(page);
The implication of this code is that for files of size less than or equal
to a single page, the page associated with such a file is likely to get
evicted from the cache regardless of how frequently it is accessed. The
reason is that after the first access, prev_index is always zero and index
can only be zero. Hence, mark_page_accessed is never called after the
first time the file is requested. As a result, the page is evicted from
the cache no matter how frequently it is used. By changing the kernel to
always call mark_page_accessed for these files, the server throughput is
increased by as much as 20%.
I was wondering if anyone could explain why the call to mark_page_accessed
is conditional? That is, what problem it is trying to solve. It would seem
that in many scenarios, if the same page is accessed repeatedly, then it
would be appropriate to keep that page cached.
Please personally CC me on any responses.
thanks,
ashif.
-
Your analysis seems to be right. But to observe this behaviour you have to have the file open and just always reread it using the same file I also don't know why the condition is there but it's there at least for two years so I'm not sure anybody remembers ;). Nick, do you have an idea? Honza -- Jan Kara <jack@suse.cz> SuSE CR Labs -
Did it always use ra->prev_page? ISTR it using pos%PAGE_SIZE == 0 at some Yeah it is there because that is basically how our "use once" detection handles the case where an app does not read in chunks that are PAGE_SIZE multiples and PAGE_SIZE aligned. -
OK, I see. Then I'm not sure the check does more good than bad. Because if we happen to reread the same chunk several times, then the check does a wrong thing... Honza -- Jan Kara <jack@suse.cz> SuSE CR Labs -
Thanks for providing me with additional information. I would like to submit a patch to fix the performance problem. The simplest solution is to remove the check. Even in the situation where an application does not read in PAGE_SIZE multiples as described above, if the page is accessed frequently it should remain in the cache. However, I am open to suggestions for a more sophisticated scheme. ashif. -
Well, yes, that's an obvious solution. I just wanted to make sure that such change won't break some other load. But so far it seems it won't... Honza -- Jan Kara <jack@suse.cz> SuSE CR Labs -
I would like to find out if the change I am suggesting will cause any other problems. Is there someone else I should contact before submitting a patch? Or should I just wait another day or two for comments? ashif. -
I don't think you'll get more feedback now. Just write the patch, benchmark it with your workload and also some others (basically, there should be no difference) and then submit the patch together with the performance numbers and some rationale (if it's your first submission, see Documentation/SubmittingPatches). Give also CC to Andrew <akpm@linux-foundation.org> and ask him to put it into -mm kernels for testing. Honza -- Jan Kara <jack@suse.cz> SuSE CR Labs -
This patch unconditionally calls mark_page_accessed to prevent pages, especially for small files, from being evicted from the page cache despite frequent access. Signed-off-by: Ashif Harji <asharji@beta.uwaterloo.ca> --- If the same page of a file is repeatedly accessed (without accessing other pages of that file) via the same file descriptor, mark_page_accessed is never called after the first time the page is accessed. The implication of this code is that for files of size less than or equal to a single page, the page associated with such a file is likely to get evicted from the cache regardless of how frequently it is accessed. However, this behaviour also occurs with files of any size if the same page is repeatedly accessed. As a benchmark, I have an experimental web server that uses sendfile to repeatedly transmit files. The files are based on the static portion of the SPECweb99 fileset and range in size to model a reasonable workload. With this workload, a significant number of the requests are for files of size 4 KB or less. By changing the kernel to always call mark_page_accessed, the server throughput is increased by as much as 20%. With one test, for example, without the change I get throughput of around 868 Mbps. After making the change, performance increases to 1111 Mbps. Using a configuration that should be unaffected by the change, performance was around 855 Mbps without the change and around 851 Mbps with the change. As expected the change had no appreciable effect. See thread http://lkml.org/lkml/2007/3/9/403 for additional discussion on this change. This patch is for kernel version 2.6.20.1. Andrew, can you also put this change into the -mm kernels for testing? --- linux-2.6.20.1/mm/filemap.c.orig 2007-03-14 10:31:58.000000000 -0500 +++ linux-2.6.20.1/mm/filemap.c 2007-03-13 16:11:54.000000000 -0500 @@ -943,12 +943,7 @@ page_ok: if (mapping_writably_mapped(mapping)) flush_dcache_page(page); - /* - ...
I guess the downside to this is if a reader is reading a large file, or several files, sequentially with a small read size (smaller than PAGE_SIZE), the pages will be marked active after just one read pass. My gut says the benefits of this patch outweigh the cost. I would expect real-world backup apps, etc. to read at least PAGE_SIZE. Shaggy -- David Kleikamp IBM Linux Technology Center -
Hi, I also think that the patch is somewhat problematic, since the original intention seems to have been a reduction of the number of (expensive?) mark_page_accessed() calls, but this of course falls flat on its face in case of permanent single-page accesses or accesses with progressing but very small read size (single-byte reads or so), since the cached page content will expire eventually due to lack of mark_page_accessed() updates; thus this patch decided to call mark_page_accessed() unconditionally which may be a large performance penalty for subsequent tiny-sized reads. I've been thinking hard how to avoid the mark_page_accessed() starvation in case of a fixed, (almost) non-changing access state, but this seems hard since it'd seem we need some kind of state management here to figure out good intervals of when to call mark_page_accessed() *again* for this page. E.g. despite non-changing access patterns you could still call mark_page_accessed() every 32 calls or so to avoid expiry, but this would need extra helper variables. A rather ugly way to do it may be to abuse ra.cache_hit or ra.mmap_hit content with a if ((prev_index != index) || (ra.cache_hit % 32 == 0)) mark_page_accessed(page); This assumes that ra.cache_hit gets incremented for every access (haven't checked whether this is the case). That way (combined with an enhanced comment properly explaining the dilemma) you would avoid most mark_page_accessed() invocations of subsequent same-page reads but still do page status updates from time to time to avoid page deprecation. Does anyone think this would be acceptable? Any better idea? Andreas Mohr P.S.: since I'm not too familiar with this area I could be rather wrong after all... -
mark_page_accessed() isn't expensive. If called repeatedly, starting with the third call, it will check two page flags and return. The only real expense is that the page appears busier than it may be and will be Any application doing many tiny-sized reads isn't exactly asking for I wouldn't go looking for anything more complicated than Ashif's patch, I could be missing something as well. :-) Shaggy -- David Kleikamp IBM Linux Technology Center -
If we allow mark_page_accessed() called multiple times for a single page, a scan of large file with small-size reads would flush the buffer cache. mark_page_accessed() also requests lru_lock when moving page from -
The problem with the existing logic is that it is too coarse. In trying to deal with one usage pattern it is negatively impacting performance for other reasonable access patterns. Further, consider the extreme case of scanning a file 1 byte at a time. In this case, you are going to access a page over 4000 times, but that page is not going to be marked as active and hence that page is likely to be evicted from the cache. Clearly, there are cases when scanning a file that you would like the pages to be kept in the cache. Finally, the existing code is problematic as there is no reasonable way to circumvent the negative impact for small files. Hence, I think a change is necessary. The question is whether the intent of conditionally calling mark_page_accessed() is still reasonable and whether the amount of bookkeeping required to detect that usage pattern but not create a problem for other usage patterns is reasonable. I would tend to agree with David that: "Any application doing many tiny-sized reads isn't exactly asking for great performance." As well, applications concerned with performance and caching problems can read in a file in PAGE_SIZE chunks. I still think the simple fix of removing the condition is the best approach, but I'm certainly open to alternatives. ashif. -
A possible alternative might be to store the offset within the page in the readahead state, and call mark_page_accessed() when the read offset is less than or equal to the previous offset. -- David Kleikamp IBM Linux Technology Center -
That could be a good idea. We definitely want to look at ways to solve with within the existing approach before any large scale change in behaviour. -
Yes, the problem of falsely activating pages when the file is read in small hunks is worse than the problem which your patch fixes. We could change it so that if the current read() includes the zeroeth byte of the page, we run mark_page_accessed() even if this_page==prev_page? -
Really? I would have expected all performance sensitive apps to read in >=PAGE_SIZE chunks. And if they don't because they split their dataset in blocks (like some database), it may not be so wrong to activate those pages that have two "hot" blocks more aggressively than those pages with a single hot block. So I've an hard time to advocate to prefer the current behavior, but certainly this can be "fixed" by caching the last_offset like others pointed out ;) -
On Thu, 15 Mar 2007 22:49:23 +0100 But the problem which is being fixed here is really obscure: an application repeatedly reading the first page and only the first page of a file, always via the same fd. I'd expect that the sub-page-size read scenarion happens heaps more often than that, especially when dealing with larger PAGE_SIZEs. -
Whatever that app is doing, clearly we have to keep those 4k in cache! Like obviously the specweb demonstrated that as long as you are _repeating_ the same read, it's correct to activate the page even if it was reading from the same page as before. What is wrong is to activate the page more aggressively if it's _different_ parts of the page that are being read in a contiguous way. I thought that the whole point of the ra.prev_page was to detect _contiguous_ (not random) I/O made with a small buffer, anything else doesn't make much sense to me. In short I think taking a ra.prev_offset into account as suggested by Dave Kleikamp is the best, it may actually benefit the obscure app too ;) -
Like this? :) http://surriel.com/patches/clockpro/2.6.12/useonce-cleanup -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic. -
What if you did something like
if (jiffies%32) {...
(Possibly scaling it so the low-order bits change). No need to lock it, as
"right most of the time" is close enough.
Bad idea. That way you would only count page accesses if the phase of the moon^Wjiffie is just right. -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic. -
On the other hand, Andreas suggested only marking it once every 32 calls, but that required a helper variable. Statistically, jiffies%32 should end up about the same as a helper variable %32. This of course, if just calling mark_page_accessed() is actually expensive enough that we don't want to do it unconditionally.
Not caching a needed page and having to wait for a disk seek to complete will be *way* more expensive than any call to mark_page_accessed(). A modern CPU can do somewhere on the order of 50 million instructions in the time it takes to bring one page in from disk. However, this does not mean we should unconditionally call mark_page_accessed(), since that could cause use to push wanted data out of the cache because of one program that does its streaming accesses in a strange way... This is a situation where getting it right almost certainly matters. -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic. -
PAGE_SIZE being 8k on sparc, 16-64k on ia64 and potentially 64kb on powerpc :) Id expect a large percentage of files to be below that size. Anton -
Since we're hackling over the use-once stuff again... /me brings up: http://marc.info/?l=linux-mm&m=115316894804385&w=2 and ducks. -
Join the club ;) http://groups.google.com.au/group/linux.kernel/msg/7b3237b8e715475b?hl=en& I can't find the patch where I actually did combine it with a PG_usedonce bit, but the end result is pretty similar to your patch. And I think one or two others have also independently invented the same thing. So it *has* to be good, doesn't it? ;) -
Except for the bit where vmscan throws away page cache pages with PageReferenced set, unless they are mapped :) Also, pages that only got accessed once will have the referenced bit set too, so your patch gives no way to distinguish between pages that were accessed once and pages that were accessed multiple No argument there :) -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic. -
I like mine better -- it leaves the comment:
How about this? It also doesn't break the use-once heuristic. -- A change to make database style random read() workloads perform better, by calling mark_page_accessed for some non-page-aligned reads broke the case of < PAGE_CACHE_SIZE files, which will not get their prev_index moved past the first page. Combine both heuristics for marking the page accessed. Signed-off-by: Nick Piggin <npiggin@suse.de> Index: linux-2.6/mm/filemap.c =================================================================== --- linux-2.6.orig/mm/filemap.c +++ linux-2.6/mm/filemap.c @@ -929,7 +929,7 @@ page_ok: * When (part of) the same page is read multiple times * in succession, only mark it as accessed the first time. */ - if (prev_index != index) + if (prev_index != index || !offset) mark_page_accessed(page); prev_index = index; -
Acked-by: Rik van Riel <riel@redhat.com> -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic. -
Yeah, yeah, I'm not a real mman, I don't have my own patch and website for this ;) but I'm old, let me mumble some history... Ashif's patch would take us back to 2.4.10 when mark_page_accessed was introduced: in 2.4.11 someone (probably Andrea) immediately added a !offset || !filp->f_reada condition on it there, which remains in 2.4 to this day. That _probably_ means that Ashif's patch is suboptimal, and that your !offset patch is good. f_reada went away in 2.5.8, and the !offset condition remained until 2.6.11, when Miquel (CC'ed) replaced it by today's prev_index condition. His changelog entry appended below. Since it's Miquel who removed the !offset condition, he should be consulted on its reintroduction. Miquel's patch comment from ChangeLog-2.6.11: [PATCH] mark_page_accessed() for read()s on non-page boundaries When reading a (partial) page from disk using read(), the kernel only marks the page as "accessed" if the read started at a page boundary. This means that files that are accessed randomly at non-page boundaries (usually database style files) will not be cached properly. The patch below uses the readahead state instead. If a page is read(), it is marked as "accessed" if the previous read() was for a different page, whatever the offset in the page. Testing results: - Boot kernel with mem=128M - create a testfile of size 8 MB on a partition. Unmount/mount. - then generate about 10 MB/sec streaming writes for i in `seq 1 1000` do dd if=/dev/zero of=junkfile.$i bs=1M count=10 sync cat junkfile.$i > /dev/null sleep 1 done - use an application that reads 128 bytes 64000 times from a random offset in the 64 MB testfile. 1. Linux 2.6.10-rc3 vanilla, no streaming writes: # time ~/rr testfile Read 128 bytes 64000 times ~/rr testfile 0.03s user 0.22s system 5% cpu 4.456 total 2. Linux 2.6.10-rc3 vanilla, streaming writes: # time ~/rr testfile Read 128 bytes 64000 times ~/rr testfile 0.03s user 0.16s system 2% ...
Yeah I did go back and check up on that changelog, because I knew we had a !offset check there at one stage, which is immune to this problem (or at least can handle it a little better). I suspect that Miquel was probably more interested in _increasing_ mark_page_accessed coverage with his new condition than restricting it from the !offset cases. Thanks for digging it up and posting here, though. -
the !offset check looks a pretty broken heuristic indeed, it would break random I/O. The real fix is to add a ra.prev_offset along with ra.prev_page, and if who implements it wants to be stylish he can as well use a ra.last_contiguous_read structure that has a page and offset fields (and then of course remove ra.prev_page). -
I wouldn't call it broken. At worst, I'd say it's imperfect. But that's the nature of a heuristic. It most likely works in a huge I suggested something along these lines, but I wonder if it's overkill. The !offset check is simple and appears to be a decent improvement over the current code. -- David Kleikamp IBM Linux Technology Center -
well, IMHO in the huge majority of cases the prev_page check isn't necessary in the first place (and IMHO it hurts a lot more than it can help, as demonstrated by specweb, since we'll bite on the good guys to help the bad guys). The only case where I can imagine the prev_page to make sense is to handle contiguous I/O made with a small buffer, so clearly an inefficient code in the first place. But if this guy is reading with <PAGE_SIZE buffer there's no guarantee that he's reading f_pos aligned either, hence the need of taking last_offset into account too so at least it's a "perfect" heuristic that will reliably detect contiguous I/O no matter in what shape or form you execute it, as long as it is contiguous I/O. Any other variation of behavior besides the autodetection of contiguous I/O run in whatever buffer/aligned form, should be mandated by userland through fadvise/madvise IMHO or we run into the toes of the good guys. -
It sounds like people are happy with the fix suggested by Nick. That fix is okay with me as it fixes the problem I am having. I suspect, however, that by not directly detecting the problematic access pattern, where the file is accessed sequentially in small hunks, other applications may experience performance problems related to caching. For example, if an application frequently and non-sequentially reads from the same page. This is especially true for files of size < PAGE_CACHE_SIZE. But, I'm not sure if such an access pattern likely. ashif. -
Well in general we like to help applications that help themselves. It is actually a good heuristic, surprisingly. If an application randomly accesses the same page (and there is no write activity going on), then it would be better off to cache it in userspace, and if it doesn't care to do that then it won't mind having to read it off disk now and again :) -
Hi, Sounds like a good plan since this probably is a nice way to make stupid apps doing stupid things sit up and take notice, and maybe the authors will then go so far as fixing up a few more things that are hurting them once they actually recognize that something is weird due to overly bad performance. Why go to great lengths to support stupid apps when there are still so many things which could be done to help well-behaving ones? ;) Andreas Mohr -
Yes, this problem requires the file to be accessed via the same file descriptor. I suspect this behaviour is common for a certain class of applications, e.g., web servers. However, any application that read from the same page repeatedly would also experience this problem. It is more likely to occur with small files, but not exculsively. ashif. -
| Greg KH | Og dreams of kernels |
| Jens Axboe | [PATCH 31/33] Fusion: sg chaining support |
| Arnd Bergmann | Re: finding your own dead "CONFIG_" variables |
| Mark Brown | [PATCH 2/2] Subject: natsemi: Allow users to disable workaround for DspCfg reset |
| Tony Breeds | [LGUEST] Look in object dir for .config |
git: | |
| Brian Downing | Re: Git in a Nutshell guide |
| John Benes | Re: master has some toys |
| Matthias Lederhofer | [PATCH 4/7] introduce GIT_WORK_TREE to specify the work tree |
| Alexander Sulfrian | [RFC/PATCH] RE: git calls SSH_ASKPASS even if DISPLAY is not set |
| Junio C Hamano | Re: Rss produced by git is not valid xml? |
| Linux Kernel Mailing List | iSeries: fix section mismatch in iseries_veth |
| Linux Kernel Mailing List | ixbge: remove TX lock and redo TX accounting. |
| Linux Kernel Mailing List | ixgbe: fix several counter register errata |
| Linux Kernel Mailing List |
