The RSS bits really worry me, since it looks like they could exacerbate the scalability problems that we are already running into on very large memory systems. Linux is *not* happy on 256GB systems. Even on some 32GB systems the swappiness setting *needs* to be tweaked before Linux will even run in a reasonable way. Pageout scanning needs to be more efficient, not less. The RSS bits are worrisome... -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic. -
Using a zone-per-container or N-64MB-zones-per-container should actually move us in the direction of *fixing* any such problems. Because, to a first-order, the scanning of such a zone has the same behaviour as a 64MB machine. (We'd run into a few other problems, some related to the globalness of the Please send testcases. -
Quite possibly. Taking software zones from the other large mail I sent, one could get the 64MB effect by increasing MAX_ORDER_NR_PAGES to be 64MB in pages. To avoid external fragmentation issues, I'd prefer of course if these container zones consisted of mainly contiguous memory but with It would be fixable, especially if containers do their own reclaim on their container zones and not kswapd. Writing dirty data back periodically would -- -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -
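A rough sketch of the arithmetic behind "MAX_ORDER_NR_PAGES to be 64MB in pages". It assumes the usual relation MAX_ORDER_NR_PAGES = 1 << (MAX_ORDER - 1); the helper names are made up for illustration and are not kernel functions.

```c
#include <assert.h>

/* Hypothetical helper: smallest buddy order whose block of
 * (1 << order) pages covers at least block_bytes. */
static unsigned int order_for_block(unsigned long long block_bytes,
                                    unsigned long long page_bytes)
{
    assert(page_bytes > 0 && block_bytes >= page_bytes);

    unsigned long long pages = (block_bytes + page_bytes - 1) / page_bytes;
    unsigned int order = 0;

    while ((1ULL << order) < pages)
        order++;
    return order;
}

/* Assumption: MAX_ORDER_NR_PAGES == 1 << (MAX_ORDER - 1),
 * so MAX_ORDER is the largest block order plus one. */
static unsigned int max_order_for_block(unsigned long long block_bytes,
                                        unsigned long long page_bytes)
{
    return order_for_block(block_bytes, page_bytes) + 1;
}
```

With 4KB pages, order_for_block(64ULL << 20, 4096) gives order 14 (16384 pages), i.e. MAX_ORDER would have to be 15. With 16KB pages the same 64MB block is order 12.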
It is not happy if you put 256GB into one zone. We are fine with 1k nodes with 8GB each and a 16k page size (which reduces the number of page_structs to manage to a fourth). So the total memory is 8TB which is significantly larger than 256GB. If we do this node/zone merging and reassign MAX_ORDER blocks to virtual node/zones for containers (with their own LRU etc) then this would also reduce the number of page_structs on the list and may make things a bit easier. We would then produce the same effect as the partitioning via NUMA nodes on our 8TB boxes. However, then you still have a bandwidth issue since your 256GB system likely only has a single bus and all memory traffic for the node/zones has to go through this single bottleneck. That bottleneck does not exist on NUMA machines. -
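To put numbers on the comparison above, here is a small sketch that counts page structs per zone. The 4KB page size for the 256GB box is an assumption (typical for x86-64), not something stated in the thread, and nr_page_structs() is a made-up helper.

```c
#include <assert.h>

/* Hypothetical helper: one struct page per page frame. */
static unsigned long long nr_page_structs(unsigned long long mem_bytes,
                                          unsigned long long page_bytes)
{
    assert(page_bytes > 0);
    return mem_bytes / page_bytes;
}
```

Under those assumptions, one 8GB node with 16KB pages has nr_page_structs(8ULL << 30, 16384) = 524288 page structs to scan, while a single 256GB zone with 4KB pages has nr_page_structs(256ULL << 30, 4096) = 67108864, a factor of 128 more on one set of LRU lists.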
Oh come on. What's the workload? What happens? system time? user time? kernel profiles? -
I can't share all the details, since a lot of the problems are customer workloads. One particular case is a 32GB system with a database that takes most of memory. The amount of actually freeable page cache memory is in the hundreds of MB. With swappiness at the default level of 60, kswapd ends up eating most of a CPU, and other tasks also dive into the pageout code. Even with swappiness as high as 98, that system still has problems with the CPU use in the pageout code! Another typical problem is that people want to back up their database servers. During the backup, parts of the working set get evicted from the VM and performance is horrible. A third scenario is where a system has way more RAM than swap, and not a whole lot of freeable page cache. In this case, the VM ends up spending WAY too much CPU time scanning and shuffling around essentially unswappable anonymous memory and tmpfs files. I have briefly characterized some of these working sets on: http://linux-mm.org/ProblemWorkloads One thing I do not yet have is easily runnable test cases. I know the problems that happen because customers run into them, but it is not as easy to reproduce on test systems... -
On Fri, 02 Mar 2007 12:43:42 -0500 userspace fixes for this are far, far better than any magic goo the kernel can implement. We really need to get off our butts and start educating Well we've allegedly fixed that, but it isn't going anywhere without testing. -
The memory is likely in use but there is enough memory free in unmapped clean pagecache pages so that we occasionally are able to free pages. Then the app is reading more from disk replenishing that ... Thus we are forever cycling through the LRU lists moving pages between We have fixed the case in which we compile the kernel without swap. Then anonymous pages behave like mlocked pages. Did we do more than that? -
In this particular case, the system even has swap free. The kernel just chooses not to use it until it has scanned some memory, due to the way the swappiness algorithm works. With 32 CPUs diving into the page reclaim simultaneously, each trying to scan a fraction of memory, this is disastrous Not AFAIK. I would like to see separate pageout selection queues for anonymous/tmpfs and page cache backed pages. That way we can simply scan only what we want to scan. There are several ways available to balance pressure between both sets of lists. Splitting them out will also make it possible to do proper use-once replacement for the page cache pages. I.e. leaving the really active page cache pages on the page cache active list, instead of deactivating them because they're lower priority than anonymous pages. That way we can do a backup without losing the page cache working set. -
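A toy model of why separate queues cut scanning work. It is only an illustration of the idea in the message above, not kernel code: the types and function names are invented, and real reclaim also has to handle active/inactive aging and locking.

```c
#include <assert.h>
#include <stddef.h>

enum page_kind { PAGE_ANON, PAGE_FILE };   /* hypothetical page tags */

/* Mixed LRU: walk the list in order until `want` file pages are found.
 * Returns how many pages had to be examined. */
static size_t scan_mixed(const enum page_kind *lru, size_t n, size_t want)
{
    assert(lru != NULL || n == 0);

    size_t examined = 0, found = 0;

    while (examined < n && found < want) {
        if (lru[examined] == PAGE_FILE)
            found++;
        examined++;   /* anon pages are examined and skipped */
    }
    return examined;
}

/* Split LRU: the file list holds only file pages, so every page
 * examined is a candidate. Returns how many pages are examined. */
static size_t scan_split(size_t nr_file, size_t want)
{
    return want < nr_file ? want : nr_file;
}
```

For a list laid out as three anonymous pages followed by one file page, twice over, finding two file pages costs eight examinations on the mixed list and two on the split file list. On a 32GB box where freeable page cache is a few hundred MB, the ratio is correspondingly worse.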
Well I would expect this to have marginal improvements and delay the inevitable for a while until we have even bigger memory. If the app uses mmapped data areas then the problem is still there. And such tinkering does not solve the issue of large scale I/O requiring the handling of gazillions of page structs. I do not think that there is a way around somehow handling larger chunks of memory in an easier way. We already do handle larger page sizes for some limited purposes and with huge pages we already have a larger page size. Mel's defrag/anti-frag patches are necessary to allow us to deal with the resulting fragmentation problems. -
I suspect we would not need to treat mapped file backed memory any different from page cache that's not mapped. After all, if we do proper use-once accounting, the working set will be on the active list and other cache will be flushed out the inactive list quickly. Also, the IO cost for mmapped data areas is the same as the IO cost for unmapped files, so there's no IO reason to treat them differently, either. -
Thundering herds of a sort pounding the LRU locks from direct reclaim have set off the NMI oopser for users here. -- wli -
Ditto here. The main reason they end up pounding the LRU locks is the swappiness heuristic. They scan too much before deciding that it would be a good idea to actually swap something out, and with 32 CPUs doing such scanning simultaneously... -
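For readers who do not have the heuristic in their head, here is a simplified sketch of how the swappiness decision works in 2.6-era mm/vmscan.c as far as I recall it. Treat the formula as an assumption: it is not quoted anywhere in this thread, the real code uses min(zone->prev_priority, priority) for the distress term, and the function names below are invented.

```c
#include <assert.h>

#define DEF_PRIORITY 12   /* reclaim starts here and counts down to 0 */

/* Distress rises as the scan priority drops toward 0. */
static int distress(int priority)
{
    assert(priority >= 0 && priority <= DEF_PRIORITY);
    return 100 >> priority;
}

/* Mapped pages become reclaim candidates once this reaches 100. */
static int swap_tendency(int mapped_ratio, int priority, int swappiness)
{
    return mapped_ratio / 2 + distress(priority) + swappiness;
}

/* First priority (counting down from DEF_PRIORITY) at which mapped
 * pages would be reclaimed, or -1 if that never happens. */
static int first_reclaim_mapped_priority(int mapped_ratio, int swappiness)
{
    for (int prio = DEF_PRIORITY; prio >= 0; prio--)
        if (swap_tendency(mapped_ratio, prio, swappiness) >= 100)
            return prio;
    return -1;
}
```

With 50% of memory mapped and swappiness at 60, the tendency at the starting priority is 25 + 0 + 60 = 85, and it only crosses 100 at priority 2. Each lower priority scans a larger share of the lists, so several increasingly expensive passes go by before anything mapped is considered, and every CPU in direct reclaim repeats them.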
On Fri, 02 Mar 2007 16:19:19 -0500 What kernel version? -
Customers are on the 2.6.9 based RHEL4 kernel, but I believe we have reproduced the problem on 2.6.18 too during stress tests. I have no reason to believe we should stick our heads in the sand and pretend it no longer exists on 2.6.21. -
On Fri, 02 Mar 2007 17:03:10 -0500 Opterons seem to be particularly prone to lock starvation where a cacheline I have no reason to believe anything. All I see is handwaviness, speculation and grand plans to rewrite vast amounts of stuff without even a testcase to demonstrate that said rewrite improved anything. None of this is going anywhere, is it? -
We tested them. They only alleviate the problem slightly in good situations, but things still fall apart badly with less Your attitude is exactly why the VM keeps falling apart over and over again. Fixing "a testcase" in the VM tends to introduce problems for other test cases, ad infinitum. There's a reason we end up fixing the same bugs over and over again. I have been looking through a few hundred VM related bugzillas and have found the same bugs persist over many different versions of Linux, sometimes temporarily fixed, but they seem I will test my changes before I send them to you, but I cannot promise you that you'll have the computers or software needed to reproduce the problems. I doubt I'll have full time access to such systems myself, either. 32GB is pretty much the minimum size to reproduce some of these problems. Some workloads may need larger systems to easily trigger them. -
We can find a 32GB system here pretty easily to test things on if need be. Setting up large commercial databases is much harder. I don't have such a machine in the public set of machines we're going to push to test.kernel.org from at the moment, but will see if I can arrange it in the future if it's important. M. -
That's my problem, too. There does not seem to exist any single set of test cases that accurately predicts how the VM will behave with customer workloads. The one thing I can do relatively easily is go through a few hundred bugzillas and figure out what kinds of problems have been plaguing the VM consistently over the last few years. I just finished doing that, and am trying to come up with fixes for the problems that just don't seem to be easily fixable with bandaids... -
Tracing might help? Showing Andrew traces of what happened in production for the prev_priority change made it much easier to demonstrate and explain the real problem ... M. -
On Fri, 02 Mar 2007 15:28:43 -0800 Tracing is one way. The other way is the old scientific method:

- develop a theory
- add sufficient instrumentation to prove or disprove that theory
- run workload, crunch on numbers
- repeat

Of course, multiple theories can be proven/disproven in a single pass. Practically, this means adding one new /proc/vmstat entry for each `goto keep*' in shrink_page_list(). And more instrumentation in shrink_active_list() to determine the behaviour of swap_tendency. Once that process is finished, we should have a thorough understanding of what the problem is. We can then construct a testcase (it'll be a couple hundred lines only) and use that testcase to determine what implementation changes are needed, and whether it actually worked. Then go back to the real workload, verify that it's still fixed. Then do whitebox testing of other workloads to check that they haven't regressed. -
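The "one counter per `goto keep*'" idea can be sketched in plain C as follows. The reason names and helpers are hypothetical, chosen only for illustration; in the kernel these would be vmstat event counters bumped at each exit label, and the exact set of reasons would come from reading shrink_page_list() itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reasons a page is kept rather than reclaimed. */
enum keep_reason {
    KEEP_LOCKED,
    KEEP_WRITEBACK,
    KEEP_REFERENCED,
    KEEP_NO_SWAP,
    KEEP_NR_REASONS
};

struct keep_stats {
    unsigned long count[KEEP_NR_REASONS];
};

/* Called at each "goto keep*" site with the reason for that site. */
static void count_keep(struct keep_stats *s, enum keep_reason r)
{
    assert(s != NULL && (int)r >= 0 && r < KEEP_NR_REASONS);
    s->count[r]++;
}

/* After a run, report which reason dominated (lowest index on ties). */
static enum keep_reason dominant_reason(const struct keep_stats *s)
{
    assert(s != NULL);

    int best = 0;

    for (int i = 1; i < KEEP_NR_REASONS; i++)
        if (s->count[i] > s->count[best])
            best = i;
    return (enum keep_reason)best;
}
```

Running the workload and then reading the counters tells you which branch the scanner is spinning on, which is the "crunch on numbers" step: a theory such as "most pages are kept because there is no swap slot" is confirmed or refuted by a single counter.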
Hundreds of disks all doing IO at once may also be needed, as wli points out. Such systems are not readily available for testing. -
On Fri, 02 Mar 2007 17:34:31 -0500 What is it with vendors finding MM problems and either not fixing them or kludging around them and not telling the upstream maintainers about *any* In that case it was a bad fix. The aim is to fix known problems without introducing regressions in other areas. A perfectly legitimate approach. 32GB isn't particularly large. Somehow I don't believe that a person or organisation which is incapable of preparing even a simple testcase will be capable of fixing problems such as this without breaking things. -
I don't believe anybody who relies on one simple test case will ever be capable of evaluating a patch without breaking things. Test cases can show problems, but fixing a test case is no guarantee at all that your VM will behave ok with real world workloads. Test cases for the VM can *never* be relied on to show that a problem went away. I'll do my best, but I can't promise a simple test case for every single problem that's plaguing the VM. -
I'm not in the business of defending vendors, but a lot of times the base is so far downrev it's difficult to relate it to much of anything current. It may be best not to say precisely how far downrev things can get, since some of these things are so old even distro vendors won't touch them. My gut feeling is to agree, but I get nagging doubts when I try to think of how to boil things like [major benchmarks whose names are trademarked/copyrighted/etc. censored] down to simple testcases. Some other things are obvious but require vast resources, like zillions of disks fooling throttling/etc. heuristics of ancient downrev kernels. I guess for those sorts of things the voodoo incantations, chicken blood, and carcasses of freshly slaughtered goats come out. Might as well throw in a Tarot reading and some tea leaves while I'm at it. My tack on basic stability was usually testbooting on several arches, which various people have an active disinterest in (suggesting, for example, that I throw out all of my sparc32 systems and replace them with Opterons, or that anything that goes wrong on ia64 is not only irrelevant but also that neither I nor anyone else should ever fix them; you know who you are). It's become clear to me that this is insufficient, and that I'll need to start using some sort of suite of regression tests, at the very least to save myself the embarrassment of acking a patch that oopses when exercised, but also to elevate the standard. -- wli -
On Fri, 2 Mar 2007 17:40:04 -0800 noooooooooo. You're approaching it from the wrong direction. Step 1 is to understand what is happening on the affected production system. Completely. Once that is fully understood then it is a relatively simple matter to concoct a test case which triggers the same failure mode. It is very hard to go the other way: to poke around with various stress tests which you think are doing something similar to what you think the application does in the hope that similar symptoms will trigger so you can then work out what the kernel is doing. yuk. -
Yeah, it's really great when it's possible to get debug info out of people e.g. they're willing to boot into a kernel instrumented with the appropriate printk's/etc. Most of the time it's all guesswork. People who post to lkml are much better about all this on average. I never truly understood the point of kprobes/jprobes/dprobes (or whatever the probing letter is), crash dumps, and so on until I ran into this, not that I personally use them (though I may yet start). Most of the time I just read the code instead and smoke out what could be going on by something like the process of devising counterexamples. For instance, I told that colouroff patch guy about the possibility of getting the wrong page for the start of the buffer from virt_to_page() on a cache colored buffer pointer (clearly cache->gfporder >= 4 in such a case). Deriving the head page without __GFP_COMP might be considered to be ugly-looking, though. -- wli -
The first thing done by timespec_trunc() is:

    if (gran <= jiffies_to_usecs(1) * 1000)

This should really be a test against a constant known at compile time. Alas, it isn't. jiffies_to_usecs() was uninlined, so the C compiler emits a function call and a multiply to compute a CONSTANT:

    mov    $0x1,%edi
    mov    %rbx,0xffffffffffffffe8(%rbp)
    mov    %r12,0xfffffffffffffff0(%rbp)
    mov    %edx,%ebx
    mov    %rsi,0xffffffffffffffc8(%rbp)
    mov    %rsi,%r12
    callq  ffffffff80232010 <jiffies_to_usecs>
    imul   $0x3e8,%eax,%eax
    cmp    %ebx,%eax

This patch reorders kernel/time.c a bit so that jiffies_to_usecs() is defined before timespec_trunc(), so that the compiler now generates:

    cmp    $0x3d0900,%edx

(HZ=250 on my machine)

This gives better code (timespec_trunc() becomes a leaf function), and a smaller kernel as well.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
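As a sanity check on the constant in the generated code, here is a standalone sketch of the computation. It is a simplification that assumes HZ divides 1,000,000 evenly (true for HZ=100, 250, 1000); the function names are stand-ins and not the kernel's implementation.

```c
#include <assert.h>

/* Simplified stand-in for jiffies_to_usecs(): microseconds per jiffy
 * times the jiffy count, valid only when hz divides 1,000,000. */
static unsigned int jiffies_to_usecs_sketch(unsigned long j, unsigned int hz)
{
    assert(hz > 0 && 1000000u % hz == 0);
    return (unsigned int)((1000000u / hz) * j);
}

/* The value timespec_trunc() compares gran against, in nanoseconds. */
static unsigned int trunc_threshold_ns(unsigned int hz)
{
    return jiffies_to_usecs_sketch(1, hz) * 1000u;
}
```

For HZ=250 one jiffy is 4000 microseconds, so trunc_threshold_ns(250) is 4,000,000, which is 0x3d0900 and matches the immediate operand in the cmp instruction above.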
AIUI that phenomenon is universal to NUMA. Maybe it's time we reexamined our locking algorithms in the light of fairness considerations. -- wli -
On Fri, 2 Mar 2007 16:33:19 -0800 It's also a multicore thing. iirc Kiran was seeing it on Intel CPUs. I expect the phenomenon would be observable on a number of locks in the kernel, given the appropriate workload. We just hit it first on lru_lock. I'd have thought that increasing SWAP_CLUSTER_MAX by two or four orders of magnitude would plug it, simply by decreasing the acquisition frequency, but I think Kiran fiddled with that to no effect. See below for Linus's thoughts, forwarded without permission.

Begin forwarded message:

Date: Mon, 22 Jan 2007 13:49:02 -0800 (PST)
From: Linus Torvalds <torvalds@linux-foundation.org>
To: Andrew Morton <akpm@osdl.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>, Ravikiran G Thirumalai <kiran@scalex86.org>
Subject: Re: High lock spin time for zone->lru_lock under extreme conditions

I think people need to realize that spinlocks are always going to be unfair, and *extremely* so under some conditions. And yes, multi-core brought those conditions home to roost for some people (two or more cores much closer to each other than others, and able to basically ping-pong the spinlock to each other, with nobody else ever able to get it).

There's only a few possible solutions:

- use the much slower semaphores, which actually try to do fairness.
- if you cannot sleep, introduce a separate "fair spinlock" type. It's going to be appreciably slower (and will possibly have a bigger memory footprint) than a regular spinlock, though. But it's certainly a possible thing to do.
- make sure no lock that you care about ever has high enough contention to matter.

NOTE! back-off etc simply will not help. This is not a back-off issue. Back-off helps keep down coherency traffic, but it doesn't help fairness.

If somebody wants to play with fair spinlocks, go wild. I looked at it at one point, and it was not wonderful. It's pretty complicated to do, and the best way I could come up with was ...
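The forwarded message is cut off before it describes an approach, so the following is not a reconstruction of it. It is a minimal userspace sketch of one well-known way to get a "fair spinlock", the ticket lock, written with C11 atomics. The struct and function names are invented, and a kernel version would also need per-architecture relax hints and interrupt handling.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct ticket_lock {
    atomic_uint next;    /* next ticket to hand out */
    atomic_uint owner;   /* ticket currently allowed to hold the lock */
};

static void ticket_lock_init(struct ticket_lock *l)
{
    assert(l != NULL);
    atomic_init(&l->next, 0);
    atomic_init(&l->owner, 0);
}

/* Take a ticket, then spin until it is served. Waiters are admitted
 * strictly in arrival order, so close-by cores cannot ping-pong the
 * lock between themselves. Returns the ticket number taken. */
static unsigned int ticket_lock_acquire(struct ticket_lock *l)
{
    assert(l != NULL);

    unsigned int me = atomic_fetch_add_explicit(&l->next, 1,
                                                memory_order_relaxed);

    while (atomic_load_explicit(&l->owner, memory_order_acquire) != me)
        ;   /* busy-wait for our turn */
    return me;
}

/* Only the holder writes owner, so a plain load followed by a
 * release store is enough to pass the lock to the next ticket. */
static void ticket_lock_release(struct ticket_lock *l)
{
    assert(l != NULL);

    unsigned int cur = atomic_load_explicit(&l->owner,
                                            memory_order_relaxed);

    atomic_store_explicit(&l->owner, cur + 1, memory_order_release);
}
```

The trade-off matches the one named in the message: the lock needs two counters instead of one word, and every waiter spins on the same owner field, so it is fair but not cheaper than a plain spinlock under contention.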
This is a phenomenon that is usually addressed at the cache logic level. It's a hardware maturation issue. A certain package should not be allowed to hold onto a cacheline forever and other packages must have a minimum time when they can operate on that cacheline. -
I think when I last asked about that I was told "cache directories are too expensive" or something on that order, if I'm not botching this, too. In any event, the above shows a gross inaccuracy in my statement. -- wli -
That'd be nice. Unfortunately we're stuck in the real world with real hardware, and the situation is likely to remain thus for quite some time ... M. -
Our real hardware does behave as described and therefore does not suffer from the problem. If you want a software solution then you may want to look at Zoran Radovic's work on Hierarchical Backoff locks. I had a draft of a patch a couple of years back that showed some promise to reduce lock contention. HBO locks can solve starvation issues by stopping local lock takers. See Zoran Radovic "Software Techniques for Distributed Shared Memory", Uppsala Universitet, 2005 ISBN 91-554-6385-1. http://www.gelato.org/pdf/may2005/gelato_may2005_numa_lameter_sgi.pdf http://www.gelato.unsw.edu.au/archives/linux-ia64/0506/14368.html -
On Fri, 2 Mar 2007 10:15:36 -0800 (PST) oh yeah, we took the ran-out-of-swapcache code out. But if we're going to do this thing, we should find some way to bring it back. -
I know of one sounding similar to this where unreclaimable pages are pinned by refcounts held by bio's spread across about 850 spindles. It's mostly read traffic. Several different tunables could be used to work around it, nr_requests in particular, but also clamping down on dirty limits to preposterously low levels and setting preposterously large values of min_free_kbytes. Their kernel is, of course, substantially downrev (2.6.9-based IIRC), so douse things heavily with grains of salt. -- wli -
