The following patchset implements recursive reclaim. Recursive reclaim
is necessary if we run out of memory in the writeout path from reclaim.

This is f.e. important for stacked filesystems or anything that does
complicated processing in the writeout path.

Recursive reclaim works because it limits itself to only reclaim pages
that do not require writeout. It will only remove clean pages from the LRU.
The dirty throttling of the VM during regular reclaim ensures that the amount
of dirty pages is limited. If recursive reclaim causes too many clean pages
to be removed then regular reclaim will throttle all processes until the
dirty ratio is restored. This means that the amount of memory that can
be reclaimed via recursive reclaim is limited to clean memory. The default
ratio is 10%. This means that recursive reclaim can reclaim 90% of memory
before failing. Reclaiming excessive amounts of clean pages may have a
significant performance impact because this means that executable pages
will be removed. However, it ensures that we will no longer fail in the
writeout path.

A patch is included to test this functionality. The test involved allocating
12 Megabytes from the reclaim paths when __PF_MEMALLOC is set. This is enough
to exhaust the reserves.
--
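
The 10% / 90% figures above can be put into numbers with a small sketch.
It assumes, as the text does, that dirty throttling caps dirty pages at the
dirty ratio and that everything else is clean and therefore eligible for
recursive reclaim. The function name and the page counts are made up for
illustration and are not taken from the patchset.

```python
def reclaimable_by_recursive_reclaim(total_pages, dirty_ratio_percent=10):
    """Upper bound on pages recursive reclaim may free (clean pages only).

    Assumption: dirty throttling keeps dirty pages at or below
    dirty_ratio_percent of total_pages, so the remainder is clean.
    """
    max_dirty = total_pages * dirty_ratio_percent // 100
    return total_pages - max_dirty


# Hypothetical machine with 1,000,000 pages subject to the dirty ratio.
clean_budget = reclaimable_by_recursive_reclaim(1_000_000)
print(clean_budget)
```

Raising the dirty ratio shrinks the budget: at 40% only 60% of the pages
remain available to recursive reclaim.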
-
Hi Christoph,
Over the last two weeks we have tested your patch set in the context of
ddsnap, which used to be prone to deadlock before we added a series of
anti-deadlock measures, including Peter's anti-deadlock patch set, our
own bio throttling code and judicious use of PF_MEMALLOC mode. This
cocktail of patches finally banished the deadlocks, none of which have
been seen during several months of heavy testing. The question in
which you are interested no doubt, is whether your patch set also
solves the same deadlocks.

The results are mixed. I will briefly describe the test setup now. If
you are interested in specific details for independent verification, we
can provide the full recipe separately. We used the patches here:

http://zumastor.googlecode.com/svn/trunk/ddsnap/patches/2.6.21.1/

driven by the scripted storage application here:

http://zumastor.googlecode.com/svn/trunk/zumastor/

If we remove our anti-deadlock measures, including the ddsnap.vm.fixes
(a roll-up of Peter's patch set) and the request throttling code in
dm-ddsnap.c, and apply your patch set instead, we hit deadlock on the
socket write path after a few hours (traceback tomorrow). So your
patch set by itself is a stability regression.

There is also some good news for you here. The combination of our
throttling code, plus your recursive reclaim patches and some fiddling
with PF_LESS_THROTTLE has so far survived testing without deadlocking.
In other words, as far as we have tested it, your patch set can
substitute for Peter's and produce the same effect, provided that we
throttle the block IO traffic.

Just to recap, we have identified two essential ingredients in the
recipe for writeout deadlock prevention:

1) Throttle block IO traffic to a bounded maximum memory use.
2) Guarantee availability of the required amount of memory.
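
How the two ingredients fit together can be sketched roughly as follows.
This is only an illustration of the idea, not the ddsnap throttling code:
the class name, the byte sizes and the non-blocking try_submit() interface
are all assumptions made for the example.

```python
class InFlightThrottle:
    """Admit IO only while in-flight bytes stay under a fixed budget."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.in_flight = 0

    def try_submit(self, nbytes):
        # Refuse (the caller must wait) rather than exceed the budget.
        if self.in_flight + nbytes > self.max_bytes:
            return False
        self.in_flight += nbytes
        return True

    def complete(self, nbytes):
        # A completion returns budget so that a waiter can proceed.
        self.in_flight -= nbytes


reserve = 64 * 1024                    # ingredient 2: memory known to exist
throttle = InFlightThrottle(reserve)   # ingredient 1: never ask for more
admitted = [throttle.try_submit(16 * 1024) for _ in range(6)]
print(admitted)
```

With these made-up numbers four 16 KiB requests fit into the budget and the
remaining two are held back until a completion frees some of it.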
Now we have learned that (1) is not optional with either the peterz or
the clameter approach, and we are wondering which is the b...
Na, that cannot be the case since it only activates when an OOM condition

Efficiency is not a criterion for a rarely used emergency recovery

Peter's patch is much more invasive and requires a coupling of various

We have a global dirty page limit already. I fully support Peter's work on
dirty throttling.

These results show that Peter's invasive approach is not needed. Reclaiming
easily reclaimable pages when necessary is sufficient.
-
I did not express myself clearly then. Compared to our current
anti-deadlock patch set, your patch set is a regression. Because
without help from some of our other patches, it does deadlock.

That depends on how rarely used. Under continuous, heavy load this may

I agree that Peter's patch set is larger than necessary. I do not agree

I do not agree with that line of thinking. A single test load only
provides evidence, not proof. Your approach is not obviously correct,
quite the contrary. The tested patch set does not help atomic alloc at
all, which is clearly a problem we can hit, we just did not hit it this

Alas, I communicated exactly the opposite of what I intended. We do not
like the global dirty limit. It makes the vm complex and fragile,
unnecessarily. We favor an approach that places less reliance on the
global dirty limit so that we can remove some of the fragile and hard

These results do not show that at all, I apologize for not making that
sufficiently clear.

Regards,
Daniel
-
Of course boundless allocations from interrupt / reclaim context will
ultimately crash the system. To fix that you need to stop the networking

The patch is obviously correct because it provides memory where we used to

So far our experience has just been the opposite and Peter's other patches
demonstrate the same. Dirty limits make the VM stable and increase I/O
performance.
-
Trouble is, I don't only need a network layer to not endlessly consume
memory, I need it to 'fully' function so that we can receive the
writeout completion.

Let us define a strict meaning for a few phrases:

  use memory     - an alloc / free cycle where the free is unconditional
  consume memory - an alloc / free cycle where the free is conditional
                   and/or might be delayed for some unspecified time.

Currently networking has two states:

 1) it receives packets and consumes memory
 2) it doesn't receive any packets and doesn't use any memory.

In order to use swap over network you need to operate the network stack
in a bounded memory model (PF_MEMALLOC). So we need a state that:

 - receives packets
 - does NOT consume memory
 - but does use memory - albeit limited.

There are two ways to do this:

 - reserve a specified amount of memory per socket
   (allegedly IRIX has this)

or

 - have a global reserve and selectively serve sockets
   (what I've been doing)

These two models can be seen as the same. There is no fundamental
difference between having various small reserves and one larger that is
carved up using strict accounting.

So, if you will, you can view my approach as a reserve per socket, where
most sockets get a reserve of 0 and a few (those serving the VM) !0.

What part are you disagreeing with or unclear on?
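
The claimed equivalence of the two models can be illustrated with a small
sketch: one global pool whose size is the sum of per-socket quotas, with
most quotas being 0. This is not code from the patch set; the class, the
socket names and the page counts are hypothetical.

```python
class GlobalReserve:
    """One pool of pages, carved up by strict per-socket accounting."""

    def __init__(self, quotas):
        # quotas: socket name -> pages it may hold from the reserve.
        self.quotas = dict(quotas)
        self.used = {name: 0 for name in quotas}

    def alloc(self, sock, pages=1):
        quota = self.quotas.get(sock, 0)       # unknown sockets get 0
        if self.used.get(sock, 0) + pages > quota:
            return False                       # would exceed its share
        self.used[sock] = self.used.get(sock, 0) + pages
        return True

    def free(self, sock, pages=1):
        self.used[sock] -= pages

    def total(self):
        # The global reserve is simply the sum of the individual shares.
        return sum(self.quotas.values())


# A socket serving the VM gets a share, an ordinary socket gets none.
reserve = GlobalReserve({"swap-socket": 4, "web-socket": 0})
print(reserve.total(), reserve.alloc("swap-socket"), reserve.alloc("web-socket"))
```

Seen from a single socket the behaviour is that of a private reserve, while
only one pool of total() pages has to be set aside.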
You need to drop packets after having inspected them right? Why won't
dropping packets after a certain amount of memory has been allocated work?

That is a scalability problem on large systems! Global means global
serialization, cacheline bouncing and possibly livelocks. If we get into
this global shortage then all cpus may end up taking the same locks

Well it looks like you know how to do it. Why not implement it?
-
That puts the burden of tracking skb allocations and all that on the
fast path.

The 'simplicity' of my current approach is that we only start

Dude, breathe, these boxens of yours will never swap over network simply
because you never configure swap.

And, _no_, it does not necessarily mean global serialisation. By simply
saying there must be N pages available I say nothing about on which node
they should be available, and the way the watermarks work they will be

/me confused, I already have!

If you talk about the IRIX model, I'm very hesitant to do that simply
because that would incur the bean-counting overhead on the normal case
and that will greatly upset the network people - nor would that mean
that I don't need this stricter PF_MEMALLOC behaviour.

Agreed. Scalability of emergency swapping reserved is simply
unimportant. Please, let's get swapping to _work_ first, then we can
make it faster.

No, I do not think we'll ever see a livelock on this.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-
Hi Peter,
The term is "highwater mark" not "high watermark". A watermark is an
anti-counterfeiting device printed on paper money. "Highwater" is how
high water gets, which I believe is the sense we intend in Linux.
Therefore any occurrence of "watermark" in the kernel source is a
spelling mistake, unless it has something to do with printing paper
money.

While fixing this entrenched terminology abuse in our kernel source may
be difficult, sticking to the correct English on lkml is quite easy :-)

Regards,
Daniel
-
Global reserve means that any cpuset that runs out of memory may exhaust
the global reserve and thereby impact the rest of the system. The
emergencies that are currently localized to a subset of the system and
may lead to the failure of a job may now become global and lead to the
failure of all jobs running on it.

But Peter mentioned that he has some way of tracking the amount of memory
used in a certain context (beancounter?) which would address the issue.
-
If it does, it is a bug in the reserve accounting. That said, I still
agree with you that per-node reserve is a desirable goal for numa. I
would just like to be clear that it is not necessary, even for numa,
just nice. By all means somebody should be hacking on a numa feature
for per-node emergency reserves, but as far as fixing the immediate,
serious kernel block IO deadlocks goes, it does not matter.

Pavel, I do not agree that efficiency is unimportant on the
under-pressure path. I do not even like to call that the "emergency"
path, because under heavy load it is normal for a machine to spend a
significant fraction of its time in that state. However, the
efficiency goal there does not need to be quite the same as normal
mode.

To illustrate, I would expect to see something like 95% of normal block
IO performance on a numa machine in the case that "emergency" (aka
memalloc memory) is allocated globally instead of locally, thus paying
a (modest compared to the disk transfer itself) penalty for transfer of
disk data over the numa interconnect. 95% of normal throughput on the
block IO path is not a problem: if the machine spends 5% of its time on
the "emergency" (aka memalloc) path, then overall efficiency will be
95% * 100% + 5% * 95% = 99.75%.

Moral of this story: let's get the memory recursion fixes done in the
most obviously correct way and not get distracted by illusory
efficiency requirements for numa, that do not have a big bottom line
impact.

I'm glad to see everybody still interested in these problems. Though we
have been a little quiet on this issue over here for a while, it does
not mean that progress has stopped. In fact, we are testing our
solutions more heavily than ever, and getting closer to a solution that
not only works solidly, but that should enable mass deletion of the
whole creaky notion of dirty page limits in favor of nice, tight
per-device control of in flight write traffic as I have described
previously.

Regards,
Daniel
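
The 99.75% estimate in this message is a time-weighted average: 95% of the
time at full throughput plus 5% of the time at 95% throughput. A minimal
sketch of that arithmetic, using only the figures quoted in the message
(the function name is made up for the example):

```python
def overall_efficiency(fraction_on_slow_path, slow_path_efficiency):
    """Time-weighted throughput relative to always running the normal path."""
    fraction_on_normal_path = 1.0 - fraction_on_slow_path
    return (fraction_on_normal_path * 1.0
            + fraction_on_slow_path * slow_path_efficiency)


# 5% of the time on the memalloc path, which runs at 95% of normal speed.
print(f"{overall_efficiency(0.05, 0.95):.2%}")
```

Even a much slower memalloc path costs little overall as long as the
fraction of time spent on it stays small.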
...
Can you be specific about which changes to existing mainline code were
needed to make recursive reclaim "work" in your tests (albeit less
ideally than peterz's patchset in your view)?

Which changes allowed you to address 1? I had a look at the various
patches you provided (via svn) and it wasn't clear which subset
fulfilled 1 for you. Does it work for all Block IO and not just
specially tuned drivers like ddsnap et al?

regards,
Mike
-
Sorry, I was incommunicado out on the high seas all last week. OK, the
measures that actually prevent our ddsnap driver from deadlocking are:

  - Statically prove bounded memory use of all code in the writeout
    path.

  - Implement any special measures required to be able to make such a
    proof.

  - All allocations performed by the block driver must have access
    to dedicated memory resources.

  - Disable the congestion_wait mechanism for our code as much as
    possible, at least enough to obtain the maximum memory resources
    that can be used on the writeout path.

The specific measure we implement in order to prove a bound is:

  - Throttle IO on our block device to a known amount of traffic for
    which we are sure that the MEMALLOC reserve will always be
    adequate.

Note that the boundedness proof we use is somewhat loose at the moment.
It goes something like "we only need at most X kilobytes of reserve and
there are X megabytes available". Much of Peter's patch set is aimed
at getting more precise about this, but to be sure, handwaving just
like this has been part of core kernel since day one without too many
ill effects.

The way we provide guaranteed access to memory resources is:

  - Run critical daemons in PF_MEMALLOC mode, including
    any userspace daemons that must execute in the block IO path
    (cluster coders take note!)

Right now, all writeout submitted to ddsnap gets handed off to a daemon
running in PF_MEMALLOC mode. This is a needless inefficiency that we
want to remove in future, and handle as many of those submissions as
possible entirely in the context of the submitter. To do this, further
measures are needed:

  - Network writes performed by the block driver must have access to
    dedicated memory resources.

We have not yet managed to trigger network read memory deadlock, but it
is just a matter of time, additional fancy virtual block devices, and
enough stress. So:

  - Network reads need some fancy extr...
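
The loose boundedness argument in this message ("at most X kilobytes
needed, X megabytes available") amounts to a simple inequality once the
throttle fixes the number of requests in flight. The sketch below only
illustrates that check; the function, the in-flight limit and the per-bio
worst case are hypothetical numbers, not measurements from ddsnap.

```python
def reserve_is_adequate(max_in_flight_bios, worst_case_bytes_per_bio,
                        reserve_bytes):
    """Return (adequate, worst_case_bytes) for a throttled writeout path.

    Assumption: the throttle guarantees at most max_in_flight_bios are
    outstanding, and each needs at most worst_case_bytes_per_bio.
    """
    worst_case = max_in_flight_bios * worst_case_bytes_per_bio
    return worst_case <= reserve_bytes, worst_case


# Made-up example: 128 bios in flight, 8 KiB each at worst, 4 MiB reserve.
ok, needed = reserve_is_adequate(128, 8 * 1024, 4 * 1024 * 1024)
print(ok, needed)
```

Without a throttle max_in_flight_bios has no upper limit, so no finite
reserve can satisfy the inequality.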
Hope you enjoyed yourself. First off, as always thanks for the
extremely insightful reply.

To give you context for where I'm coming from; I'm looking to get NBD
to survive the mke2fs hell I described here:

Once the memory requirements of a userspace daemon (e.g. nbd-server)
are known; should one mlockall() the memory similar to how is done in
heartbeat daemon's realtime library?

Bigger question for me is what kind of hell am I (or others) in for to
try to cap nbd-server's memory usage? All those glib-gone-wild
changes over the recent past feel problematic but I'll look to work

Would peter's per bdi dirty page accounting patchset provide this? If
not, what steps are you taking to disable this mechanism? I've found
that nbd-server is frequently locked with 'blk_congestion_wait' in its

I've embraced Evgeniy's bio throttle patch on a 2.6.22.6 kernel
http://thread.gmane.org/gmane.linux.network/68021/focus=68552

But are you referring to that (as you did below) or is this more a

I've been using Avi Kivity's patch from some time ago:
http://lkml.org/lkml/2004/7/26/68

to get nbd-server to run in PF_MEMALLOC mode (could've just used
the _POSIX_PRIORITY_SCHEDULING hack instead right?)... it didn't help
on its own; I likely didn't have enough of the stars aligned to see my

I assume peterz's network deadlock avoidance patchset (or some subset

OK, yes I've included Christoph's recursive reclaim patch and didn't
have any luck either. Good to know that patch isn't _really_ going to

I've been working off-list (with Evgeniy's help!) to give the bio
throttling patch a try. I hacked MD (md.c and raid1.c) to limit NBD
members to only 10 in-flight IOs. Without this throttle I'd see up to
170 IOs on the raid1's nbd0 member; with it the IOs hold fairly
constant at ~16. But this didn't help my deadlock test either. Also,
throttling in-flight IOs like this feels inherently sub-optimal. Have
you taken any steps to make the 'bio-limit' dynamic in some way?

Anyway, I'm th...
On Mon, 17 Sep 2007 23:27:25 -0400 "Mike Snitzer" <snitzer@gmail.com>
BDI should be back in -mm, for the other it's in shambles atm, I'll tell
you where to find it when I've put it back together.

I should get myself some time to read on how to push relative git

NBD has some serious block layer issues, I once talked with Jens about
it and he explained what needed to be done to get NBD back in
shape again, but I could not be bothered to spend time on it.
[ and have since forgotten most of the details :-/ ]

For me NBD is dead and broken beyond repair, it needs a wholesale
rewrite.
-
(Reposted for completeness. Previously rejected by vger due to
accidental send as html mail. CC's except for Mike and vger deleted)

The dread blk_congestion_wait is biting you hard. We're very familiar
with the feeling. Congestion_wait is basically the traffic cop that
implements the dirty page limit. I believe it was conceived as a
method of fixing writeout deadlocks, but in our experience it does not
help, in fact it introduces a new kind of deadlock
(blk_congestion_wait) that is much easier to trigger. One of the
things we do to get ddsnap running reliably is disable congestion_wait
via the PF_LESS_THROTTLE hack that was introduced to stop local NFS
clients from deadlocking. NBD will need a similar treatment.

Actually, I hope to show quite soon that dirty page limiting is not
needed at all in order to prevent writeout deadlock. In which case we
can just get rid of the dirty limits and go back to being able to use
all of non-reserve memory as a write cache, the way things used to be
in the days of yore.

It has been pointed out to me that congestion_wait not only enforces
the dirty limit, it controls the balancing of memory resources between
slow and fast block devices. The Peterz/Phillips approach to deadlock
prevention does not provide any such balancing and so it seems to me
that congestion_wait is ideally situated in the kernel to provide that
missing functionality. As I see it, blk_congestion_wait can easily be
modified to balance the _rate_ at which cache memory is dirtied for
various block devices of different speeds. This should turn out to
be less finicky than balancing the absolute ratios, after all you can
make a lot of mistakes in rate limiting and still not deadlock so long
as dirty rate doesn't drop to zero and stay there for any block
device. Gotta be easy, hmm?

Please note: this plan is firmly in the category of speculation until
we have actually tried it and have patches to show, but I thought that
now is about the right time to say somethin...
First of all, I'm not surprised these patches solve the deadlock here.
And that's a good thing, and it means it is likely that we want to merge
it (actually, I quite like the idea in general, regardless of whether it
solves the deadlock or not).

However I really have an aversion to the near enough is good enough way of
thinking. Especially when it comes to fundamental deadlocks in the VM. I
don't know whether Peter's patch is completely clean yet, but fixing the
fundamentally broken code has my full support.

I hate it that there are theoretical bugs still left even if they would
be hit less frequently than hardware failure. And that people are really
happy to put even more of these things in :(

Anyway, as you know I like your patch and if that gives Peter a little
more breathing space then it's a good thing. But I really hope he doesn't
give up on it, and it should be merged one day.
-
Uhh. There are already numerous other issues why the VM is failing that is

Theoretical bugs? Depends on one's creativity to come up with them I
guess. So far we do not even get around to address the known issues and

Using the VM to throttle networking is a pretty bad thing because it
assumes a single critical user of memory. There are other consumers of
memory and if you have a load that depends on other things than networking
then you should not kill the other things that want memory.
-
The VM is a _critical_ user of memory. And I dare say it is the _most_
important user.

Every user of memory relies on the VM, and we only get into trouble if
the VM in turn relies on one of these users. Traditionally that has only
been the block layer, and we special cased that using mempools and
PF_MEMALLOC.

Why do you object to me doing a similar thing for networking?

The problem of circular dependencies on and with the VM is rather
limited to kernel IO subsystems, and we only have a limited amount of
them.

You talk about something generic, do you mean an approach that is
generic across all these subsystems?

If so, my approach would be it, I can replace mempools as we have them
with the reserve system I introduce.
The users of memory are various subsystems. The VM itself of course also
uses memory to manage memory but the important thing is that the VM

I have not seen you using mempools for the networking layer. I would not

The kernel has to use the filesystems and other subsystems for I/O. These
subsystems compete for memory in order to make progress. I would not
strictly consider them part of the VM. The kernel reclaim may trigger I/O

Yes an approach that is fair and does not allow one single subsystem to

Replacing the mempools for the block layer sounds pretty good. But how do
these various subsystems that may live in different portions of the system
for various devices avoid global serialization and livelock through your
system? And how is fairness addressed? I may want to run a fileserver on
some nodes and an HPC application that relies on a fiberchannel connection
on other nodes. How do we guarantee that the HPC application is not
impacted if the network services of the fileserver flood the system with
messages and exhaust memory?
-
Exactly, and because it services every other subsystem and userspace,

Dude, listen, how often do I have to say this: I cannot use mempools for
the network subsystem because it's built on kmalloc! What I've done is
build a replacement for mempools - a reserve system - that does work
similar to mempools but also provides the flexibility of kmalloc.

I'm confused by this, I've never claimed part of, or such a thing. All
I'm saying is that because of the circular dependency between the VM and
the IO subsystem used for swap (not file backed paging [*], just swap)
you have to do something special to avoid deadlocks.

[*] the dirty limit along with 'atomic' swap ensures that file backed

I do no such thing! My reserve system works much like mempools, you

The reserves are spread over all kernel mapped zones, the slab allocator
is still per cpu, the page allocator tries to get pages from the nearest

The network system reserves A pages, the block layer reserves B pages,
once they start getting pages from the reserves they go bean counting,
once they reach their respective limit they stop.

The serialisation impact of the bean counting depends on how
fine-grained you place them, currently I only have a machine wide
network bean counter because the network subsystem is machine wide -
initially I tried to do something per net-device but that doesn't work
out. If someone more skilled in this area comes along and sees a better
way to place the bean counters they are free to do so.

But do notice that the bean counting is only done once we hit the
reserves, the normal mode of operation is not penalised by the extra
overhead thereof.

Also note that mempools also serialise their access once the backing
allocator fails, so I don't differ from them in that respect either.
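
The bean counting described above (no accounting on the normal path, a
per-subsystem limit once allocations dip into the reserves) can be sketched
roughly like this. It is an illustration of the idea only, not the reserve
code from the patch set; the watermark test, the names and the numbers are
assumptions made for the example.

```python
class BeanCounter:
    """Per-subsystem limit on pages taken from the emergency reserve."""

    def __init__(self, limit_pages):
        self.limit = limit_pages
        self.charged = 0

    def charge(self, pages):
        if self.charged + pages > self.limit:
            return False          # subsystem has used up its reserve share
        self.charged += pages
        return True

    def uncharge(self, pages):
        self.charged -= pages


def alloc_pages(pages, free_pages, watermark, counter):
    """Return (ok, reserve_path) for one allocation attempt.

    Above the watermark the allocation succeeds without any accounting;
    only below it is the subsystem's bean counter consulted.
    """
    if free_pages - pages >= watermark:
        return True, False
    return counter.charge(pages), True


network = BeanCounter(limit_pages=2)          # "A pages" for networking
print(alloc_pages(1, free_pages=100, watermark=10, counter=network))
print(alloc_pages(1, free_pages=10, watermark=10, counter=network))
```

The first call never touches the counter, which is the point made about
the normal mode of operation not being penalised.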
It's different since it becomes a privileged player that can suck all

How are dirty file backed pages different? They may also be written out

But it seems that you have unbounded allocations with PF_MEMALLOC now for

That sounds good.
-
No, each reserve user comes with a bean-counter that will limit the

when you have dirty file backed pages, the rest of the memory can only
consist of clean file pages and/or anonymous pages - due to the dirty
limit. If you can guarantee that swap doesn't use memory (well, it does,
but it's PF_MEMALLOC memory that cannot be used by others) then you can
always free memory by dropping clean pages or swapping out. And thus
make progress for file based writeback.

No, networking will beancount all PF_MEMALLOC memory it receives, and
stop allocating once it hits its limit. It knows that when it has than

Ok, so next time I'll post the whole series again - I know some people
found it too much - but that way you can see the bean counter.
-
I don't know what your point is? We either ignore it, or try to fix things

Implementation issues aside, the problem is there and I would like to
see it fixed regardless if some/most/or all users in practice don't
hit it.
-
I am all for fixing the problem but the solution can be much simpler and
more universal. F.e. the amount of tcp data in flight may be controlled
via some limit so that other subsystems can continue to function even if
we are overwhelmed by network traffic. Peter's approach establishes the
limit by failing PF_MEMALLOC allocations. If that occurs then other
subsystems (like the disk, or even fork/exec or memory management
allocation) will no longer operate since their allocations no longer
succeed which will make the system even more fragile and may lead to
subsequent failures.
-
You're saying we shouldn't fix an out of memory deadlock because
that might result in ENOMEM errors being returned, rather than the
system locking up?
-
With swap over network you need not only protect other subsystems from
networking, but you also have to guarantee networking will in some form

I'm not failing PF_MEMALLOC allocations. I'm more stringent in failing !

Failing allocations should never be a stability problem, we have the
fault-injection framework which allows allocations to fail randomly -
this should never crash the kernel - if it does it's a BUG.
All right maybe you can get the kernel to be stable in the face of having
no memory and debug all the fallback paths in the kernel when an OOM
condition occurs.

But system calls will fail? Like fork/exec? etc? There may be daemons
running that are essential for the system to survive and that cannot
easily take an OOM condition? Various reclaim paths also need memory and
if the allocation fails then reclaim cannot continue.
-
I'm not making any of these paths significantly more likely to occur
than they already are. Lots and lots of users run swap heavy loads day
in day out - they don't get funny systems (well sometimes they do, and
theoretically we can easily run out of the PF_MEMALLOC reserves -
HOWEVER in practice it seems to work quite reliably).
The patchset increases these failures significantly since there will be a
longer time period where these allocations can fail.

The swap loads are fine as long as we do not exhaust the reserve pools.
IMHO the right solution is to throttle the networking layer to not do
unbounded allocations. You can likely do this by checking certain VM
counters like SLAB_UNRECLAIMABLE. If need be we can add a new category of
SLAB_TEMPORARY for temporary allocs and track these. If they get too large
then throttle.
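
The counter-based heuristic proposed here can be illustrated from user
space. The sketch below does not touch kernel internals: it assumes the
SUnreclaim line of /proc/meminfo as a stand-in for the unreclaimable slab
counter, and the threshold and sample values are made up. SLAB_TEMPORARY is
only proposed in this message and is therefore not modelled.

```python
def slab_unreclaimable_kb(meminfo_text):
    """Extract the SUnreclaim value (in kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("SUnreclaim:"):
            return int(line.split()[1])
    return None


def should_throttle(meminfo_text, limit_kb):
    """Heuristic: throttle once unreclaimable slab exceeds limit_kb."""
    value = slab_unreclaimable_kb(meminfo_text)
    return value is not None and value > limit_kb


# Made-up sample in the /proc/meminfo layout.
SAMPLE = """MemTotal:        8000000 kB
Slab:             300000 kB
SReclaimable:     200000 kB
SUnreclaim:       100000 kB
"""
print(should_throttle(SAMPLE, limit_kb=50000))
```

On a live system the text would come from reading /proc/meminfo; being a
heuristic, the limit has to be chosen by hand.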
-
And I'm working hard to guarantee the additional logic does not exhaust

I'm utterly confused as to why you propose all these heuristics when I
have a perfectly good solution that is exact.

Filesystems (most of them) that require complicated allocations at
writeout time suck. That said, especially with network ones, it
seems like making them preallocate or reserve required memory isn't
progressing very smoothly. I think these patchsets are definitely
worth considering as an alternative.
-
Honestly, I don't. They very much do not solve the problem, they just
displace it.

Christoph's suggestion to set min_free_kbytes to 20% is ridiculous - nor

Please do ponder the problem and its proposed solutions, because I'm
going crazy here.

The problem with networked swap is:

TX

 - we need some memory to initiate writeout
 - writeout needs to be throttled in order to make this bounded

   (currently sort-of done by throttle_vm_writeout() - but Evgeniy and
   Daniel Phillips are working on a more generic approach)

RX

 - we basically need infinite memory to receive the network reply
   to complete writeout. Consider the following scenario:

   3 machines, A, B, C;

     A: * networked swap
        * networked service

     B: * client for networked service

     C: * server for networked swap

   C becomes unreachable/slow for a while
   B sends massive amounts of traffic A-wards
   A consumes all memory with non-critical traffic from B and wedges

 - so we need a threshold of some sorts to start tossing non-critical
   network packets away. (because the consumer of these packets may be
   the one swapping and is therefore frozen)

 - we also need to ensure memory doesn't fragment too badly during the
   receiving -> tossing phase. Otherwise we might again wedge due to OOM

and then there is a TCP specific deadlock: TCP has a global limit on
the amount of skb memory that can be in socket receive queues. Once we
hit this limit with non-critical data (because the consumers are waiting
on swap) all further packets will be tossed and we'll never receive C's
completion

Now my solution was to have a reserve just big enough to fit:

 - TX
 - RX (large enough to overflow the IP fragment reassembly)

that way, whenever we receive a packet and find we need the reserve
to back this packet we must only use this for critical services.
(this provides the threshold previously mentioned)

we then process the packet until socket demux (where the sk...
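
The receive-side deadlock in the A/B/C scenario, and the effect of a
reserve that only critical traffic may use, can be shown with a toy model.
This is a sketch under simplifying assumptions, not the actual network
code: packets are (name, size, critical) tuples, queued data is never
consumed because the consumers are blocked on swap, and all sizes are
invented.

```python
def receive(packets, global_limit, reserve=0):
    """Return the names of packets accepted into receive queues.

    packets: list of (name, size, critical) tuples in arrival order.
    Queued data is never drained, modelling consumers blocked on swap.
    """
    queued = 0            # bytes held under the shared (global) limit
    reserve_used = 0      # bytes backed by the emergency reserve
    accepted = []
    for name, size, critical in packets:
        if queued + size <= global_limit:
            queued += size
            accepted.append(name)
        elif critical and reserve_used + size <= reserve:
            reserve_used += size      # only critical traffic may use this
            accepted.append(name)
        # otherwise the packet is tossed
    return accepted


# B floods A with non-critical data, then C's swap completion arrives.
traffic = [("B-%d" % i, 100, False) for i in range(5)]
traffic.append(("C-completion", 100, True))

print(receive(traffic, global_limit=300))
print(receive(traffic, global_limit=300, reserve=100))
```

Without a reserve the completion from C is tossed along with the excess
traffic from B; with even a small reserve for critical sockets it gets
through.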
Well perhaps it doesn't work for networked swap, because dirty accounting
doesn't work the same way with anonymous memory... but for _filesystems_,
right?

I mean, it intuitively seems like a good idea to terminate the recursive
allocation problem with an attempt to reclaim clean pages rather than
immediately let them have-at our memory reserve that is used for other
things as well. Any and all writepage() via reclaim is allowed to eat
into all of memory (I hate that writepage() ever has to use any memory,
and have prototyped how to fix that for simple block based filesystems

Well of course it doesn't, but it is a pragmatic way to reduce some
memory depletion cases. I don't see too much harm in it (although I didn't

Well yeah I think you simply have to reserve a minimum amount of memory in
order to reclaim a page, and I don't see any other way to do it other than
what you describe to be _technically_ deadlock free.

But firstly, you don't _want_ to start dropping packets when you hit a tough
patch in reclaim -- even if you are strictly deadlock free. And secondly,
I think recursive reclaim could reduce the deadlocks in practice which is
not a bad thing as your patches aren't merged.

How are your deadlock patches going anyway? AFAIK they are mostly a network
issue and I haven't been keeping up with them for a while. Do you really need
networked swap and actually encounter the deadlock, or is it just a question of

Although you will quite likely have at least a couple of MB worth of
clean program text. The important part of recursive reclaim is that it
doesn't so easily allow reclaim to blow all memory reserves (including
interrupt context). Sure you still have theoretical deadlocks, but if
I understand correctly, they are going to be lessened. I would be
really interested to see if even just these recursive reclaim patches

Thanks!
-
I'm concerned about the worst case scenarios, and those don't change.
The proposed changes can be seen as an optimisation of various things,Sure, and on that note I don't object to them, they might be quite
Right, and I guess I have to go at it again, this time ensuring not to
touch the fast-path nor sacrificing anything NUMA for simplicity in the
reclaim path.(I think its a good thing to be technically deadlock free - and if your
work on the fault path rewrite and buffered write rework shows anythingNon of the people who have actually used these patches seem to object to
the dropping packets thing. Nor do I see that as a real problem,
networks are assumed lossy - also if you really need that traffic for a
RT app that also runs on the machine you need networked swap on (odd
combination but hey, it should be possible) then I can make that work as
well with a little bit more effort.Also, I'm a very reluctant to accept a known deadlock, esp. since the
They really do rely on some VM interaction too, network does not have
enough information to break out of the deadlock on its own.As for how its going, it seems to work quite reliably in my test setup -
that is, I can shut down the NFS server, swamp the client in network
traffic for hours (yes it will quickly stop userspace) and then restart
the NFS server and the client will reconnect and resume operation.There are also a few people running various versions of my patches in
production environments. One university is running it on a 500-node
cluster and another on ~500 thin-clients and there is someone using it

Yes (we - not I personally) want networked swap. There is quite the
demand for it in the market. It allows clusters and blades to be built
without any storage - which not only saves on the initial cost of a hard
drive [1] but also on maintenance but more importantly on energy cost
and heat production.

[1] a single drive is not that expensive, but when you're talking about
were we much bothered by the buf...
No, although it sounded like you didn't see any use in these patches.
Which would be true if you're just looking at solving the theoretical deadlocks,
but I just think they might be worth looking at to practically solve some
of them and just give better reclaim behaviour in general (but in saying

I do of course. There is one thing to have a real lock deadlock
in some core path, and another to have this memory deadlock in a
known-to-be-dodgy configuration (Linus said last year that he didn't
want to go out of our way to support this, right?)... But if you can
solve it without impacting fastpaths etc. then I don't see any

I don't mean for correctness, but for throughput. If you're doing a
lot of network operations right near the memory limit, then it could
be possible that these deadlock paths get triggered relatively often.

The thing I don't much like about your patches is the addition of more
of these global reserve type things in the allocators. They kind of
suck (not your code, just the concept of them in general -- ie. including
the PF_MEMALLOC reserve). I'd like to eventually reach a model where
reclaimable memory from a given subsystem is always backed by enough
resources to be able to reclaim it. What stopped you from going that
route with the network subsystem? (too much churn, or something

As a general statement, I agree of course ;)
-
That sounds very right aside from the global reserve. A given subsystem
may exist in multiple instances and serve sub partitions of the system.
F.e. there may be a network card on node 5 and a job running on nodes 3-7
and another network card on node 15 with the corresponding nodes 13-17
doing I/O through it.
-
[ now with CCs ]
That has been my intention, getting the problem solved without touching
Christoph's patches all rely on file backed memory being predominant.
[ and to a certain degree fully ignore anonymous memory loads :-( ]

Whereas quite a few realistic loads strive to minimise these - I'll
again fall back to my MPI cluster example, they would want to use so
much anonymous memory to perform their calculations that everything
except the hot paths of code are present in memory. In these scenarios 1

I'm wanting to keep the patches as non-intrusive as possible, exactly
because some people consider this a fringe functionality. Doing as you
say does sound like a noble goal, but would require massive overhauls.

Also, I'm not quite sure how this would apply to networking. It
generally doesn't have much reclaimable memory sitting around, and it
heavily relies on kmalloc so an alloc/free cycle accounting system would
quickly involve a lot of the things I'm already doing.

(also one advantage of keeping it all in the buddy allocator is that it
can more easily form larger order pages)
OK, I don't know exactly about MPI workloads. But I mean a few basic
things like the C and MPI libraries could already be quite big before
you even consider the application text (OK it won't be all paged in).

Maybe it won't be enough, but I think some form of recursive reclaim
will be better than our current scheme. Even assuming your patches are
in the kernel, don't you think it is a good idea to _not_ have potentially

But the code would end up better, wouldn't it? And it could be done
It wouldn't use reclaimable memory as such, but would have some small
amounts of reserve memory for allocating all those things required to
get a response from critical sockets. NBD for example would also then
be sure to reserve enough memory to at least clean one page etc. That's
the way the block layer has gone, which seems to be pretty good and I

I don't know if that is a really good advantage. The amount of memory
involved should just be pretty small. I mean it is an advantage, but
there are other disadvantages (imagine the mess if other subsystems used
their own global reserves in the allocator rather than mempools etc). I
don't see why networking is fundamentally more deserving of its own pools
in the allocator than anybody else.
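
The per-subsystem reserve described in this message (each subsystem keeps just enough memory set aside to make forward progress, rather than sharing one global reserve in the allocator) can be sketched as a toy model. This is only an illustration in Python, not kernel code: `ReservePool`, `page_alloc` and the `state` flag are hypothetical names made up for the example, and the behaviour is modelled loosely on how mempools are generally described (try the normal allocator first, fall back to the reserve, refill the reserve on free).

```python
class ReservePool:
    """Toy per-subsystem reserve, loosely modelled on a mempool."""

    def __init__(self, min_nr, alloc_fn):
        self.min_nr = min_nr        # number of objects kept in reserve
        self.alloc_fn = alloc_fn    # general allocator; returns None on failure
        self.reserve = []
        # Fill the reserve up front, while memory is still plentiful.
        for _ in range(min_nr):
            obj = alloc_fn()
            if obj is None:
                raise MemoryError("cannot fill reserve")
            self.reserve.append(obj)

    def alloc(self):
        # Try the general allocator first so the reserve stays intact.
        obj = self.alloc_fn()
        if obj is not None:
            return obj
        # General allocator failed: fall back to the reserve.
        if self.reserve:
            return self.reserve.pop()
        return None  # caller has to wait until something is freed

    def free(self, obj):
        # Refill the reserve before giving memory back to the system.
        if len(self.reserve) < self.min_nr:
            self.reserve.append(obj)
        # Otherwise the object would go back to the general allocator.


# Hypothetical allocator that can be switched into an "out of memory" state.
state = {"exhausted": False}

def page_alloc():
    return None if state["exhausted"] else bytearray(16)

pool = ReservePool(2, page_alloc)
state["exhausted"] = True   # simulate memory pressure
a = pool.alloc()            # served from the reserve
b = pool.alloc()            # served from the reserve
c = pool.alloc()            # reserve empty, allocation fails
pool.free(a)                # refills the reserve
d = pool.alloc()            # succeeds again, reusing the freed object
```

The point of the design is that the amount reserved is bounded and owned by the subsystem that needs it, so one subsystem cannot drain what another needs to make progress.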
-
Buffered write deadlock? How does that exactly occur? Memory allocation in
the writeout path while we hold locks?

There are many worst case scenarios in the current reclaim implementation
that are not addressed and we so far have not addressed these because the
code is very sensitive and it is not clear that the complexity introduced
by these changes is offset by the benefits gained.
-
Different topic. Peter was talking about the write(2) write deadlock
where we take a page fault while holding a page lock (which leads to
lock inversion, taking the lock twice etc.)
-
Regular reclaim also cannot immediately write out pages. Writes are
usually deferred. If you have too many anonymous pages in regular reclaim
then you can have the same issues.

The difference is that recursive reclaim does not trigger writeout at
the moment but we could address that by having a pageout list that then
starts writes from another context. Then both reclaims would be able to
trigger writeout.
-
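
As a rough illustration of the pageout-list idea in the message ending just above: recursive reclaim frees only clean pages, and any dirty page it meets is queued so that a different context can start the write later. The sketch below is a toy model in Python, not the patchset's code; `Page`, `recursive_reclaim` and `flush_pageout` are hypothetical names, and "writeout" is reduced to clearing a flag.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Page:
    ident: int
    dirty: bool


def recursive_reclaim(lru, pageout, want):
    """Free up to `want` clean pages, scanning from the cold end of `lru`.

    Dirty pages are never written here; they are moved to `pageout`
    so that another context can start the writes.
    """
    freed = []
    while lru and len(freed) < want:
        page = lru.popleft()
        if page.dirty:
            pageout.append(page)    # defer: no writeout from this context
        else:
            freed.append(page)      # clean page, can be dropped immediately
    return freed


def flush_pageout(pageout, lru):
    """Runs in a different context: write queued pages, making them clean."""
    cleaned = 0
    while pageout:
        page = pageout.popleft()
        page.dirty = False          # stands in for a completed write
        lru.append(page)            # clean again, reclaimable next time
        cleaned += 1
    return cleaned


lru = deque([Page(0, True), Page(1, False), Page(2, True),
             Page(3, False), Page(4, False)])
pageout = deque()

freed = recursive_reclaim(lru, pageout, want=2)
freed_ids = [p.ident for p in freed]        # clean pages that were freed
queued_ids = [p.ident for p in pageout]     # dirty pages deferred for writeout
cleaned = flush_pageout(pageout, lru)       # later, from another context
```

In this model the reclaim path never blocks on I/O, which is the property the thread cares about; whether a real pageout list would behave well under load is exactly what is being debated.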
Only if min_free_kbytes is really the minimum number of free pages and not
the minimum number of clean pages as I suggested.

All deadlocks? There are numerous ones that can come about for different
There is no infinite memory. At some point you need to bound the amount
In the general case this is true even for an MPI job because the MPI job
needs to have executable code and libraries in memory. At minimum these

It is workable. If you crank the min_clean_pages (this is essentially
what it is) up to 20% then you basically reserve 20% of your memory for
executable pages and page cache pages. And in an emergency these can be
reclaimed to resolve any OOM issues. Note that my patch only accesses

But that is an issue that is better handled in the network stack.
-
A minimum enforced reclaimable non dirty threshold wouldn't be
that ridiculous though. So the memory could be used, just not
for dirty data.

His patchkit essentially turns the GFP_ATOMIC requirements
from free to easily reclaimable. I see that as a general improvement.

I remember sct talked about this many years ago and it's still
a good idea.

-Andi
-
Sure, and note that various patches to such an effect have already been
posted (even one by myself), they introduce a third reclaim list on
which clean pages live. If you add to that a requirement to keep that
list at a certain level, one could replace part (or all) of the reserves
with that.

But that is more an optimisation rather than anything else.
The thing I strongly objected to was the 20%.
Also his approach misses the threshold - the extra condition needed to
break out of the various network deadlocks. There is no point that says
- ok, and now we're in trouble, drop anything non-critical. Without that

That is his second patch-set, and I do worry about the irq latency that
that will introduce. It very much has the potential to ruin everything
that cares about interactiveness or latency.

Hence my suggestion to look at threaded interrupts, in which case it
would only ruin the latency of the interrupt that does this, but does
not hold off other interrupts/processes. Granted PI would be nice to
ensure the threaded handler does eventually finish.
Well then set it to 10%. We have min_free_kbytes now and so we are used
Where is the patchset introducing additional latencies? Most of the time
it only saves and restores flags. We already enable and disable interrupts
in the reclaim path but we assume that interrupts are always enabled when
we enter reclaim.
-
I proposed a way to avoid increasing interrupt latency
in a simple way.

-Andi
-
No it doesn't. All memory can be tied up by anonymous pages - who are
-
Ok but that could be addressed by making sure that a certain portion of
memory is reserved for clean file backed pages.
-
Which gets us back to the initial problem of sizing this portion and
ensuring it is big enough to service the need.
-
Clean file backed pages dominate memory on most boxes. They can be
calculated by NR_FILE_PAGES - NR_FILE_DIRTY

On my 2G system that is

Cached: 1731480 kB
Dirty: 424 kB

So for most loads the patch as is will fix your issues. The problem arises
if you have extreme loads that are making the majority of pages anonymous.

We could change min_free_kbytes to specify the number of free + clean
pages required (if we can do atomic reclaim then we do not need it
anymore). Then we can specify a large portion of memory for
min_free_kbytes. 20%? That would give you 400M on my box which would
certainly suffice.

If the amount of clean file backed pages falls below that limit then do
the usual reclaim. If we write anonymous pages out to swap then they
can also become clean and reclaimable.
-
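
The arithmetic in the last message can be worked through with the figures it quotes. This is a small sketch only: it treats the `Cached` value as a stand-in for NR_FILE_PAGES and assumes the "2G system" has exactly 2 GiB of memory, and `needs_regular_reclaim` is a hypothetical helper illustrating the proposed rule (fall back to the usual reclaim once clean file-backed memory drops below the reserve), not a function from the patchset.

```python
# Figures quoted in the message (kB).
CACHED_KB = 1731480             # stand-in for NR_FILE_PAGES (an assumption)
DIRTY_KB = 424                  # stand-in for NR_FILE_DIRTY
TOTAL_KB = 2 * 1024 * 1024      # "2G system", assumed to be exactly 2 GiB


def clean_file_kb(file_kb, dirty_kb):
    """Clean file-backed memory: file pages minus dirty file pages."""
    return file_kb - dirty_kb


def reserve_kb(total_kb, percent):
    """Size of a free + clean reserve given as a percentage of memory."""
    return total_kb * percent // 100


def needs_regular_reclaim(clean_kb, total_kb, percent):
    """Hypothetical rule: regular reclaim kicks in below the reserve."""
    return clean_kb < reserve_kb(total_kb, percent)


clean_kb = clean_file_kb(CACHED_KB, DIRTY_KB)
limit_kb = reserve_kb(TOTAL_KB, 20)

print(clean_kb)     # 1731056 kB of clean file-backed memory
print(limit_kb)     # 419430 kB, i.e. roughly the "400M" in the message
print(needs_regular_reclaim(clean_kb, TOTAL_KB, 20))
```

On these numbers the box sits far above a 20% reserve. The open question from earlier in the thread remains how the same check behaves on a load where most pages are anonymous.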
