Ok, that's reasonable. We already handle this case indirectly in -mm.
oom-give-current-access-to-memory-reserves-if-it-has-been-killed.patch in
-mm makes the oom killer set TIF_MEMDIE for current and return without
killing any other task; it's unnecessary to check if
!test_thread_flag(TIF_MEMDIE) before that since the oom killer will be a
no-op anyway if there exist TIF_MEMDIE threads.
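To make that concrete, here is a rough user-space model of the behavior described above. It is a sketch, not the -mm patch itself: struct task, its fields, the enum and oom_kill_model() are names made up for illustration, "largest rss" stands in for the real badness heuristic, and the ordering of the two checks is my assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a task; this is not the kernel's task_struct. */
struct task {
	bool fatal_signal_pending;	/* models fatal_signal_pending(task) */
	bool memdie;			/* models the TIF_MEMDIE thread flag */
	bool killed;			/* set when the model "kills" the task */
	unsigned long rss;		/* stand-in for the badness heuristic */
};

enum oom_result {
	OOM_NOOP,			/* a TIF_MEMDIE task already exists */
	OOM_CURRENT_GETS_RESERVES,	/* current was already dying */
	OOM_KILLED_OTHER,		/* tasklist scan picked a victim */
	OOM_NOTHING_TO_KILL,
};

static enum oom_result oom_kill_model(struct task *tasks, size_t n,
				      struct task *current)
{
	struct task *victim = NULL;
	size_t i;

	assert(current != NULL);

	/*
	 * current has been killed already: give it access to memory
	 * reserves and return without killing anything else.  Setting the
	 * flag a second time is harmless, so no !memdie test is needed.
	 */
	if (current->fatal_signal_pending) {
		current->memdie = true;
		return OOM_CURRENT_GETS_RESERVES;
	}

	/* Tasklist scan: any existing TIF_MEMDIE task makes this a no-op. */
	for (i = 0; i < n; i++) {
		if (tasks[i].memdie)
			return OOM_NOOP;
		if (!victim || tasks[i].rss > victim->rss)
			victim = &tasks[i];
	}

	if (!victim)
		return OOM_NOTHING_TO_KILL;

	/* Kill the biggest memory user and let it use the reserves. */
	victim->fatal_signal_pending = true;
	victim->memdie = true;
	victim->killed = true;
	return OOM_KILLED_OTHER;
}
```

In this model a second call made on behalf of a different task returns OOM_NOOP for as long as the first victim still has its flag set.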
The problem is that the should_alloc_retry() logic isn't checked when the
oom killer is called; we immediately retry instead, even if the oom killer
didn't do anything. So if the oom killed task fails to exit because it's
looping in the page allocator, that loop is going to continue forever since
reclaim has failed and the oom killer can't kill anything else (or it's
__GFP_NOFAIL and __alloc_pages_may_oom() will infinitely loop without ever
returning).
I guess this could potentially deplete memory reserves if too many threads
have fatal signals and the oom killer is constantly invoked, regardless of
whether __GFP_NOFAIL is set. That's why we have always opted to kill a memory
hogging task instead via a tasklist scan: we want to set TIF_MEMDIE for as
few tasks as possible with a large upside of memory freeing.
I'm wondering if we should check should_alloc_retry() first; it seems like
we could get rid of a few different branches in the oom killer path by
doing so: the comparisons to PAGE_ALLOC_COSTLY_ORDER, __GFP_NORETRY, etc.
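As a sketch of what "check should_alloc_retry() first" might look like, here is a user-space model. The retry rules are written from my reading of mm/page_alloc.c of this era and should be checked against the tree. The MODEL_* flag values are made up and are not the kernel's bit values, MODEL_COSTLY_ORDER assumes PAGE_ALLOC_COSTLY_ORDER is 3, and may_call_oom_killer_model() is a hypothetical helper.

```c
#include <assert.h>
#include <stdbool.h>

/* Made-up flag bits for the model; the kernel's gfp bits differ. */
#define MODEL_GFP_NORETRY	0x1u
#define MODEL_GFP_REPEAT	0x2u
#define MODEL_GFP_NOFAIL	0x4u
#define MODEL_COSTLY_ORDER	3u	/* assumed PAGE_ALLOC_COSTLY_ORDER */

static bool should_alloc_retry_model(unsigned int gfp, unsigned int order,
				     unsigned long pages_reclaimed)
{
	assert(order < 8 * sizeof(unsigned long));

	/* Do not loop if the caller specifically asked us not to. */
	if (gfp & MODEL_GFP_NORETRY)
		return false;

	/* Low orders are always retried. */
	if (order <= MODEL_COSTLY_ORDER)
		return true;

	/* Costly orders retry while reclaim has not yet freed 2^order pages. */
	if ((gfp & MODEL_GFP_REPEAT) && pages_reclaimed < (1UL << order))
		return true;

	/* Costly orders loop only when the caller insists. */
	if (gfp & MODEL_GFP_NOFAIL)
		return true;

	return false;
}

/*
 * Hypothetical gate: only enter the oom killer path when a retry would
 * be allowed afterwards, so that path needs no order or __GFP_NORETRY
 * comparisons of its own.
 */
static bool may_call_oom_killer_model(unsigned int gfp, unsigned int order,
				      unsigned long pages_reclaimed)
{
	return should_alloc_retry_model(gfp, order, pages_reclaimed);
}
```

The point of the gate is that a costly-order or __GFP_NORETRY allocation fails before the oom killer is ever considered, instead of the oom killer path repeating those comparisons.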
Yeah, that's what
oom-give-current-access-to-memory-reserves-if-it-has-been-killed.patch
effectively does.
Pagefault ooms default to killing current first in -mm and only kill
another task if current is unkillable for the architectures that use
pagefault_out_of_memory(); the rest of the architectures such as powerpc
just kill current. So while this scenario is plausible, I don't think
there would be a large number of processes getting killed:
pagefault_out_of_memory() will kill current and give it access to memory
reserves and the oom killer won't perform any needless oom killing while
that is happening.