Re: [RFC] Parallelize IO for e2fsck

From: Bodo Eggert
Date: Thursday, January 24, 2008 - 10:32 am

IMO you'll need a userspace daemon. The kernel does only know about the
amount of memory available / recommended for a system (or container),
while the user knows which program's cache is most precious today.

(Of course the userspace daemon will in turn need the /proc file.)

I think a single, system-wide signal is the second-worst solution: All
applications (or the wrong one, if you select one) would free their caches
and start to crawl, and either stay in this state or slowly increase their
caches again until they get signaled again. And the signal would either
come too early or too late. The userspace daemon could collect the weighted
demand of memory from all applications and tell them how much to use.

-

From: Andreas Dilger
Date: Thursday, January 24, 2008 - 3:07 pm

Well, sending a few signals (maybe to the top 5 processes in the OOM killer
list) is still a LOT better than OOM-killing them without warning...  That
way important system processes could be taught to understand SIGDANGER and
maybe do something about it instead of being killed, and if Firefox and
other memory hungry processes flush some of their cache it is not fatal.

I wouldn't think that SIGDANGER means "free all of your cache", since the
memory usage clearly wasn't a problem a few seconds previously, so as
an application writer I'd code it as "flush the oldest 10% of my cache"
or similar, and the kernel could send SIGDANGER again (or kill the real
offender) if the memory usage again becomes an issue.
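
A minimal sketch of the "flush the oldest 10% of my cache" handling described
above. It rests on assumptions: Linux has no SIGDANGER, so SIGUSR1 stands in
for it, and `struct cache`, `lowmem_install()`, `lowmem_evict_count()` and
`lowmem_poll()` are hypothetical names, not part of any existing interface.
The handler only sets a flag; the actual trimming happens later in the
application's main loop.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Assumption: SIGUSR1 stands in for AIX's SIGDANGER, which Linux lacks. */
#define LOWMEM_SIGNAL SIGUSR1

/* Set by the signal handler, consumed by the main loop. */
static volatile sig_atomic_t lowmem_pending = 0;

static void on_lowmem(int signo)
{
    (void)signo;
    lowmem_pending = 1;   /* only async-signal-safe work in the handler */
}

/* Install the handler. Returns 0 on success, -1 on error. */
int lowmem_install(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_lowmem;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(LOWMEM_SIGNAL, &sa, NULL);
}

/* Hypothetical cache bookkeeping; entries are assumed to be kept
 * oldest first, so trimming the count drops the oldest entries. */
struct cache {
    size_t nentries;
};

/* Number of entries to drop: 10% of the cache, rounded up. */
size_t lowmem_evict_count(size_t nentries)
{
    return (nentries + 9) / 10;
}

/* Call from the main loop. Returns the number of entries evicted. */
size_t lowmem_poll(struct cache *c)
{
    size_t n;

    assert(c != NULL);
    if (!lowmem_pending)
        return 0;
    lowmem_pending = 0;
    n = lowmem_evict_count(c->nentries);
    c->nentries -= n;     /* a real cache would free the entries here */
    return n;
}
```

If the signal arrives again later, the next call to `lowmem_poll()` trims
another 10%, which matches the "send SIGDANGER again" escalation above.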

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-

From: Adrian Bunk
Date: Thursday, January 24, 2008 - 4:08 pm

I don't think that's something that would require finetuning on a
per-application basis - the kernel should tell all applications once to
reduce memory consumption and write a fat warning to the logs (which
will on well-maintained systems be mailed to the admin).

Your "and tell them how much to use" wouldn't work for most applications 
- e.g. I've worked the last weeks with a computer with 512 MB RAM and no 
Swap, which means usually only 200 MB of free RAM. I've gotten quite 
used to git aborting with "fatal: Out of memory, malloc failed" when 
200 MB weren't enough for git, and I don't think there is any reasonable 
way for git to reduce the memory usage while continuing to run.

In practice, there is a small number of programs that are both the
common memory hogs and should be able to reduce their memory consumption
by 10% or 20% without big problems when requested (e.g. Java VMs,
Firefox and databases come into my mind).

And from a performance point of view letting applications voluntarily 
free some memory is better even than starting to swap.

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed

-

From: Theodore Tso
Date: Thursday, January 24, 2008 - 4:40 pm

I agree, it's only a few processes where this makes sense.  But for
those that do, it would be useful if they could register with the
kernel that they would like to know (just before the system starts
ejecting cached data, just before swapping, etc.) and at what
frequency.  And presumably, if the kernel notices that a process is
responding to such requests with memory actually getting released back
to the system, that process could get "rewarded" by having the OOM
killer less likely to target that particular thread.

AIX basically did this with SIGDANGER (the signal is ignored by
default), except there wasn't the ability for the process to tell the
kernel at what level of memory pressure it should start getting
notified, and there was no way for the kernel to tell how bad the
memory pressure actually was.  On the other hand, it was a relatively
simple design.

In practice very few processes would indeed pay attention to

Absolutely.

						- Ted
-

From: Zan Lynx
Date: Thursday, January 24, 2008 - 5:25 pm

Have y'all been following the /dev/mem_notify patches?
http://article.gmane.org/gmane.linux.kernel/628653

--
Zan Lynx <zlynx@acm.org>

-

From: Andreas Dilger
Date: Friday, January 25, 2008 - 4:09 am

Having the notification be via poll() is a very restrictive processing
model.  Having the notification be via a signal means that any kind of
process (and not just those that are event loop driven) can register
a callback at some arbitrary point in the code and be notified.  I
don't object to the poll() interface, but it would be good to have a
signal mechanism also.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-

From: Zan Lynx
Date: Friday, January 25, 2008 - 5:55 pm

The commentary on the mem_notify threads claimed that the signal is
easily provided by setting up the file handle for SIGIO.

Yeah.  Here it is...copied from email written by KOSAKI Motohiro:

implement FASYNC capability to /dev/mem_notify.

<usage example>
        fd = open("/dev/mem_notify", O_RDONLY);

        fcntl(fd, F_SETOWN, getpid());

        flags = fcntl(fd, F_GETFL);
        fcntl(fd, F_SETFL, flags|FASYNC);  /* when low memory, receive SIGIO */
</usage example>
--
Zan Lynx <zlynx@acm.org>

-

From: KOSAKI Motohiro
Date: Saturday, January 26, 2008 - 4:56 am

BTW:
Of course, you can receive any signal instead of SIGIO by using fcntl(F_SETSIG)  :-)
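
A sketch of the F_SETSIG variant, under stated assumptions: /dev/mem_notify
is only a proposed patch, so the helper below takes any descriptor that
supports O_ASYNC (a pipe works on Linux). `notify_by_signal()` and
`last_notified_fd()` are hypothetical names chosen for illustration. F_SETSIG
is Linux-specific and needs _GNU_SOURCE.

```c
#define _GNU_SOURCE          /* F_SETSIG is a Linux extension */
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Descriptor reported by the most recent notification, or -1. */
static volatile sig_atomic_t notified_fd = -1;

/* SA_SIGINFO handler: with F_SETSIG, si_fd names the descriptor
 * that triggered the signal. */
static void on_notify(int signo, siginfo_t *si, void *ctx)
{
    (void)signo;
    (void)ctx;
    notified_fd = si->si_fd;
}

/* Ask the kernel to send `signo` to this process when `fd` becomes
 * readable. Returns 0 on success, -1 on error. */
int notify_by_signal(int fd, int signo)
{
    struct sigaction sa;
    int flags;

    assert(fd >= 0);
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_notify;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    if (sigaction(signo, &sa, NULL) < 0)
        return -1;

    if (fcntl(fd, F_SETOWN, getpid()) < 0)   /* who receives the signal */
        return -1;
    if (fcntl(fd, F_SETSIG, signo) < 0)      /* which signal, instead of SIGIO */
        return -1;
    flags = fcntl(fd, F_GETFL);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_ASYNC) < 0 ? -1 : 0;
}

/* Descriptor from the last notification, or -1 if none arrived yet. */
int last_notified_fd(void)
{
    return notified_fd;
}
```

Using a realtime signal such as SIGRTMIN keeps the notification separate
from any other SIGIO users in the same process.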
-

From: Bryan Henderson
Date: Friday, January 25, 2008 - 11:03 am

AIX does provide a system call to find out how much paging backing store 
space is available and the thresholds set by the system administrator. 
Running out of paging space is the only memory pressure AIX is concerned 
about.  While I think having processes make memory usage decisions based 
on that is a shoddy way to manage system resources, that's what it is 
intended for.

Incidentally, some context for the AIX approach to the OOM problem: a 
process may exclude itself from OOM vulnerability altogether.  It places 
itself in "early allocation" mode, which means at the time it creates 
virtual memory, it reserves enough backing store for the worst case.  The 
memory manager does not send such a process the SIGDANGER signal or 
terminate it when it runs out of paging space.  Before c. 2000, this was 
the only mode.  Now the default is late allocation mode, which is similar 
to Linux.

--
Bryan Henderson                     IBM Almaden Research Center
San Jose CA                         Filesystems

-

From: Bodo Eggert
Date: Friday, January 25, 2008 - 4:01 pm

If you start partitioning the system into virtual servers (or something
similar), being close to swapping may be somebody else's problem.

This is an interesting approach. It feels like some programs might be 
interested in choosing this mode instead of risking OOM. 
-- 
The programmer's National Anthem is 'AAAAAAAAHHHHHHHH'
-

From: Bryan Henderson
Date: Friday, January 25, 2008 - 6:55 pm

It's the way virtual memory always worked when it was first invented.  The 
system not only reserved space to back every page of virtual memory; it 
assigned the particular blocks for it.  Late allocation was a later 
innovation, and I believe its main goal was to make it possible to use the 
cheaper disk drives for paging instead of drums.  Late allocation gives 
you better locality on disk, so the seeking doesn't eat you alive (drums 
don't seek).  Even then, I assume (but am not sure) that the system at 
least reserved the space in an account somewhere so at pageout time there 
was guaranteed to be a place to which to page out.  Overcommitting page 
space to save on disk space was a later idea.

I was surprised to see AIX do late allocation by default, because IBM's 
traditional style is bulletproof systems.  A system where a process can be 
killed at unpredictable times because of resource demands of unrelated 
processes doesn't really fit that style.

It's really a fairly unusual application that benefits from late 
allocation: one that creates a lot more virtual memory than it ever 
touches.  For example, a sparse array.  Or am I missing something?
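
A rough Linux illustration of the sparse-array case, not AIX code: the
mapping below reserves a large range of virtual memory, and only the pages
that are actually touched become resident. `sparse_alloc()` and
`resident_pages()` are hypothetical helper names. MAP_NORESERVE and
mincore() are Linux/BSD interfaces, and the kernel may ignore MAP_NORESERVE
under strict overcommit settings.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Create `len` bytes of anonymous virtual memory without asking the
 * kernel to reserve backing store. Returns NULL on failure. */
void *sparse_alloc(size_t len)
{
    void *p;

    assert(len > 0);
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Count how many pages of [addr, addr + len) are resident in RAM.
 * `addr` must be page-aligned. Returns -1 on error. */
long resident_pages(void *addr, size_t len)
{
    size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
    size_t npages = (len + pagesz - 1) / pagesz;
    unsigned char *vec = malloc(npages);
    long count = 0;
    size_t i;

    if (vec == NULL)
        return -1;
    if (mincore(addr, len, vec) < 0) {
        free(vec);
        return -1;
    }
    for (i = 0; i < npages; i++)
        count += vec[i] & 1;    /* low bit set means the page is resident */
    free(vec);
    return count;
}
```

Mapping 256 MB and writing to three widely separated bytes should leave
only a small fraction of the pages resident (the exact number depends on
the page size and on transparent huge pages).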

--
Bryan Henderson                     IBM Almaden Research Center
San Jose CA                         Filesystems

-

From: Theodore Tso
Date: Saturday, January 26, 2008 - 6:21 am

I guess it depends on how far you try to do "bulletproof".  OSF/1 used
to use "bulletproof" as its default --- and I had to turn it off on
tsx-11.mit.edu (the first North American ftp server for Linux :-),
because the difference was something like 50 ftp daemons versus over
500 on the same server.  It reserved VM space for the text segment of
every single process, since at least in theory, it's possible for
every single text page to get modified using ptrace if (for example) a
debugger were to set a break point on every single page of every
single text segment of every single ftp daemon.

You can also see potential problems for Java programs.  Suppose you
had some gigantic Java Application (say, Lotus Notes, or Websphere
Application Server) which is taking up many, many, MANY gigabytes of
VM space.  Now suppose the Java application needs to fork and exec
some trivial helper program.  For that tiny instant, between the fork
and exec, the VM requirements in "bulletproof" mode would double,
since while 99.9999% of the time programs will immediately discard the
VM upon the exec, there is always the possibility that the child
process will touch every single data page, forcing a copy on write,
and never do the exec.

There are of course different levels of "bulletproof" between the
extremes of "totally bulletproof" and "late binding" from an
algorithmic standpoint.  For example, you could ignore the needed
pages caused by ptrace(); more challenging would be how to handle
the fork/exec semantics, although there could be kludges such as
strongly encouraging applications to use an old-fashioned BSD-style
vfork() to guarantee that the child couldn't double VM requirements
between the vfork() and exec().  I certainly can't say for sure what
the AIX designers had in mind, and why they didn't choose one of the
more intermediate design choices.  
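
The vfork()/exec() pattern mentioned above can be sketched as follows. This
is an illustration only; `run_helper()` is a hypothetical name. The child
borrows the parent's address space until it calls exec, so no copy-on-write
duplicate of the parent's VM is ever accounted for.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run the program named by argv[0] using vfork() + exec.
 * Returns the helper's exit status, 127 if exec failed, or -1 on
 * other errors. */
int run_helper(char *const argv[])
{
    pid_t pid;
    int status;

    assert(argv != NULL && argv[0] != NULL);
    pid = vfork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: shares the parent's memory until exec, so it must do
         * nothing except exec or _exit. */
        execvp(argv[0], argv);
        _exit(127);           /* exec failed */
    }
    /* Parent: resumes only after the child has exec'ed or exited. */
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

posix_spawn() offers a similar effect through a standard interface and may
be preferable where it is available.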

However, it is fair to say that "100% bulletproof" can require
reserving far more VM resources than you might first expect.  Even a
company ...
From: KOSAKI Motohiro
Date: Saturday, January 26, 2008 - 5:32 am

The mem_notify patch can provide a "just before swapping starts" notification :)

To be honest, I don't know the fs guys' requirements.
If a feature the fs guys need is missing, I will implement it with pleasure if
you tell me about it.
-

From: Al Boldi
Date: Saturday, January 26, 2008 - 6:55 am

These notifications are really useful, but it may be much wiser to pipe them 
thru some kevent-notification sub-system, instead of introducing kernel 
notifier-chain end-points left, right, and center.


Thanks!

--
Al

-

From: KOSAKI Motohiro
Date: Saturday, January 26, 2008 - 9:01 am

Aaahh
I understand your feelings well,
but the current design was decided through discussion among many people.

If anybody needs kevent notification, I will add it to the current
implementation instead of replacing it.

thanks.
-

From: Jon Masters
Date: Monday, January 28, 2008 - 4:23 pm

I looked at this a year or two back, then ran out of time. But the thing
I wanted to do was have libc's memory allocation routines extended to
handle these through reservations - the kernel should send a userspace
notification and then there should be some kind of concept of returning
memory that's been used for "opportunistic" userspace caching, e.g. in
firefox to cache the last 10 web pages. Let us know how you get on :)

Jon.


-

From: KOSAKI Motohiro
Date: Sunday, February 3, 2008 - 6:38 am

Sorry for the late response.
(I didn't notice your mail ;-)

You are right...
Stupid userspace caching is a very important problem.

But I think this is not a libc problem.
glibc malloc hardly caches any memory.
(By default it caches only 128K.)
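
For reference, a sketch of how an application could hand freed cache memory
back to the kernel with glibc. Assumptions: the 128K figure corresponds to
glibc's M_TRIM_THRESHOLD default, and `struct block_cache`,
`cache_drop_all()` and `set_trim_threshold()` are hypothetical names.
mallopt() and malloc_trim() are glibc-specific.

```c
#include <assert.h>
#include <malloc.h>   /* glibc-specific: mallopt(), malloc_trim() */
#include <stdlib.h>

/* Hypothetical opportunistic cache: an array of heap blocks. */
struct block_cache {
    void **blocks;
    size_t count;
};

/* Free every cached block, then ask glibc to return unused heap
 * memory to the kernel. Returns the number of blocks dropped. */
size_t cache_drop_all(struct block_cache *c)
{
    size_t dropped, i;

    assert(c != NULL);
    dropped = c->count;
    for (i = 0; i < c->count; i++)
        free(c->blocks[i]);
    c->count = 0;
    malloc_trim(0);   /* release as much free heap memory as possible */
    return dropped;
}

/* Change how much free memory glibc keeps at the top of the heap
 * before trimming it automatically. Returns 0 on success, -1 on error. */
int set_trim_threshold(int bytes)
{
    return mallopt(M_TRIM_THRESHOLD, bytes) == 1 ? 0 : -1;
}
```

The memory that matters here is the application's own cache; the allocator
only decides when the freed pages actually go back to the system.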

But some applications use a lot of memory for overly opportunistic caching.
I understand that we need to promote mem_notify to application developers
after it is merged into mainline.

I have no idea how to solve this easily.
-
