Poul-Henning noticed today that xchat fails to start if malloc uses sbrk internally. The failure happens during the first call to malloc, with the following message:
Fatal error 'Can't allocate initial thread' at line 335 in file /usr/src/lib/libthr/thread/thr_init.c (errno = 12)
This can be worked around with MALLOC_OPTIONS=dM. The problem does not appear to be specific to jemalloc; I reverted src/lib/libc/stdlib/malloc.c to revision 1.92 (the last phkmalloc revision), which also uses sbrk, and the failure mode is the same. The failure occurs on both i386 and amd64. It appears that sbrk(0) returns an address that is in the address range normally used by mmap, so the first call to sbrk with a non-zero increment is fantastically wrong.
On i386 (ktrace output):
1013 xchat CALL break(0x28200000)
1013 xchat RET break -1 errno 12 Cannot allocate memory
On amd64 (truss output):
break(0x800900000) ERR#12 'Cannot allocate memory'
sbrk is not a true system call, so it seems like the problem should have something to do with the _end data symbol. I looked at it in gdb though and never saw an unreasonable value, despite bogus sbrk(0) results. I do not know offhand how to get the addresses of .minbrk and .curbrk (register inspection within gdb while stepping through sbrk?), which are what sbrk actually uses (see src/lib/libc/amd64/sys/sbrk.S). Perhaps the loader isn't initializing them correctly... I am quite pressed for time at the moment, and cannot look into this in any more detail for at least a couple of weeks. If anyone knows what the problem is, please let me know. Thanks, Jason
_______________________________________________ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to "freebsd-current-unsubscribe@freebsd.org"
malloc() itself knows how much memory is _really_ in use by a program and could check that it does not go beyond the limits, but for this it needs a run-time check via a getrlimit() call on each malloc() call (a program can call setrlimit() itself). Tracking direct mmap()s and sbrk()s made outside of malloc() is also needed. -- http://ache.pp.ru/
No, the VM system has a much better idea about this. You need to think about this the right way:
There is address space allocated to the process (via sbrk/mmap).
A subset of this is address space allocated by the program (via malloc).
...and then there is memory actually in use, which is an entirely different thing, of which we currently only have some kind of clue in the VM system.
-- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence.
Then we need a sysctl to fetch that "memory actually in use" from the kernel and compare it with getrlimit(), which would allow malloc() to return 0 when needed. -- http://ache.pp.ru/
That won't help much -- malloc could have allocated some address space that hasn't (yet) been touched by the process. Just returning 0 when the amount of memory "in use" hits a limit wouldn't stop the process from then touching all the memory it has previously been allocated and exceeding the limit. -- David Taylor
In that case the process is subject to being killed by the system if it exceeds its limits. But... this is not a malloc() problem at all; malloc() is designed to detect an overflow situation, not prevent it. The malloc() problem is that it does not return 0. -- http://ache.pp.ru/
I cannot say definitively what is happening, but please note that the _end
symbol is defined by the linker script and is present in every
executable and shared object. The value you reported would naturally be
the _end value of some shared object.
I tried both RELENG_7 and HEAD, and sbrk(0) correctly returns a
seemingly valid value such as 0x8049644.
#include <sys/types.h>
#include <unistd.h>
#include <stdio.h>

int
main(void)
{
	void *p;

	/* sbrk(0) returns the current program break without moving it. */
	p = sbrk(0);
	printf("%p\n", p);
	return (0);
}
The real question is why we would revert perfectly good code (jemalloc) from using a modern interface to using one that has been obsolete for twenty years, and marked as such in the man page for seven years. If rwatson@ wants malloc() to respect resource limits, he can bloody well fix mmap(). Until he does, the datasize limit is a joke anyway, as anyone can circumvent it by either using mmap() instead of malloc() or setting _malloc_options before calling malloc(). DES -- Dag-Erling Smørgrav - des@des.no
The issue here was that there were a number of reports that out-of-control applications were toasting systems that weren't getting toasted under 6.x. I experienced this on my web server, but the ports build cluster has been running into it for months. The symptom is that a single application exhausts swap, causing all sorts of things to break (tm), the killing of other large processes, etc. To be clear, in the new world order, instead of getting NULL back from malloc(3), SIGKILL is delivered to large processes. When I e-mailed Jason Evans and Alan Cox about it, I suggested that we actually teach malloc(3) to enforce an allocation limit itself by querying a limit once at process startup, and then using its own accounting to decide when to start failing requests. As an alternative model that would require some more infrastructural changes, I suggested a new mmap() flag that hinted to the kernel that the page should count against a swap/anonymous memory limit, but that we should avoid more serious changes at the last minute before a release. Alan suggested the model Jason ended up implementing as a lower risk way to restore the 6.x resource limits non-disruptively. As it turned out, this proved much more complicated than expected. The right answer is presumably to introduce a new LIMIT_SWAP, which limits the allocation of anonymous memory by processes, and size it to something like 90% of swap space by default. Since that won't be happening before 7.0, I believe the consensus is to simply not MFC the changes for 7 and proceed with the release. However, having a resource limit on swap use in order to prevent the above scenario is actually quite important: SIGKILL of arbitrary processes is not a good way to deal with one run-away process, and the virtual memory size limit, while also useful, prevents you from limiting the allocation of swap without also ...
Huh??? Again, huh???
FreeBSD allows memory overcommit, both overcommit of physical memory resulting in paging, and overcommit of swap space. For the last few years, resource limits on the data segment size, previously observed by malloc(), have prevented processes from mallocing enough memory individually to exhaust swap on 32-bit systems. This is arguably a bug, because you actually want a single process to be able to allocate enough memory to fill its address space, but because the data segment size is used to make address space layout decisions from the inception of the process, it is rather innate to using sbrk(). Jason's new malloc uses mmap() of anonymous memory, which isn't affected by the data segment limit, and hence, as a feature, isn't limited by the resource limit. This turns out to be awkward if you have a run-away process: where previously it would simply get back an error when it tried to exceed its resource limit, now it simply consumes all your swap, which then results in overcommit. My hope was that we could re-introduce a resource limit on malloc'd memory without large changes, but that appears to have been more tricky than hoped. The goal is not to prevent overcommit, which is invaluable in UNIX systems due to the fork() model which pretty much pre-supposes it by design, but rather to prevent exhaustion of swap by a single process if not specifically allowed by the administrator (in the same way we limit all sorts of other things, like open files, mbufs, socket buffer memory, etc). The right way to do it is to provide a specifically configurable process limit on swap use, the same way we did for data segment size, only not data segment size, but that was considered likely too risky for 7.0. Robert N M Watson Computer Laboratory University of Cambridge
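The idea raised in this thread, of having the allocator read a limit once at startup and then fail requests based on its own accounting, could be sketched roughly as follows. This is only an illustration and not how jemalloc or phkmalloc work: acct_init(), acct_malloc() and acct_free() are hypothetical names, the caller hands the size back on free to keep the sketch short, and the code is not thread-safe. As pointed out earlier in the thread, such a scheme also misses a later setrlimit() call and any mmap() or sbrk() done outside the wrapper.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/resource.h>

static size_t acct_limit = SIZE_MAX;	/* byte budget, fixed at init time */
static size_t acct_used;		/* bytes currently handed out */

/*
 * Set the budget. A non-zero cap is used as-is (handy for testing);
 * cap == 0 means "read RLIMIT_DATA once", leaving the budget unlimited
 * if the limit is infinite or cannot be read.
 */
void
acct_init(size_t cap)
{
	struct rlimit rl;

	acct_used = 0;
	acct_limit = SIZE_MAX;
	if (cap != 0)
		acct_limit = cap;
	else if (getrlimit(RLIMIT_DATA, &rl) == 0 &&
	    rl.rlim_cur != RLIM_INFINITY &&
	    (uintmax_t)rl.rlim_cur < (uintmax_t)SIZE_MAX)
		acct_limit = (size_t)rl.rlim_cur;
}

/* Fail with ENOMEM once the request would push usage past the budget. */
void *
acct_malloc(size_t len)
{
	void *p;

	if (len > acct_limit - acct_used) {
		errno = ENOMEM;
		return (NULL);
	}
	p = malloc(len);
	if (p != NULL)
		acct_used += len;
	return (p);
}

/* The caller passes back the size it originally asked for. */
void
acct_free(void *p, size_t len)
{
	if (p == NULL)
		return;
	assert(len <= acct_used);
	free(p);
	acct_used -= len;
}
```

With a budget of 1024 bytes, a 512-byte request succeeds, a following 1024-byte request returns NULL, and the same request succeeds again once the first block has been freed. Note that this counts requested address space, not memory actually touched, which is exactly the distinction the thread keeps returning to.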
For the same reason as it has for the last 20 years or so: memory overcommit, which means that malloc() allocates address space, not memory. Actual memory is allocated on-demand when the address space is used (read from or written to). If there is no RAM left and none can be freed by swapping out, the process gets killed. The process that gets killed is not necessarily the memory hog, it is merely the process that is unlucky enough to touch a new page at the wrong moment, i.e. when all RAM and swap is exhausted *or* everything in RAM is wired down and unswappable. Of course, if you're afraid of memory overcommit and you know in advance how much memory you need, you can simply allocate a sufficient amount of address space at startup and touch it all. This way, you will either be killed right away, or be guaranteed to have sufficient memory for the rest of your (process) lifetime. Alternatively, do what Varnish does: create a large file, mmap it, and allocate everything you need from that area, so you have your own private swap space. Just make sure to actually allocate the disk space you need (by filling the file with zeroes, or at the minimum writing a zero to the file every sb.st_blksize bytes, preferably sequentially to avoid excessive fragmentation) or you may run into the same problem as with malloc() if the disk fills up while your backing file is still sparse. The ability to specify a backing file to use instead of anonymous mappings would be a cool addition to jemalloc. DES -- Dag-Erling Smørgrav - des@des.no
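A minimal sketch of the backing-file approach described above, assuming that one byte written per sb.st_blksize stride is enough to make the file system allocate each block (that is the message's suggestion; file systems with compression or copy-on-write semantics may behave differently). backing_map() is a hypothetical helper name, and a fallback stride of 512 is an assumption for the unlikely case that st_blksize is reported as zero.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create (or reuse) a file of len bytes, write one zero byte per
 * st_blksize stride in ascending order so the blocks get allocated,
 * then map the file shared. Returns MAP_FAILED on any error.
 */
void *
backing_map(const char *path, size_t len)
{
	struct stat sb;
	void *p = MAP_FAILED;
	const char zero = 0;
	off_t off, step;
	int fd;

	assert(len > 0);
	fd = open(path, O_RDWR | O_CREAT, 0600);
	if (fd == -1)
		return (MAP_FAILED);
	if (ftruncate(fd, (off_t)len) == -1 || fstat(fd, &sb) == -1)
		goto out;
	step = sb.st_blksize > 0 ? (off_t)sb.st_blksize : 512;
	/* Sequential writes, one per block, to avoid leaving holes. */
	for (off = 0; off < (off_t)len; off += step)
		if (pwrite(fd, &zero, 1, off) != 1)
			goto out;
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
out:
	/* The mapping, if any, stays valid after the descriptor is closed. */
	close(fd);
	return (p);
}
```

A caller would carve its own allocations out of the returned region and munmap() it when done. If the disk is already full, the failure shows up here as a failed write rather than later as a signal in the middle of unrelated work.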
Broadcasting SIGDANGER would be a much better option; followed by SIGTERM to the memory hogger (to allow for graceful termination) and only then SIGKILL. I can imagine a few (legitimate) scenarios when a That would be really cool and even better if it allocated it in a contiguous chunk. Igor :-)
That would create a nicely sized 'hole' in the starting blocks. What Dag-Erling describes is the correct(TM) way of making sure that all blocks have been allocated from the backing store of the file.
We don't currently have SIGDANGER, but the signal code was rewritten years ago to allow more than 32 signals precisely for the purpose of implementing an AIX-like SIGDANGER. This wasn't done, however, and eventually SIGTHR was the first new signal to take advantage of the No. First of all, you're thinking of lseek(), not fseek(). Second, an lseek() beyond the end of a file will not actually extend the file. Third, ftruncate() (which *will* extend a file if it is shorter than the requested length) or lseek() followed by write() will not allocate physical disk space except for the data actually written; it will create a sparse file, which when later written to will become fragmented, resulting in horrible performance. DES -- Dag-Erling Smørgrav - des@des.no
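The difference between a file's length and the disk space behind it can be observed with fstat(). The following is a small sketch, assuming st_blocks is counted in 512-byte units (true on FreeBSD and Linux, but not guaranteed everywhere); sparse_probe() is a hypothetical helper name.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Extend a fresh file to len bytes with ftruncate() only, then report
 * its logical size and the number of bytes actually allocated on disk.
 * Returns 0 on success, -1 on error.
 */
int
sparse_probe(const char *path, off_t len, off_t *logical, off_t *allocated)
{
	struct stat sb;
	int fd, rc = -1;

	assert(len >= 0);
	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd == -1)
		return (-1);
	if (ftruncate(fd, len) == 0 && fstat(fd, &sb) == 0) {
		*logical = sb.st_size;
		*allocated = (off_t)sb.st_blocks * 512;
		rc = 0;
	}
	close(fd);
	return (rc);
}
```

On a file system that supports sparse files, the allocated figure would typically come back far smaller than the logical size, which is the over-commit being described: the blocks only get assigned when pages are eventually written.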
In reply to message <86myrlahee.fsf@ds4.des.no> from Dag-Erling Smørgrav: SIGDANGER is not what we need. What we need is an intelligent mechanism to tell applications what the overall situation is, so that jemalloc and aware applications can tune their usage pattern to the availability of physical and virtual memory. Instead of the binary "SIGDANGER" indication we need a more gradual state, at the very least three states: "plenty", "getting a bit tight" and "crunchtime". Having a signal to indicate changes of the state may make sense, but in a crunch, you don't want to wake all processes and page them in just to tell them that you're short on memory; it would have to be a signal that doesn't schedule the recipient process until something else does. -- Poul-Henning Kamp
This makes memory management in the userland hideously and unnecessarily complicated. It's simpler to have SIGDANGER (meaning, free all you can) -> SIGTERM (terminate gracefully) -> SIGKILL (too late, I'm killing you anyway); and maybe a MIB in sysctl like ...vm.overcommit_action, with 'soft' meaning SIGDANGER->SIGTERM->SIGKILL and 'hard' meaning SIGKILL, so the sysadmin at least has a choice. Igor
You don't seem to understand what Poul-Henning was trying to point out, which is that broadcasting SIGDANGER can make a bad situation much, much worse by waking up and paging in every single process in the system, including processes that are blocked and wouldn't otherwise run for several minutes, hours or even days (getty, inetd, sshd, mountd, even nfsd / nfsiod in some cases can sleep for days at a time waiting for I/O). DES -- Dag-Erling Smørgrav - des@des.no
By making the default action for SIGDANGER SIG_IGN, this problem would be mostly solved. Only processes that actually care about SIGDANGER and install a handler for it would then need to perform any non-trivial, resource-hungry operation.
In reply to message <20080104134829.GA57756@deviant.kiev.zoral.com.ua> from Kostik Belousov: This is a non-starter; if SIGDANGER is to have any effect, all processes that use malloc(3) should react to it. -- Poul-Henning Kamp
This depends on what SIGDANGER is supposed to indicate. IMO, a single signal is inadequate - you need a "free memory is less than desirable, please reduce memory use if possible" and one (or maybe several levels of) "memory is really short, if you're not important, please die". The former could reasonably default to SIG_IGN - processes that are in a position to release memory on demand could provide a handler to do so. (This could potentially include malloc returning space on its freelist to the kernel). The latter should default to "terminate process" and a process that considers itself "important" enough can trap it. -- Peter Jeremy Please excuse any delays as the result of my ISP's inability to implement an MTA that is either RFC2821-compliant or matches their claimed behaviour.
That's what I have been advocating for the last 10 years... -- Poul-Henning Kamp
That makes the userland side of things unnecessarily overcomplicated. If a process handles SIGDANGER then let it do so and assume it's important enough to be left alone; if a process doesn't handle SIGDANGER then send it SIGTERM, then SIGKILL; but in any case SIGTERM *should* precede SIGKILL - the processes ought to be allowed to terminate gracefully. Igor :-)
In reply to message <a2b6592c0801070515g37735475kc0922af8f93723ca@mail.gmail.com> from Igor: Yes, but you will not see this complication; it will be hidden in the implementation of malloc(3). Every problem has a simple, easy to understand solution that does not work. SIGDANGER is one of these. It didn't work well on AIX and it won't do so on FreeBSD either. The problem simply requires more than one bit of feedback information to get a sensible regulation. -- Poul-Henning Kamp
In reply to the message of Mon, 07 Jan 2008 13:18:47 +0000: How could you hide it inside malloc? Would malloc start returning 0 after receiving the "less mem than desirable" signal? Would it ever go back to returning non-zero? I thought that the idea of things like SIGDANGER was that applications would be written to have a mode where they could shut down some aspect of their operation, and free resources. I don't see how you can do that, autonomously, from within malloc. Maybe introduce a special flavour of pointer value, returned by a special version of malloc for "cache" objects, that the system is allowed to automatically reclaim? Then programs would need to be able to handle SIGSEGV when accessing those... Cheers, -- Andrew
I'm with Andrew on this one. The only (sensible) way I could see it being hidden behind malloc() is if malloc() blocks until sufficient memory becomes available. I thought the real idea behind SIGDANGER was to tell the kernel "I kind of know what I'm doing, so if you're gonna kill something don't kill me", and that was achieved by AIX not SIGKILLing processes that had sigaction(SIGDANGER) != SIG_IGN. Igor :-)
In reply to message <a2b6592c0801071606g4c0dcb9ap117e345fda5e7e5f@mail.gmail.com> from Igor: You should read some recent literature on malloc(3); my own and Jason's papers are good places to start. For performance reasons, malloc(3) will hold on to a number of pages that theoretically could be given back to the kernel, simply because it expects to need them shortly. Such parameters and many others of the malloc implementation can be tweaked to "waste" more or less memory, in response to a sensibly granular indication from the kernel about how bad things are. Also, many subsystems in the kernel could adjust their memory use in response to a "memory pressure" indication: if memory is tight, we could cache vnodes and inodes less aggressively; if things are going truly bad, we can even ditch all non-active entries from these caches. If one implements this with three states:
Green - "all clear"
Yellow - "tight" - free one before you allocate one if you can.
Red - "all out" - free all that you sensibly can.
and implements strategies like those I propose above (and have proposed for the last 10 years), then it is very unlikely that the system would ever get into the red state, because the yellow state will mitigate and reduce the memory pressure. Nothing prevents an intelligent process from listening in and doing sensible things; firefox could ditch the memory cache of pages, for instance. But we can't get anywhere until some VM wizard produces the three "lamps" for us to look at in the first place; that's where we have been stuck for the last 10 years. -- Poul-Henning Kamp
Although the primary concern is malloc(), I would like to point out that various programs implementing copying garbage collection could more efficiently give memory back to the system than malloc(), and could therefore benefit more than malloc() from some kind of feedback from the kernel. There was concern over the complexity involved with intelligently doing something about the memory pressure hints in userspace, but this does not apply here since the allocator/garbage collection would be the equivalent of malloc() and complexity there would not affect application code. The problem with malloc() is that, unless I am missing something, malloc will never be able to give back memory to the kernel except insofar as the memory mapped is contiguously unused between some location and the break (in the case of sbrk()) or over the entire range (mmap()). malloc() cannot force this to be the case, since pointers must remain valid. The possibility of reclamation is then often going to be limited to completely unused space being held by malloc() for future use, rather than also applying to areas already used for allocation. Programs implementing copying GC, or able for some other reason to move allocated memory around, could compact the heap and give back left-over memory. In some cases this would only entail a temporary improvement due to defragmentation, but in others (such as a long-running program spiking in memory use, only then to drop a lot of that memory) it could have a pretty massive effect on memory use. Where a malloc() using program might be unable to sbrk() or munmap() because there happens to be some left-over non-free piece of memory at the top of the mapped range, a GC could use indications from the system to ensure this is not the case (depending on details of the implementation; for example, compaction of tenured generations could be forced early, etc). (This i...
Actually, malloc(3) can use madvise(2) to notify the kernel that arbitrary pages in the arena are unused and can be discarded. The current implementation will do so if the H option is specified. DES -- Dag-Erling Smørgrav - des@des.no
Ah, interesting. I was not aware of that. However, in this context it will likely only help partially since you still need a full page to be free (and with a lot of programs many allocations will be significantly smaller than that, and I have to assume no real-life malloc will align all allocations to pages, or the overhead would be extreme). -- / Peter Schuller PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller@infidyne.com>' Key retrieval: Send an E-Mail to getpgpkey@scode.org E-Mail: peter.schuller@infidyne.com Web: http://www.scode.org
Page-aligning every allocation would be supremely stupid, and jemalloc does so only for allocations larger than a page. DES -- Dag-Erling Smørgrav - des@des.no
I misread your "no" as "any", so it seems we are in violent agreement. However, most allocators these days are zone or slab allocators (or similar in principle), and are pretty good at minimizing external fragmentation except for pathological cases, which are surprisingly rare in practice. DES -- Dag-Erling Smørgrav - des@des.no
Can you provide some refs/links, unfortunately googling for I don't think it's the kernel that is being ill-mannered (unless, of course, it's running ZFS ;-)) by eating up the memory, it's the user How do you propose they 'eavesdrop' on the kernel? Bearing in mind that most apps nowadays are written for Linux and are hacked to be portable afterwards (just look at the number of patches in the ports tree), it's much simpler to write a signal handler than FreeBSD-kernel I think the problem is not in providing the lamps to indicate the state, but figuring out an algorithm for judging green->yellow and yellow->green transitions... Igor
In reply to message <a2b6592c0801071657s43fcc739jac09baedef7b7532@mail.gmail.com> from Igor: http://phk.freebsd.dk/pubs -- Poul-Henning Kamp
On Tue, 8 Jan 2008 00:57:21 +0000 "Igor Mozolevsky" <igor@hybrid-lab.co.uk> wrote: Try PHK+malloc or just phkmalloc for better results. Looking for misspelled acronyms can be a frustrating and futile undertaking indeed :) -- Alexander Kabaev
On Tue, 08 Jan 2008 00:17:04 +0000 Aah, OK, so there's some essentially system-level caching going on behind the scenes, and that's readily malleable for this sort of thing. I thought that you were proposing some way to propagate the "yellow" or "red" conditions to user-program activity through malloc, which seems hard, since the only official out-of-band signal there is a zero return. I'll have to track down your papers, though, because I thought that the whole problem revolved around the fact that malloc(3) doesn't hand out physical pages at all: that was left up to the kernel vm pager to do as needed. Is it zeroed (and therefore I agree. That sort of auto-tuning of the space/speed trade-off I imagine that even if the accounting can be managed efficiently, the specification of the specific thresholds would be fairly tricky to specify... Cheers, -- Andrew
Another aspect of the problem is that applications have come to depend on malloc(3) returning NULL when memory is getting tight, and while we have never done exactly that, we have historically had malloc(3) return NULL when we get close to the process data segment size. Robert N M Watson Computer Laboratory University of Cambridge
I don't do that any more. Unless the program I'm writing is intended to run for a long time and can gracefully handle an out-of-memory situation (such as denying client requests until the situation improves), I write malloc() wrappers which zero the allocated region before returning to the caller, to force a SIGSEGV and spare the caller from having to check the return value. I sometimes also allocate a little bit extra and stick a magic signature and an allocation length in there so my free() wrapper can check for bugs and zero the allocated memory before freeing it. I wouldn't need any of this if my code only ran on FreeBSD, but most of my $DAYTIME_JOB code these days runs on Linux first and FreeBSD second. DES -- Dag-Erling Smørgrav - des@des.no
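A wrapper along the lines described above might look like the sketch below. It is not DES's actual code: the names xmalloc(), xsize() and xfree(), the header layout and the XMAGIC value are all made up for illustration. One deliberate difference is that the sketch calls abort() on a NULL return instead of relying on the fault from zeroing a NULL pointer, since writing through NULL is undefined behaviour in C. The zeroing goes through a volatile pointer so the compiler cannot fold it into calloc() or drop it before free().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define XMAGIC 0x5ca1ab1eU	/* arbitrary signature value */

/* Header placed in front of each block; the union keeps user data aligned. */
union xhdr {
	struct {
		size_t len;
		unsigned int magic;
	} h;
	max_align_t align;
};

/* Zero n bytes, touching every page, in a way the compiler must keep. */
static void
wipe(void *p, size_t n)
{
	volatile unsigned char *v = p;

	while (n-- > 0)
		*v++ = 0;
}

/* Never returns NULL: the process dies here if memory is unavailable. */
void *
xmalloc(size_t len)
{
	union xhdr *hdr;

	if (len > SIZE_MAX - sizeof(*hdr))
		abort();
	hdr = malloc(sizeof(*hdr) + len);
	if (hdr == NULL)
		abort();
	wipe(hdr, sizeof(*hdr) + len);
	hdr->h.len = len;
	hdr->h.magic = XMAGIC;
	return (hdr + 1);
}

/* Size originally requested for a block returned by xmalloc(). */
size_t
xsize(const void *p)
{
	const union xhdr *hdr = (const union xhdr *)p - 1;

	assert(hdr->h.magic == XMAGIC);
	return (hdr->h.len);
}

/* Check the signature, scrub the block and header, then free it. */
void
xfree(void *p)
{
	union xhdr *hdr;

	if (p == NULL)
		return;
	hdr = (union xhdr *)p - 1;
	assert(hdr->h.magic == XMAGIC);
	wipe(hdr, sizeof(*hdr) + hdr->h.len);
	free(hdr);
}
```

Touching every byte at allocation time means that, under overcommit, a shortage shows up inside xmalloc() rather than at some arbitrary later access. Scrubbing on free also clears the signature, so a double free is likely to trip the assertion.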
Do everyone a favour and research the topic in the archives, please. Another thread on the subject will just waste everyone's time. Kris
That will create a sparse file without file system blocks to back it, and is effectively also over-commit. When the file system runs out of room, you will get SIGSEGV when the vnode pager discovers it can't write a page to disk. If you zero-fill it, the blocks are pre-allocated. In a more ideal world, we might support an ioctl or system call to pre-allocate but not hook up the blocks until they were written to, in order to avoid writing lots of zeros to disk, but we don't live in that ideal world yet. Allowing malloc to support alternative sources of pages for memory mapping, such as specific files, would be very neat indeed. Robert N M Watson Computer Laboratory University of Cambridge
Surely you should not be allowed to overcommit on fseek() followed by write(,,1); zeroing out gigs of hdd space seems rather silly... Igor
Sparse files are a feature. It just becomes inconvenient at that point because you discover the lack of space asynchronously from a useful user process event. When memory pressure gets high, the vnode pager decides it's time to push a dirty page to disk, and then discovers that there are no free blocks on the file system to write to. As I mentioned in my e-mail, it would be nice if our file system supported a way to reserve blocks for files without hooking them up to the file's visible address space (in order to avoid zeroing them, which is required if you do want to hook them up for an unprivileged process). However, that feature doesn't currently exist. Many systems with sensitivity to on-demand allocation costs and without security requirements allow files to be extended without zeroing. On systems with security requirements, this becomes a privileged operation (such as on Mac OS X) because exposing unzeroed pages from other files or processes not explicitly shared is Not Allowed. Robert N M Watson Computer Laboratory University of Cambridge
Even for files which are intended to be filled up immediately, telling the file system ahead of time how much data will be written would allow it to make much better layout decisions. DES -- Dag-Erling Smørgrav - des@des.no
Not a good solution on its own. You need a per-process limit as well, otherwise a malloc() bomb will still cause other processes to fail. Thank you :) DES -- Dag-Erling Smørgrav - des@des.no
ly. That was what I had in mind, the above should read RLIMIT_SWAP. Robert N M Watson Computer Laboratory University of Cambridge
Robert Watson wrote: > On Fri, 4 Jan 2008, Dag-Erling Smørgrav wrote:
Oh, I thought that I was the sole user of the patch. What problems did you encounter while testing it? What do you mean by "do 90% of swap"?
Ok. The patch really imposes two kinds of limits:
- the total amount of anon memory that can be allocated in the whole system (this is what I called "disabling overcommit");
- a per-user RLIMIT_SWAP limit, which accounts allocations by uid. This has some obvious problems with the setuid(2) syscall. AFAIR, I ended up not moving the accounted numbers to the new uid.
Both limits can be turned on/off independently. Maybe it is time to revive it.
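To make the two limits concrete, here is a small userland model of the accounting idea. It is only a sketch under assumptions: the structure, the field names and swap_reserve()/swap_release() are hypothetical and are not taken from the patch, which does this inside the kernel VM system.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* All names below are hypothetical; this models the idea, not the patch. */
#define NUID 4

struct swap_acct {
	bool   global_on;        /* "overcommit disabled" knob */
	bool   peruid_on;        /* per-uid RLIMIT_SWAP knob */
	size_t global_limit;     /* anon memory allowed system-wide */
	size_t global_used;
	size_t uid_limit[NUID];  /* per-uid limit */
	size_t uid_used[NUID];
};

/* True if `n` more bytes fit under `limit`, without overflow. */
static bool
fits(size_t used, size_t limit, size_t n)
{
	return (used <= limit && n <= limit - used);
}

/* Try to reserve `n` bytes of anonymous memory on behalf of `uid`. */
bool
swap_reserve(struct swap_acct *a, unsigned uid, size_t n)
{
	assert(uid < NUID);
	if (a->global_on && !fits(a->global_used, a->global_limit, n))
		return (false);
	if (a->peruid_on && !fits(a->uid_used[uid], a->uid_limit[uid], n))
		return (false);
	/* Usage is always accounted, even while a knob is switched off. */
	a->global_used += n;
	a->uid_used[uid] += n;
	return (true);
}

/* Give back `n` bytes previously reserved by `uid`. */
void
swap_release(struct swap_acct *a, unsigned uid, size_t n)
{
	assert(uid < NUID);
	assert(n <= a->uid_used[uid] && n <= a->global_used);
	a->global_used -= n;
	a->uid_used[uid] -= n;
}
```

The setuid(2) difficulty is visible even in this toy model: the usage is keyed by uid, so when a process changes uid the model has no record of how much of the old uid's total belonged to that process.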
Implementing a per-process limit would help fix the setuid() problem, since the usage of the process calling setuid() would be known and could be transferred to the new user. There could however be a problem when a process creates a MAP_SHARED | MAP_ANON mapping, then fork()s, and the child calls setuid() (think privilege separation). Hopefully, this case is rare enough (malloc() always uses MAP_PRIVATE) that it can be handled using the most restrictive interpretation possible rather than trying to be painstakingly precise. (BTW, Skip, I find your MUA's use of Mail-Followup-To: offensive; if you don't want a copy of the followup, set the followup address to the list, not to a random previous participant in the thread) DES -- Dag-Erling Smørgrav - des@des.no
You don't want the default to be so high. You want a low default, with the possibility for the admin to increase the limit for a particular user in login.conf or similar without rebooting (which is currently not possible since the default datasize == maxdsiz, which can only be changed in the kernel config or loader.conf). You may also want to have a collective limit for unprivileged users, so root will still be able to log in if something goes wrong. DES -- Dag-Erling Smørgrav - des@des.no
This will presumably only work for console logins, as sshd (etc) will depend on unprivileged users, but perhaps that is fine. I'm less concerned with the details of the implementation or policy than that we simply be able to support even a basic policy and have it configured by default to prevent foot-shooting. Robert N M Watson Computer Laboratory University of Cambridge
I'm not sure that I like that very much, at least not the way that it has been explained here, so correct me if I misunderstood. I have long-lived processes that continuously handle very valuable data and potentially get very large (several GB). I'd like that process to be able to make a rational decision about what happens to its memory contents when an allocation fails rather than having the proverbial rug pulled out from under it. Rug pulling at any point can cost an annual salary or two. Ian -- Ian Freislich
I need to make a slight correction there: some time ago the patch at http://people.freebsd.org/~kib/overcommit/index.html worked, at least I believe so. I implemented an overcommit turn-off knob and exact anonymous memory accounting. Quite possibly the code has rotted since then.
That is a pretty damning argument in my mind. Why make such a major change right before the release when it's effectively useless? Scott
The motivation for the change is to preserve POLA, as malloc() did honor RLIMIT_DATA in previous releases (4.x, 6.x, etc.). That said, I think RLIMIT_VMEM is probably more useful going forward. I know at work we have lots of hacks to deal with maxdsiz and to let apps that use both large malloc() and large mmap() regions cooperate. Having one resource limit for malloc + mmap is probably best for the future. -- John Baldwin
If it were happening on a stable branch, I'd agree more with the POLA argument. The tradeoff between last minute destabilization, which is exactly what happened here, and the highly imperfect and antiquated justification, is pretty bogus. Scott
The reason I'm more of a fan of introducing RLIMIT_SWAP is that I'd like to be
able to specifically avoid swap exhaustion by a process without preventing it
When Alan proposed this as the approach, it was presumably under the
assumption that it would be non-disruptive. As it has proven highly
disruptive, it's obviously not getting MFC'd for the release. Instead we'll
have to work on a solution for after .0, but make sure to document that the
default swap resource limits effectively enforced in all prior FreeBSD
releases are *not* enforced on 7.0, and that administrators wanting to prevent
users from exhausting swap accidentally with something like the following:
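#include <err.h>
#include <stdlib.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
	char *c;

	/* Allocate and touch one page per iteration, never freeing it. */
	while (1) {
		c = malloc(getpagesize());
		if (c == NULL)
			err(-1, "malloc");
		*c = 'a';
	}
}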
will need to now manually set the virtual memory limit in login.conf. Note
that the above strongly resembles frequently run CGI scripts written by many
naive CGI script authors, so is something that we'd like to be robust against
in the same way we prefer to be robust against:
#include <unistd.h>

int
main(int argc, char *argv[])
{
	while (1) {
		fork();
	}
}
Smacking the user is obviously a good idea, but taking down the multi-user web
server is not.
Robert N M Watson
Computer Laboratory
University of Cambridge
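For illustration, the effect of a virtual memory limit on the malloc() loop above can be sketched in a contained way. This is an assumption-laden example, not something from the thread: the helper name run_with_vm_limit() is hypothetical, the limit is applied with setrlimit(RLIMIT_AS), which FreeBSD also spells RLIMIT_VMEM, and the loop runs in a forked child with a safety cap so that it cannot exhaust swap if the limit is not enforced.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHUNK     (1024 * 1024)  /* allocate 1 MB at a time */
#define MAXCHUNKS 4096           /* safety cap: stop after 4 GB */

/*
 * Hypothetical helper: fork a child, cap its address space at `limit`
 * bytes, and let it allocate until malloc() fails.
 * Returns 0 if malloc() returned NULL under the limit, 1 if the safety
 * cap was reached first, 2 if the limit could not be set, and -1 if
 * fork() or waitpid() failed or the child was killed.
 */
int
run_with_vm_limit(rlim_t limit)
{
	struct rlimit rl;
	pid_t pid;
	int status;

	assert(limit > 0);
	pid = fork();
	if (pid < 0)
		return (-1);
	if (pid == 0) {
		if (getrlimit(RLIMIT_AS, &rl) != 0)
			_exit(2);
		/* Only lower the soft limit; never try to raise the hard one. */
		rl.rlim_cur = limit < rl.rlim_max ? limit : rl.rlim_max;
		if (setrlimit(RLIMIT_AS, &rl) != 0)
			_exit(2);
		for (int i = 0; i < MAXCHUNKS; i++) {
			/* volatile keeps the compiler from eliding the allocation */
			volatile char *c = malloc(CHUNK);

			if (c == NULL)
				_exit(0);   /* the limit did its job */
			*c = 'a';
		}
		_exit(1);
	}
	if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
		return (-1);
	return (WEXITSTATUS(status));
}
```

Calling run_with_vm_limit(128UL << 20) should make the child's malloc() fail after roughly a hundred chunks on a system that enforces the limit. An administrator would normally get the same effect for login sessions through the virtual memory limit in login.conf rather than through code.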
Also, may I humbly inject a user centric view here - it is pretty annoying to be limited to 500 MB of mallocable memory on 32 bit machines when you expect 3 GB to be usable (with 1 GB mapped to the kernel).

I scratched my head for a long time as to why I was getting out of memory errors in spite of carefully setting resource limits and ensuring virtual memory was available; at some later point in time I discovered the hard-coded distinction between sbrk():able and mmap():able memory. I am not sure where I was supposed to find this in the documentation (I found it by chance Googling).

If sbrk() is indeed to be used by the default malloc, one definitely user visible annoyance will be the 500 MB limit. At least with mmap() that will be 2.5 GB, unless I am mistaken, which is much closer to what one might expect and thus less likely to cause problems in the common case.

Changing maxdsiz to be > 500 MB is probably bad too, from a user centric view, since you don't want to cause the equivalent problems for programs that do not use malloc(), but are indeed coded with "modern virtual memory" (as the man page calls it) in mind. Better to leave this problem to those programs that use sbrk() directly.

Another consequence is that if the sysadmin really wants a maximum amount of mmap():able memory
