On Fri, 18 Apr 2008 17:37:36 GMT This patch will cause kernels to crash. It has no changelog which explains or justifies the alteration. afaict the patch was not posted to the mailing list and was not discussed or reviewed. --
what mainline kernels crash and how will they crash? Fedora and other distros have had 4K stacks enabled for years: $ grep 4K /boot/config-2.6.24-9.fc9 CONFIG_4KSTACKS=y and we've conducted tens of thousands of bootup tests with all sorts of drivers and kernel options enabled and have yet to see a single crash due to 4K stacks. So basically the kernel default just follows the common distro default now. (distros and users can still disable it) Ingo --
There has been a dribble of reports - I don't have the links handy, nor did I doubt if you're testing things like nfsd-on-xfs-on-md-on-porky-scsi-driver. Enable CONFIG_DEBUG_STACK_USAGE. Monitor the results. It's so scary that Apparently not. I wouldn't enable it if I had a distro. Anyway. We should be having this sort of discussion _before_ a patch gets merged, no? --
If by other distros you mean RHEL then yes. However, openSUSE, Ubuntu, and Mandriva all still have 8K stacks. I know of no other distributions that default to 4K. -- Shawn --
On Sat, 19 Apr 2008 09:59:48 -0500
centos, oracle and redflag tend to follow the RHEL/fedora settings.
To be honest, at this point we're at a situation where
* Several very popular distributions have had this enabled for 5+ years, apparently without any real issues (otherwise the enterprise releases would have turned this off)
* The early "hot known issues" have been resolved afaik; things like block device stacking and symlink recursion lookups are either no longer recursive, or a lot less recursive than they used to be.
There are clear benefits to 4K stacks (no need to reiterate the flamewar, but worth mentioning)
* Less memory consumption in the lowmem zone (critical for enterprise use, also good for general performance)
* Kernel stacks at 8K are one of the most prominent order-1 allocations in the kernel; again with big-memory systems the fragmentation of the lowmem zone is a problem (and the distros that ship 4K stacks went there because of customer complaints)
On the flipside the arguments tend to be
1) certain stackings of components still run the risk of overflowing
2) I want to run ndiswrapper
3) general, unspecified uneasiness.
For 1), we need to know which they are, and then solve them, because even on x86-64 with 8k stacks they can be a problem (just because the stack frames are bigger, although not quite double, there). I've not seen any recent reports; I'll try to extend the kerneloops.org client to collect the "stack is getting low" warning to be able to see how much this really happens.
For 2), the real answer there is "ndiswrapper needs 12kb not 8kb".
For 3), this is hard to deal with but also generally unfounded... you can use this argument against any change in the kernel.
--
and let's observe that 8K stacks are of course still offered, so if anyone disables 4K stacks in the .config, it will stay disabled. Ingo --
While you change the default, maybe move it also from the "Kernel hacking" menu into the "General setup" menu? An option with default=y is probably not an option that is targeted towards kernel hackers only. -- Stefan Richter -=====-==--- -=-- =--== http://arcgraph.de/sr/ --
Except, apparently, not, at least in my experience. Ask the xfs guys if they see stack overflows on x86_64, or on x86. I've personally never seen common stack problems with xfs on x86_64, but it's very common on x86. I don't have a great answer for why, but That sounds like a very good thing to collect, and maybe if I re-send a "clearly state stack overflows at oops time" patch you can easily keep tabs. Thanks, -Eric --
On Sat, 19 Apr 2008 21:36:16 -0500 if you actually go over on x86, it's not unlikely that you're getting close to the edge on 64 bit. One thing I've learned with the kerneloops.org work is that people don't read ... which makes me think we need to strengthen this part of the kernel. (and then have kerneloops.org collect the issues) If there's a clear pattern in the backtraces we will find it. And then we can fix it... which is absolutely the right thing, I don't think anyone disagrees with that. So yes if you can dig up your patch, yes please! -- If you want to reach me at my work email, use arjan@linux.intel.com For development, discussion and tips for power savings, visit http://www.lesswatts.org --
We see them regularly enough on x86 to know that the first question to any strange crash is "are you using 4k stacks?". In comparison, Why? Because XFS makes extensive use of 64 bit types and so stack usage in the critical paths changes by a relatively small amount between 32 bit and 64 bit machines. IIRC, x86_64 only uses about 30% more stack than x86. So given that the stack doubles on x86_64 and we only increase usage (in XFS) from about 1500 bytes to 2000 bytes of stack usage, we have *lots* more stack space to spare on x86_64 compared to 4k stacks on x86.... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --
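The headroom argument above is simple arithmetic, and a small sketch can make it concrete. The helper names below are invented for this note and are not kernel functions, and the figures ignore anything else that shares the stack allocation, so treat the results as rough upper bounds on the free space.

```c
#include <assert.h>

/* Bytes left on a stack of stack_size bytes once a call chain has
 * consumed `used` bytes. A negative result means an overflow. */
static long stack_headroom(long stack_size, long used)
{
    assert(stack_size > 0 && used >= 0);
    return stack_size - used;
}

/* Grow a 32-bit usage figure by growth_percent (integer arithmetic). */
static long scale_usage(long used_32bit, long growth_percent)
{
    assert(used_32bit >= 0 && growth_percent >= 0);
    return used_32bit + used_32bit * growth_percent / 100;
}
```

With the figures quoted in the message, stack_headroom(4096, 1500) leaves 2596 bytes on x86 with 4k stacks, while stack_headroom(8192, 2000) leaves 6192 bytes on x86_64. scale_usage(1500, 30) gives 1950, which is in line with the "about 2000 bytes" estimate.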
MontaVista offers 4k stacks for arm (currently an external patch) and
markets that as a feature to customers, so many of them might use it.
In-kernel the sh and m68knommu ports also offer 4k stacks (for both
archs there's also a defconfig using it), and the mn10300 port contains
an #ifdef but no config option.
The stack problems in the kernel tend to not be in arch code, and if
we don't get i386 to always run with 4k stacks there's no chance that
cu
Adrian
--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
--
Something like nfsd-over-xfs-over-raid is (or was) the most common
cu
Adrian
--
On Sun, 20 Apr 2008 11:51:04 +0300 Specific cases yes, but such NAS devices have big processors and are not little embedded CPUs. On an embedded box you know at build time what it will be doing. --
The code in the kernel that gets the least coverage of all is our
error paths, and some vendor might try 4k stacks, validate that it works
in all use cases - and then it will blow up in some error condition he
didn't test.
6k is known to work, and there aren't many problems known with 4k.
And from a QA point of view the only way of getting 4k thoroughly tested
by users, and well also tested in -rc kernels for catching regressions
before they get into stable kernels, is if we get 4k stacks enabled
unconditionally on i386.
cu
Adrian
--
Which you won't fix by changing the x86 defaults. More of a problem in embedded small devices is the 8K allocation failing in the first place - At which point some distros will simply patch it back no doubt. --
Stuff like nfsd, xfs and raid is covered by the x86 defaults.
Red Hat seems to get usable kernels with 4k for some years?
If we get whatever is still missing for 4k working once and then the
coverage of all i386 -rc testers for noticing new issues immediately
there should be no stability reason for distros to patch it back in.
cu
Adrian
--
On Sun, 20 Apr 2008 14:54:55 +0300
You don't get to dictate to people however.
Alan
--
"If we become a great evil avaricious hegemony, I wanna cool uniform"
-- robk
--
Everyone is free to patch whatever stacksize he wants into his kernel.
But the more users will get 4k stacks the more testing we have, and the
better both existing and new bugs get shaken out.
And if there were only 4k stacks in the vanilla kernel, and therefore
all people on i386 testing -rc kernels would get it, that would give a
better chance of finding stack regressions before they get into a
stable kernel.
If a distribution or user then wants to increase it that's his choice
cu
Adrian
--
Heck, maybe you should make it 2k by default in all -rc kernels; that way when people run -final with the 4k it'll be 100% bulletproof, right? 'cause all those piggy drivers that blow a 2k stack will finally have to get fixed? Or leave it at 2k and find a way to share pages for stacks, think how much memory you could save and how many java threads you could run! 4K just happens to be the page size; other than that it's really just some random/magic number picked, and now dictated that if you (and everything around you) don't fit, you're broken. That bugs me. -Eric (yes, I know there are advantages to only allocating a single page for a new thread, but from an "all callchains after that must fit in that space" perspective, it's just a randomly picked number) --
I'm arguing for aiming at having all 32bit architectures with 4k page
size using the same stack size. Not for having -rc kernels differ from
The only architecture that already defaults to 4k stacks is m68knommu,
cu
Adrian
--
Oh, I know. I'm just saying that 4k seems chosen out of convenience for memory management, without any real correlation to what you might actually need to run a thread. They do happen to be roughly equivalent for many cases, but not all. Setting a default which is not safe for several common use cases does not seem wise... I guess what I'm saying is, I don't agree that any callchain which needs more than 4k of stack indicates brokenness that must be fixed, as various posts in this thread seem to suggest. Sure, 1k char buffers on the stack and massive structs and unlimited recursion we can agree on as things to fix, but complex/deep/stacked callchains which don't fit in 4k are much more of a grey area. -Eric --
On Sun, 20 Apr 2008 09:05:40 -0500 it wasn't randomly picked; it was based on 2.4 kernels (where we had 8kb, but that was roughly 2.5Kb or so for the task struct, yes. Adrian is waay off in the weeds on this one. Nobody but him is suggesting to remove 8Kb stacks. I think everyone else agrees that having both options is valuable; and there are better ways to find+fix stack bloat than removing this config option. -- If you want to reach me at my work email, use arjan@linux.intel.com For development, discussion and tips for power savings, visit http://www.lesswatts.org --
I'm not arguing for removing the option immediately, but long-term we
shouldn't need it.
This comes from my experience of removing obsolete drivers for hardware
for which also a more recent driver exists:
As long as there is some workaround (e.g. using an older driver or
8k stacks) the workaround will be used instead of getting proper
bug reports and fixes.
As far as I know all problems that are known with 4k stacks are some
nested things with XFS in the trace.
If this class of issues would get fixed one day, why would it be
valuable to also offer 8k stacks long-term? Especially weighed
against the fact that with only 4k stacks we will have more people
running into stack problems in -rc kernels if any new ones pop up,
resulting in getting more such problems fixed during -rc.
cu
Adrian
--
This "as far as I know" is a problem itself. Is it possible to implement (e.g., using some form of memory protection in hardware, but I am not an expert here) an option with 8k stacks that, however, spams the log if the actual usage goes above 4k, and have this as a default for some time? If 4k stacks are the goal that is almost achieved, then this debugging option should have zero impact on performance. -- Alexander E. Patrakov --
Shouldn't be hard. Use the 8k stack, and have the system mark the second page as "not present". If it ever gets used you get a page fault. The page fault handler then has to mark the page present before returning, as well as queue up some spam (the call chain perhaps) for the log. A less intrusive way is to use 8k stacks as-is, but put a signature in the second page. When the process quits, examine the second stack page to see if the signature got overwritten. This approach will only show that a problem exists; it won't pinpoint exactly what causes it. Helge Hafting --
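The signature idea can be sketched in user space on a plain byte buffer standing in for an 8k stack. This is only an illustration under assumptions: the names and the fill byte are invented here, the stack is assumed to grow downward from the end of the buffer, and a real kernel implementation would have to hook task creation and exit. A byte that happens to be overwritten with the same value as the signature goes unnoticed, so the result is an estimate.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define STACK_PAGE  4096u
#define STACK_MAGIC 0x5Au   /* hypothetical signature byte */

/* Fill the whole stack area with the signature before the task runs. */
static void stack_paint(unsigned char *stack, size_t size)
{
    assert(stack != NULL);
    memset(stack, STACK_MAGIC, size);
}

/* The stack grows down from stack + size toward stack[0]. Scan up from
 * the lowest address for the first byte that lost its signature;
 * everything from there to the top has been used at some point. */
static size_t stack_high_water(const unsigned char *stack, size_t size)
{
    size_t i = 0;

    while (i < size && stack[i] == STACK_MAGIC)
        i++;
    return size - i;
}

/* Nonzero if usage ever spilled out of the top page of the stack,
 * i.e. the task would not have fit into a 4k stack. */
static int stack_exceeded_4k(const unsigned char *stack, size_t size)
{
    return stack_high_water(stack, size) > STACK_PAGE;
}
```

As the message says, a check like this at exit time only reports that the second page was touched. It cannot say which call chain did it, which is what the page-fault variant would add.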
Some number has to be picked. Why fitting in 4k is "bad" and fitting in 8k is "not bad"? Look what happens when this number is too big: Windows is "generous", and as a result Windows drivers routinely need 12k, sometimes 16k of stack. We know it from ndiswrapper. We don't want to go that way, right? Forget about 50k threads. 4k of waste per process is a waste nevertheless. It's not at all unusual to have 250+ processes, and 250 processes with 8k stack each waste 1M. Do you think extra 1M won't be useful to have? It seems that 4k works for everybody sans xfs. Making it work took some effort, but it is already done. Why not use it after all? And since i386 is such a common architecture, other 32-bit arches will be relieved from the burden of hunting down stack overflows which happen only on those arches. (For example, different ABI or different gcc behavior may make $OTHER_ARCH slightly more stack-greedy). God knows non-mainstream arches have enough problems already. -- vda --
If the 1M gives you more reliability (and I think it does) I don't think it is "wasted". Would you trade occasional crashes for 1MB? I wouldn't. Also a typical process uses much more memory than just 4K. If it's not a thread it needs own page tables and from those alone you're easily into 10+ pages even for a quite small process. But even threads in practice have other overheads too if they actually do something. The 4K won't save or break you. [BTW if you're really interested in saving memory there are lots of other subsystems where you could very likely save more. A common example are the standard hash tables which are still too big] The trends are also against it: kernel code is getting more and more complex all the time with more and more complicated stacks of different subsystems on top of each other. It wouldn't surprise me if at some point 8KB isn't even enough anymore. Going into the other direction is definitely the wrong way. -Andi --
Because well-written code in several subsystems, used in combination in common configurations, does not always fit, that is why. Show me the "bug" in an nfs+xfs+md+scsi writeback stack oops and I'm sure it'll get "fixed." But if it's simply complex code that happens to need >4k, I will continue to argue that the limited stack size selection is the problem, not the code running in it. Perhaps not surprisingly, ext4, which is significantly more complex than ext3, has many more individual functions > 100 bytes than ext3 has. As others have said, there is no trend towards smaller, simpler, less interesting, and less functional code which fits in a smaller and smaller footprint in the general case. If someone has a workload and configuration which happens to fit in 4k then turn it on, test the heck out of it, and have fun. I've not seen what I consider to be a convincing argument for making it the default for everyone. -Eric --
Why does nfs+xfs+md+ide work? Does scsi intrinsically require more stack than ide? Why is xfs code said to be 5 times bigger than e.g. reiserfs? 8k stack is limited too. Other Operating System, no doubt in the name of better stability, has even larger stack (16k or more). For what it's worth, I do realize that there is a point of diminishing returns and increased pain when one tries to reduce stack usage. Conversely: "If someone is strongly concerned about possibility of stack overflow, then turn on 8k option, and enjoy the benefits of wide testing which is provided by millions of people who run 4k stacks. If _that_ works ok in practice, 8k _ought_ to be 100.00% safe versus stack overflow". These threads about 4k stack seem to degenerate into ping-ponging of these arguments again and again. -- vda --
Luck? With 4k stacks, you really don't need NFS at all - you just have to enter memory reclaim at the wrong time (i.e. when something else
If we cut the bulkstat code out, the handle interface, the preallocation, the journalled quota, the delayed allocation, all the runtime validation, the shutdown code, the debug code, the tracing
Writeback is done under ENOMEM pressure, and XFS can't provide the guarantees mempools need to work. That leaves the stack as the only place we can put the things we need. e.g. the args structures that tell the allocator what to do and retain state between subsequent low level allocation calls use ~250 bytes of stack just by themselves....
We've already chopped off the low hanging fruit, added noinline to every function definition to prevent compiler heuristics from blowing out stack usage by 25% and reduced use of temporary variables as much as possible. There's very little fat left to trim, and still we can't reliably fit in 4k stacks.
Patches are welcome - I'd be over the moon if any of the known 4k stack advocates sent a stack reduction patch for XFS, but it seems that actually trying to fix the problems is much harder than resending a one line patch every few months....
Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --
kmem_free() function takes (ptr, size) arguments but doesn't actually use second one. This patch removes size argument from all callsites.
Code size difference on 32-bit x86:
# size */fs/xfs/xfs.o
   text    data     bss     dec     hex filename
 391271    2748    1708  395727   609cf linux-2.6-xfs0-TEST/fs/xfs/xfs.o
 390739    2748    1708  395195   607bb linux-2.6-xfs1-TEST/fs/xfs/xfs.o
Compile-tested only.
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
-- vda
Hi David, xfs_flush_pages() does not use some of its parameters, namely: first, last and fiops. This patch removes these parameters from all callsites.
Code size difference on 32-bit x86:
   text    data     bss     dec     hex filename
 390739    2748    1708  395195   607bb linux-2.6-xfs1-TEST/fs/xfs/xfs.o
 390567    2748    1708  395023   6070f linux-2.6-xfs2-TEST/fs/xfs/xfs.o
Compile-tested only.
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
-- vda
Hi David, xfs_flush_pages() flags parameter is declared as uint64_t, but the code never passes values which do not fit into 32 bits. All callsites sans one pass zero, and the last one passes XFS_B_DELWRI, XFS_B_ASYNC or zero. These values are defined in enum xfs_buf_flags_t and they all fit in 32 bits. This patch changes the type of the parameter, and of one variable which is used to pass it, to unsigned int.
Code size difference on 32-bit x86:
# size */fs/xfs/xfs.o
   text    data     bss     dec     hex filename
 390567    2748    1708  395023   6070f linux-2.6-xfs2-TEST/fs/xfs/xfs.o
 390507    2748    1708  394963   606d3 linux-2.6-xfs3-TEST/fs/xfs/xfs.o
Compile-tested only.
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
-- vda
FWIW this one also seems to make no stack difference, at least on x86_64. Not complaining; just checking it out. :) If you can shrink xfs_bmapi, let me know. :) Thanks, -Eric --
FWIW, the path we care about is this path through ->writepage:

(submit_bio)
_xfs_buf_ioapply                      32
xfs_buf_iorequest                      0
xfs_buf_iostart                        0
xfs_buf_read_flags                     0
xfs_trans_read_buf                     4
xfs_btree_read_bufs                   16
xfs_alloc_lookup                      56
xfs_alloc_lookup_eq                   16
xfs_alloc_fixup_trees                 20
xfs_alloc_ag_vextent_near             76
xfs_alloc_ag_vextent                   0
xfs_alloc_vextent                     48
xfs_bmap_btalloc                     164
xfs_bmap_alloc                         0
xfs_bmapi                            228
xfs_iomap_write_allocate             116
xfs_iomap                             20
xfs_map_blocks                        16
xfs_page_state_convert               124
xfs_vm_writepage                      12
-------------------------------------
checkstack total:                    948

Realistically, the only things we can trim anything off are xfs_bmapi, xfs_bmap_btalloc, xfs_iomap_write_allocate, and xfs_page_state_convert. It's going to take a lot of work to get any significant change into those functions given the complexity of them....

FWIW, if we've come through a syscall, the rest of the trace looks like:

__writepage                            0
write_cache_pages                    100
generic_writepages                     0
xfs_vm_writepages                     12
do_writepages                          0
__writeback_single_inode              36
sync_sb_inodes                        40
writeback_inodes                       0
balance_dirty_pages_ratelimited_nr    76
generic_file_buffered_write           96
xfs_write                             80
xfs_file_aio_write                    12
do_sync_write                        140
vfs_write                             12
--------------------------------------------
total                                604

So the normal case uses 604 bytes prior to entering ->writepage. It's when we are already using >2k of the stack when we enter ->writepage that we get into trouble....
Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --
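The two checkstack listings above can be double-checked by summing the per-function frame sizes. The sketch below does that; the struct and helper names are invented for this note, while the numbers are copied from the listings. It only adds up what checkstack reports, so it says nothing about stack consumed below submit_bio.

```c
#include <assert.h>
#include <stddef.h>

struct frame {
    const char *func;
    unsigned bytes;
};

#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* ->writepage down to submit_bio, frame sizes as listed above. */
static const struct frame writepage_path[] = {
    { "_xfs_buf_ioapply", 32 },      { "xfs_buf_iorequest", 0 },
    { "xfs_buf_iostart", 0 },        { "xfs_buf_read_flags", 0 },
    { "xfs_trans_read_buf", 4 },     { "xfs_btree_read_bufs", 16 },
    { "xfs_alloc_lookup", 56 },      { "xfs_alloc_lookup_eq", 16 },
    { "xfs_alloc_fixup_trees", 20 }, { "xfs_alloc_ag_vextent_near", 76 },
    { "xfs_alloc_ag_vextent", 0 },   { "xfs_alloc_vextent", 48 },
    { "xfs_bmap_btalloc", 164 },     { "xfs_bmap_alloc", 0 },
    { "xfs_bmapi", 228 },            { "xfs_iomap_write_allocate", 116 },
    { "xfs_iomap", 20 },             { "xfs_map_blocks", 16 },
    { "xfs_page_state_convert", 124 }, { "xfs_vm_writepage", 12 },
};

/* vfs_write down to __writepage, frame sizes as listed above. */
static const struct frame syscall_path[] = {
    { "__writepage", 0 },            { "write_cache_pages", 100 },
    { "generic_writepages", 0 },     { "xfs_vm_writepages", 12 },
    { "do_writepages", 0 },          { "__writeback_single_inode", 36 },
    { "sync_sb_inodes", 40 },        { "writeback_inodes", 0 },
    { "balance_dirty_pages_ratelimited_nr", 76 },
    { "generic_file_buffered_write", 96 },
    { "xfs_write", 80 },             { "xfs_file_aio_write", 12 },
    { "do_sync_write", 140 },        { "vfs_write", 12 },
};

/* Sum the frame sizes of one call chain. */
static unsigned path_total(const struct frame *path, size_t n)
{
    unsigned total = 0;
    size_t i;

    assert(path != NULL || n == 0);
    for (i = 0; i < n; i++)
        total += path[i].bytes;
    return total;
}
```

The sums come out at 948 and 604 bytes, matching the totals in the message, for 1552 bytes across the whole syscall-to-submit_bio chain.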
Hi David, xfs_qm_dqpurge() does not use flags parameter. This patch removes it.
Code size difference on 32-bit x86:
# size */fs/xfs/xfs.o
   text    data     bss     dec     hex filename
 390507    2748    1708  394963   606d3 linux-2.6-xfs3-TEST/fs/xfs/xfs.o
 390491    2748    1708  394947   606c3 linux-2.6-xfs4-TEST/fs/xfs/xfs.o
Compile-tested only.
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
-- vda
Hi David, xfs_iomap_write_allocate() does not use count parameter. This patch removes it.
Code size difference on 32-bit x86:
# size */fs/xfs/xfs.o
 393457    2904    2952  399313   617d1 linux-2.6-xfs4-TEST/fs/xfs/xfs.o
 393441    2904    2952  399297   617c1 linux-2.6-xfs5-TEST/fs/xfs/xfs.o
Compile tested only.
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
-- vda
Hi David, xfs_bmap_add_free and xfs_btree_read_bufl functions use some of their parameters only in some cases (e.g. if DEBUG is defined, or on non-Linux OS :) This patch removes these parameters using #define hack which makes them "disappear" without the need of uglifying every callsite with #ifdefs. Code size difference on 32-bit x86: 393457 2904 2952 399313 617d1 linux-2.6-xfs6-TEST/fs/xfs/xfs.o 393441 2904 2952 399297 617c1 linux-2.6-xfs7-TEST/fs/xfs/xfs.o Compile tested only. Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com> -- vda
Hi David, Seven xfs_trans_XXX functions declared in xfs_trans.h are not using "tp" parameter in non-debug builds, but it still takes stack space since these functions are not static and gcc cannot optimize it out. This patch removes these parameters using #define hack which makes them "disappear" without the need of uglifying every callsite with #ifdefs. Code size difference on 32-bit x86: 393441 2904 2952 399297 617c1 linux-2.6-xfs7-TEST/fs/xfs/xfs.o 393289 2904 2952 399145 61729 linux-2.6-xfs8-TEST/fs/xfs/xfs.o Compile tested only. Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com> -- vda --
[ resend: now with patch attached! :) ]
Hi David, Seven xfs_trans_XXX functions declared in xfs_trans.h are not using "tp" parameter in non-debug builds, but it still takes stack space since these functions are not static and gcc cannot optimize it out. This patch removes these parameters using #define hack which makes them "disappear" without the need of uglifying every callsite with #ifdefs.
Code size difference on 32-bit x86:
 393441    2904    2952  399297   617c1 linux-2.6-xfs7-TEST/fs/xfs/xfs.o
 393289    2904    2952  399145   61729 linux-2.6-xfs8-TEST/fs/xfs/xfs.o
Compile tested only.
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
-- vda
Hi David, Inline functions xfs_dir2_dataptr_to_byte and xfs_dir2_byte_to_dataptr are not using their 1st argument. gcc is able to optimize that out. I still want to delete these parameters, as they serve no useful purpose and by removing them I can make gcc to notice some additional unused variables in the callers of these inlines, and warn me about that. There is no object code size difference from this change. Compile tested only. Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com> -- vda
Hi David, This patch deals with remaining cases of unused parameters in fs/xfs/quota/* as far as I can see so far. The rest of unused parameters in fs/xfs/quota/* cannot be easily eliminated due to addresses of functions being taken. Code size difference on 32-bit x86: 393289 2904 2952 399145 61729 linux-2.6-xfs8-TEST/fs/xfs/xfs.o 393236 2904 2952 399092 616f4 linux-2.6-xfs9-TEST/fs/xfs/xfs.o Compile tested only. Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com> -- vda
Hi David, Inline function xfs_put_perag() in fs/xfs/xfs_mount.h is a no-op. This patch converts it to no-op macro. As a result, gcc will emit warning about unused variables, parameters and so on not in this function, but in its callers, which is more useful. This patch, together with previous ones, has already resulted in more unused params discovered and warned about by gcc. There is no object code size difference from this change. Compile tested only. Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com> -- vda
Denys, thanks for going through all this; I didn't mean to discount the work with the stackcheck reports. I've done a lot of similar xfs pruning in the past, and every little bit helps. It is still hard to find significant reductions in the critical callchains though! If the xfs codebase gets to the point where things are fairly well cleaned up it might be nice to add the gcc warning to the makefiles, add unused attributes to the vfs ops vectors as needed, and keep it clean from this point on... Thanks, -Eric --
xfs_put_perag() is paired with xfs_get_perag() and should never be called by itself. It is a stub for AG reference counting the in-memory per-ag structures and, in future, locking to allow us to avoid certain deadlocks that can occur (rarely) when growing and shrinking the filesystem. Also, I've got patches that put stuff in this function, so I'd prefer to leave it as it is right now... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --
I'd just kill the parameters to xfs_qm_hold_quotafs_ref and xfs_qm_rele_quotafs_ref and I wouldn't worry about removing the debug-only id parameter to xfs_qm_dqread as it's not in a stack critical path. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --
Same as my last comments - I don't think the savings are worth the additional clutter it introduces. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --
Elimination of completely unused parameters makes sense, but IMHO using
such #define hacks for minuscule code size and stack usage advantages is
not worth it.
cu
Adrian
--
In busybox this trick is used extensively. I don't know how to eliminate these unused parameters with less intervention, but I also don't want to leave it unfixed. I want to eventually reach the state with no warnings about unused parameters. -- vda --
Busybox does not have more than one million lines changed in
one release.
In the Linux kernel maintainability is much more important than in
The standard kernel pattern is using empty static inline functions (that
allow type checking).
And I'm not sure whether the number of functions you'd have to change
cu
Adrian
--
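To make the two approaches under discussion concrete, here is a minimal sketch of a "#define hack" next to a static inline wrapper. All names and types are hypothetical stand-ins and are not taken from the XFS patches; the point is only that the macro variant never evaluates or type-checks the dropped argument, while the inline variant still does.

```c
#include <assert.h>

struct trans { int id; };    /* hypothetical stand-in for a transaction */
struct buf   { int held; };  /* hypothetical stand-in for a buffer */

/* Worker that only needs the buffer. */
static void buf_hold(struct buf *bp)
{
    assert(bp);
    bp->held = 1;
}

/* Variant 1, the "#define hack": the macro silently discards tp.
 * Call sites stay unchanged, but tp is neither evaluated nor
 * type-checked, so any expression would compile in its place. */
#define trans_bhold_macro(tp, bp) buf_hold(bp)

/* Variant 2, a static inline wrapper: tp is still type-checked, and an
 * optimizing compiler can drop it once the wrapper is inlined. */
static inline void trans_bhold_inline(struct trans *tp, struct buf *bp)
{
    (void)tp;
    buf_hold(bp);
}
```

Whether the inline form actually saves stack depends on the compiler inlining the wrapper, which is an assumption here rather than something measured in this thread.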
It would be a huge undertaking. Just building xfs w/ the warning in place exposes tons of unused parameter warnings from outside xfs as well. But, if it was deemed important enough, you could go annotate them as unused, I suppose, and hack away at it... Does marking as unused just shut up the warning or does it let gcc do further optimizations? -Eric --
Eh... I meant "no warnings about unused parameters" for fs/xfs/* only, not for the entire kernel. I filter out other warnings. I want to do it not as an exercise in perfectionism, but as means of making sure we do not waste stack passing useless parameters, which is important for xfs. -- vda --
That's not really maintainable, and the stack gains are too small for
cu
Adrian
--
Why? Adding -Wunused -Wunused-parameter in fs/xfs/Makefile:
    EXTRA_CFLAGS += -I$(src) -I$(src)/linux-2.6 -funsigned-char
    #EXTRA_CFLAGS += -Wunused -Wunused-parameter
and making a test build with it uncommented once in a while will reveal a bit of fallout, which is then fixed. busybox source is thrice as big as xfs source and from the experience I'd say it's not difficult
I promise to take a look at the critical (wrt stack use) path next.
-- vda --
The problem isn't in the Makefile, the problem is the ugly #ifdef's in
the code.
And for getting the stack problems fixed the effect is anyway by two
cu
Adrian
--
It just shuts up the warning. It is still useful - suppresses false positives. I didn't check whether gcc is clever enough to reuse stack space occupied by unused parameter(s) as a free space for automatic variables. In theory it is allowed to do that and reduce stack usage that way. -- vda --
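For parameters that cannot be deleted because the function's address is stored in an ops table, the annotation route looks roughly like the sketch below. The callback type, function and table are hypothetical; only __attribute__((unused)) itself is the real GCC/Clang attribute being discussed. It tells the compiler the parameter may be unused, which silences -Wunused-parameter for it.

```c
#include <assert.h>

/* Callback signature fixed by an ops table, so the flags parameter
 * has to stay even if one implementation ignores it. */
typedef int (*flush_fn)(void *ctx, unsigned int flags);

static int flush_ignore_flags(void *ctx,
                              unsigned int flags __attribute__((unused)))
{
    int *counter = ctx;

    assert(ctx);
    return ++*counter;      /* count how often we were called */
}

/* The function's address is taken here, which is why the compiler
 * cannot simply drop the parameter from the signature. */
static const flush_fn demo_ops[] = { flush_ignore_flags };
```

Calling demo_ops[0](&counter, 0) still passes the flags argument at run time; the attribute only affects diagnostics in this sketch.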
We don't use pre-processor hacks to hide function variables for different config options. The XFS header files are messy enough without adding additional redefinitions of function types to them. w.r.t xfs_bmap_add_free(), the correct thing to do is to factor the debug code out into a different function that is only compiled on debug kernels and remove all the debug checks from xfs_bmap_add_free(). As it is, I don't think that the change is worth the maintenance cost for a few bytes of stack space in non-critical paths. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --
Hmmm - I'm wondering if that is actually a bug. Certainly the code is in conflict with the comment for the function, and it points out that I could have fixed a recent bug in a better way. I'm going to hold off this one until I've had time to look at this in more detail.... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --
Ok. Will test. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --
FYI: if you want to submit xfs patches it makes a lot of sense to send them to the xfs list.. --
Can you fold this into the previous patch that kills fiopt to this function? Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --
FWIW this one actually does not seem to reduce stack usage anywhere. -Eric --
I hope this will not deteriorate into a contest whether every particular patch reduces stack usage or not, but: You do not see reduced stack usage in "make checkstack", because "make checkstack" shows only stack usage caused by local variables (it analyses sub %esp,NN instructions which make room for them). Parameters also take up stack, but they are pushed on stack with push instruction, and so are invisible in "make checkstack" output. -- vda --
That on i?86 actually depends on whether -maccumulate-outgoing-args is on or off (the default is off for -Os and most pre-i686 tunings, and on for i686 and most post-i686 tunings when not -Os). Jakub --
I trust you know it better than I. I removed a few parameters of non-static, non-inline function. Since at call site gcc has no way of knowing that these parameters will not be used by callee, and the function is not regparm (explicitly or implicitly by being static), I am fairly sure gcc is putting these parameters on stack. "make checkstack" doesn't see any difference. It can only mean that "make checkstack" does not account for stack space taken by parameters, not that there is no difference in stack usage after this change. That is simply not possible IMO. -- vda --
Sorry if you took it that way; since the patch was in response to Dave's mention of accepting stack-reducing patches, I thought it was worth checking and highlighting whether it seemed to help. It wasn't supposed Hm, I had assumed that the %esp subtraction also made room for the arguments pushed onto the stack. Is there no way to analyze that part? Thanks, -Eric --
These were never removed because they are place holders for stuff that Linux didn't support when the original port was done. Now Linux supports range flushes, these functions should be changed to do that, and hence the first/last parameters will be used. But the fiopt flag can probably be killed.... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --
I didn't expect it to but this does reduce a few things slightly. On x86_64:
-xfs_attr_leaf_list_int 200
+xfs_attr_leaf_list_int 184
-xfs_dir2_sf_to_block 136
+xfs_dir2_sf_to_block 120
-xfs_ifree_cluster 136
+xfs_ifree_cluster 120
-xfs_inumbers 184
+xfs_inumbers 168
-xfs_mount_free 24
Thanks, -Eric --
And on x86, just for the record (fedora 9 config in both cases...)
-xfs_attr_leaf_inactive 36
+xfs_attr_leaf_inactive 32
-xfs_attr_shortform_list 40
+xfs_attr_shortform_list 36
-xfs_da_grow_inode 96
+xfs_da_grow_inode 92
-xfs_dir2_grow_inode 116
+xfs_dir2_grow_inode 104
-xfs_dir2_leaf_getdents 176
+xfs_dir2_leaf_getdents 172
-xfs_dir2_sf_to_block 92
+xfs_dir2_sf_to_block 88
-xfs_ifree_cluster 108
+xfs_ifree_cluster 104
-xfs_inumbers 88
+xfs_inumbers 84
-xfs_lock_inodes 24
+xfs_lock_inodes 28
-Eric --
Ack. Pulled into my qa tree. FWIW, can you send patches in line next time? It makes it easier to quote them on review.... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --
And yet, I got four screenfuls of fs/xfs/XXXXX.c: warning: unused parameter 'foo' when I added -Wunused-parameter to Makefile. Sent a few. I would like to ask you to ACK/NAK every individual patch in some reasonable period of time, say, 1-3 days. If you NAK a patch, please let me know what is wrong with it. I am not eager at all to experience a repeat of aic7xxx patch saga, when I was not getting any meaningful reply for months. Best regards, Denys. -- vda --
I know the feeling of resending patches again and again without any
reaction quite well, but that's not David's fault and not true for XFS
patches, so when you try to put pressure on him you hit the wrong
cu
Adrian
--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
--
Just noticed this bit of FUD. Last time I did some static analysis on stack usage, reiserfs alone would blow away 3k, while xfs was somewhere below. Reiserfs was improved afaik, but I'd still expect it to be worse than xfs until shown otherwise. Maybe reiserfs simply isn't used that much in nfs+*fs+md+whatnot+scsi setups? Jörn -- Courage is not the absence of fear, but rather the judgement that something else is more important than fear. -- Ambrose Redmoon --
I'm sorry, but it's not what I said. I didn't say reiserfs eats less stack. I don't know. I said it is smaller. reiserfs/* 821474 bytes xfs/* 3019689 bytes -- vda --
FWIW, the reason for that is in large part all the features Dave listed above, and probably more. And, while certainly not yet tiny, the recent trend actually is that xfs is getting a bit smaller: http://oss.sgi.com/~sandeen/xfs-linedata.png (note, though - the Y axis does not start at 0) :) -Eric --
One way they do that is by marking significant parts of the kernel unsupported. I don't think that's an option for mainline. -Andi --
But you have to first ask why do you want 4k tested? Does it serve any useful purpose in itself? I don't think so. Or you're saying it's important to support 50k kernel threads on 32bit kernels? -Andi --
cu
Adrian
--
Clearly if I have the choice between a kernel which can run 50k threads and a kernel which does not crash under me during an I/O error, I choose the latter! I don't even imagine what purpose 50k kernel threads may serve. I certainly can understand that reducing memory footprint is useful, but if we want wider testing of 4k stacks, considering they may fail in error path in complex I/O environment, it's not likely during -rc kernels that we'll detect problems, and if we push them down the throat of users in a stable release, of course they will thank us very much for crashing their NFS servers in production during peak hours. I have nothing against changing the default setting to 4k provided that it is easy to get back to the safe setting (ie changing a config option, or better, a cmdline parameter). I just don't agree with the idea of forcing users to swim in the sh*t, it only brings bad reputation to Linux. What would really help would be to have 8k stacks with the lower page causing a fault and printing a stack trace upon first access. That way, the safe setting would still report us useful information without putting users into trouble. Willy --
I don't know either but it was quoted to me earlier as the primary So you're saying that only advanced users who understand all their CONFIG options should have the safe settings? And everyone else the "only explodes once a week" mode? For me that is exactly the wrong way around. If someone is sure they know what they're doing they can set whatever crazy settings they want (given there is a quick way to check for the crazy settings in oops reports so that I can ignore those), but the default should be always safe and optimized for reliability. -Andi --
That means we'll have nearly zero testing of the "crazy setting" and
when someone tries it he'll have a high probability of running into some
problems.
Such a "crazy setting" shouldn't be offered to users at all.
We should either aim at 4k stacks unconditionally for all 32bit
architectures with 4k page size or don't allow any architecture
cu
Adrian
--
I agree you make a valid point here. Then wouldn't it be easier to simply remove 4k and agree it was a wet dream ? Willy --
If the sh maintainer and the m68knommu maintainer (and perhaps
cu
Adrian
--
I have suggested before that the solution is to allocate memory in "stack size" units (obviously must be a multiple of the hardware page size). The reason allocation fails is more often fragmentation than actual lack of memory, or so it has been reported. -- Bill Davidsen <davidsen@tmr.com> "We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot --
I've seen many bugs in error paths in the kernel and fixed quite a
few of them - and stack problems were not a significant part of them.
There are so many possible bugs (that also occur in practice) that
What actually brings bad reputation is shipping a 4k option that is
known to break under some circumstances.
And history has shown that as long as 8k stacks are available on i386
some problems will not get fixed. 4k stacks have been available as an option
on i386 for more than 4 years, and for about as long we have known that there
are some setups (AFAIK all that might still be present seem to include
XFS) that are known to not work reliably with 4k stacks.
If we go after stability and reputation, we have to make a decision
whether we want to get 4k stacks on 32bit architectures with 4k page
size unconditionally or not at all. That's the way that gets the maximal
number of bugs shaken out [1] for all supported configurations before
cu
Adrian
[1] obviously not all, but that's true for all classes of bugs
--
How about making 4k stacks incompatible with those circumstances then? I.e. if you select 4k stacks, then you can't select XFS because we know that _may_ fail. Similar for ndiswrapper networking, and other stuff where problems have been noticed. Some people don't need any of these, and can then use safe 4k stacks. Well, at least as safe as the 8k stacks are, there is no mathematical proof for their safety in all cases either. Helge Hafting --
Yeah, that means every distro that supports XFS (i.e. pretty much all of them including Fedora) will be forced to disable 4k stacks on x86. I'd be happy with this solution. FWIW, this would make 4k stacks pretty much unused outside of custom kernels. At which point I'd suggest a default of 4k is wrong.... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --
Problem is, it's the storage configuration (at administration time, not kernel build time) that matters, too. I have XFS on Fedora with 4k stacks on SATA /dev/sdb1 on my x86 mythbox, and it's perfectly fine. But that's a nice, simple setup. If I stacked more things over/under it, I'd be more likely to have trouble. -Eric --
A good argument for keeping the default 8k and letting people who know what they are doing, or think they do, test their system for 4k operation. Embedded systems typically have far better defined loads than servers or desktops, and are less likely to have different behavior change the stack requirements. That doesn't mean they do less, just that the load is usually better characterized. Vendors shipping a 4k stack kernel are probably not going to be happy if someone nfs exports an xfs filesystem on lvm, running on md raid0 composed of raid5 arrays, containing multipath, iSCSI, SATA and nbd devices. No, I didn't make that up, someone asked me what I thought their problem was with that setup. The kernel is getting more complex, and I don't think that anyone but you is interested in making 4k stacks mandatory, or in eliminating them, either. You frequently take the attitude that something you don't like (like all the old but WORKING network drivers) should be removed from the kernel, so that people will be forced to use the new whatever and find bugs so they can be fixed. Unfortunately in some cases the bugs are never fixed and Linux loses a capability it once had. The arbitrary 4k limit requires a lot of work on dropping stack usage even more than has already been done, and is mostly an effort you want other people to make so you can be happy (I assume that if you were offering to do it all yourself you already would have), and most importantly it would waste a lot of developer effort on a low return goal, which could be used on useful new features or fixing corner case bugs. Or drinking beer... Hell, it wastes your time arguing about it, and you do lots of useful things when you're not trying to force your minimalist philosophy on people. -- Bill Davidsen <davidsen@tmr.com> "We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot --
.. That's the best suggestion from this thread, by far! Can you produce a patch for 2.6.26 for this? Or perhaps someone else here, with the right code familiarity, could? Some sort of CONFIG option would likely be wanted to either enable/disable this feature, of course. Cheers --
If we want to migrate to 4k sooner or later, this behaviour would not need a config option, maybe just a /proc or /sys tunable to disable the warning. Config would be either (4k + risk of crash) or (8k + warning). The *real* issue is to decide whether we need/want 4k or not, because I think we're still discussing the subject for no reason, as usual... Willy --
Only if you believe that 4K stack pages are a worthy goal. As far as I can figure out they are not. They might have been a worthy goal on crappy 2.4 VMs, but these times are long gone. The "saving memory on embedded" argument also does not quite convince me, it is unclear if that is really a significant amount of memory on these systems and if that couldn't be addressed better (e.g. in running generally fewer kernel threads). I don't have numbers on this, but then the people who made this argument didn't have any either :) If anybody has concrete statistics on this (including other kernel memory users in realistic situations) The problem with his suggestion is that the lower 4K of the stack page are accessed in normal operation too because it contains the thread_struct. That could be changed, but it would be a relatively large change because you would need to audit/change a lot of code that assumes thread_struct and stack are contiguous. If that was changed, implementing Willy's suggestion would not be that difficult using cpa(), at the cost of some general slowdown from increased TLB misses and much higher thread creation/tear down cost etc. Using the alternative vmalloc way has also other issues. But still the fundamental problem is that it would likely only hit the interesting cases in real production setups and I don't think the production users would be very happy to slow down their kernels and handle strange backtraces just to act as guinea pigs for something dubious -Andi --
It is not uncommon for embedded systems to be designed around 16MiB. Some may even have less, although I haven't encountered any of those lately. When dealing in those dimensions, savings of 100k are substantial. In some cases they may be the difference between 16MiB or 32MiB, which translates to manufacturing costs. In others it simply means that the system can cache a bit more and run faster, or it can have a little more functionality. In most cases it simply allows userspace programmers to avoid looking harder to save those 100k, as they are already saved in kernel space. Therefore we made life hard for us in order to make life easier for someone else, saving them time and money. Whether that is worth it depends on your personal point of view. Many embedded people will claim "Hell yes!" Of those that don't, most are simply ignoring currently mainline kernels and will regret the development later. They care, they just don't tend to care enough to engage in these discussions or even know about them. :( Jörn -- Eighty percent of success is showing up. -- Woody Allen --
But these are SoC systems. Do they really run x86? (note we're talking about an x86 default option here) Also I suspect in a true 16MB system you have to strip down everything kernel side so much that you're pretty much outside If you need the stack you don't have any less cache foot print. If you don't need it you don't have any either. -Andi --
Maybe. I merely showed that embedded people (not me) have good reasons to care about small stacks. Whether they care enough to actually spend This part I don't understand. Jörn -- You ain't got no problem, Jules. I'm on the motherfucker. Go back in there, chill them niggers out and wait for the Wolf, who should be coming directly. -- Marsellus Wallace --
Sure but I don't think they're x86 embedded people. Right now there are very little x86 SOCs if any (iirc there is only some obscure rise core) and future SOCs will likely have more RAM. Anyways I don't have a problem to give these people any special options they need to do whatever they want. I just object to changing the default options on important architectures to force people in completely different setups to do part of their testing. I was just objecting to your claim that small stack implies smaller cache foot print. Smaller stacks rarely give you smaller cache foot print in my kernel coding experience: First some stack is always safety and in practice unused. It won't be in cache. Then typically standard kernel stack pigs are just too large buffers on the stack which are not fully used. These also don't have much cache foot print. Or if you have a complicated call stack the typical fix is to move parts of it into another thread. But that doesn't give you less cache footprint because the cache foot print is just in someone else's stack. In fact you'll likely have slightly more cache foot print from that due to the context of the other thread. In theory if you e.g. convert a recursive algorithm to iterative you might save some cache foot print, but I don't think that really happens in kernel code. -Andi --
On Sun, 20 Apr 2008 20:19:30 +0200 this is what Al did for the symlink recursion thing, and Jens did for the block layer... so yes this conversion does happen for real. -- If you want to reach me at my work email, use arjan@linux.intel.com For development, discussion and tips for power savings, visit http://www.lesswatts.org --
Congratulations, you found three examples in 8.4MLOC. Ok ok I should have said it only happens very rarely (I still stand by that :) Anyways it is moot because it was a miscommunication between me and Joerg. -Andi --
had we done the de-obfuscate-4K-stacks Kconfig change earlier it might have gotten upstream faster. Ingo --
The cache I referred to is called DRAM, not L1. Jörn -- Don't worry about people stealing your ideas. If your ideas are any good, you'll have to ram them down people's throats. -- Howard Aiken quoted by Ken Iverson quoted by Jim Horning quoted by Raph Levien, 1979 --
Ah, ok. The question whether 4k stacks should become the default I prefer not touching with an 80' pole. Jörn -- Why do musicians compose symphonies and poets write poems? They do it because life wouldn't have any meaning for them if they didn't. That's why I draw cartoons. It's my life. -- Charles Shultz --
Changing the default warning threshold is easy, it's just a #define. Although setting it too low would spam syslogs on some setups. When I was trying to cram stuff into 4k in the past, I had a patch which added a sysctl to dynamically change the warning threshold, and optionally BUG() when I hit it for crash analysis. It was good for debugging, at least. If something along those lines is desired, I could resurrect it. -Eric --
I thought it was checked only at a few places (eg: during irqs). If so, we should set it slightly below the 4k limit if we want users to switch While it's good for debugging, having users tweak the limit to eliminate the warning is the opposite of what we're looking for. We just want to have them report the warning without their service being disrupted. Willy --
Ah, ok I skimmed your first suggestion too quickly. 100% coverage reports on the initial access to the 2nd 4k that way would be nice. Well, it would be nice if we all really wanted 4k stacks some day... :) -Eric --
Andi, you're the only one I've seen seriously pounding the "50k threads" thing - I don't think anyone is really fooled by the straw-man, so I'd suggest you drop it. The real issue is that you think (and are correct in thinking) that people are idiots. Yes, there will be breakages if the default is changed to 4k stacks - but if people are running new kernels on boxes that'll hit stack use problems (that *AREN'T* related to ndiswrapper) and haven't made sure that they've configured the kernel properly, then they deserve the outcome. It isn't the job of the Linux Kernel to protect the incompetent - nor is it the job of linux kernel developers to do such. If people are doing a "zcat /proc/config.gz > .config && make oldconfig" (or similar) the problem shouldn't even appear, really. They'll get whatever setting was in their old config for the stack size. And until the problems with deep-stack setups - like nfs+xfs+raid - get resolved I'd think that the option to configure the stack size would remain. Since the second-most-common reason for stack overages is ndiswrapper... Well, with there being so much more hardware now supported directly by the linux kernel... I'm stunned every time someone tells me "I can't run Linux on my laptop, there is hardware that isn't supported without me having to get ndiswrapper". The last time someone said that to me I pointed to the fact that their hardware is supported by the latest kernel and even offered to build&install it for them. DRH -- Dialup is like pissing through a pipette. Slow and excruciatingly painful. --
Ok, perhaps we can settle this properly. Like historians. We study the
original sources.
The primary resource is the original commit adding the 4k stack code.
You cannot find this in latest git because it predates 2.6.12, but it is
available in one of the historic trees imported from BitKeeper like
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
commit 95f238eac82907c4ccbc301cd5788e67db0715ce
Author: Andrew Morton <akpm@osdl.org>
Date: Sun Apr 11 23:18:43 2004 -0700
[PATCH] ia32: 4Kb stacks (and irqstacks) patch
From: Arjan van de Ven <arjanv@redhat.com>
Below is a patch to enable 4Kb stacks for x86. The goal of this is to
1) Reduce footprint per thread so that systems can run many more threads
(for the java people)
2) Reduce the pressure on the VM for order > 0 allocations. We see real life
workloads (granted with 2.4 but the fundamental fragmentation issue isn't
solved in 2.6 and isn't solvable in theory) where this can be a problem.
In addition order > 0 allocations can make the VM "stutter" and give more
latency due to having to do much much more work trying to defragment
...
<<
This gives us two reasons as you can see, one of them many threads
and another mostly only relevant to 2.4
Now I was also assuming that nobody took (1) really serious and
attacked (2) in earlier thread; in particular in
Actually the real reason the 4K stacks were introduced IIRC was that
the VM is not very good at allocation of order > 0 pages and that only
using order 0 and not order 1 in normal operation prevented some stalls.
This rationale also goes back to 2.4 (especially some of the early 2.4
VMs were not very good) and the 2.6 VM is generally better and on
x86-64 I don't see much evidence that these stalls are a big problem
(but then x86-64 also has more lowmem).
<<
This was corrected by Ingo who was one of the primary authors of the patch:
no, the primary motivation Arjan and me ...
On Sun, 20 Apr 2008 19:26:10 +0200 I'm sorry but I really hope nobody shares your assumption here. These are real customer workloads; java based "many things going on" at a time showed several thousands of threads in the system (a dozen or two per request, multiplied by the number of outstanding connections) for *real customers*. yes you did attack. But let's please use more friendly conversation here than words like "attack". This is not a war, and we really shouldn't be hostile in this forum, neither What you didn't atta^Waddress was the observation that fragmentation is fundamentally unsolvable. Yes 2.4 sucked a lot more than 2.6 does. But even 2.6 will (and does) have fragmentation issues. I'm sorry but I fail to entirely understand where your "So" or the rest of your conclusion comes from in terms of "both the authors". Which part of "fewer threads" and "8kb versus fragmentation" did you misunderstand to get to your conclusion? -- If you want to reach me at my work email, use arjan@linux.intel.com For development, discussion and tips for power savings, visit http://www.lesswatts.org --
Several thousands or 50k? Several thousands sounds large, but not entirely unreasonable, No I don't take 50k threads on 32bit seriously. And I hope you do not either. Why I don't take it seriously: on 32bit 50k threads will lead to lowmem exhaustion if the threads are actually doing something (like keeping select pages around or similar and having some thread local data). You'll easily be at 16-32K/thread and that is already far beyond the lowmem available on any 3:1 split 32bit kernel, likely even beyond 2:2. Even with 3:1 it could be tight. So you can say about customer workloads what you want, but you'll have a hard time convincing me they really run 50k threads doing something on 32bit. Now if we take the real realistic overhead of a thread into account 4k more or less doesn't really matter all that much and the decreased safety from the 4k stack starts to look Ok what word would you prefer? There is no war involved right, just a technical argument. I previously always assumed that "attacking" was a standard term in discussions, but if you don't like I can switch to another one. Regarding war like terminology: I used to think that people who commonly talk about "nuking code" went a little too far, but at some point I don't see any evidence that there are serious order 1 fragmentation issues on 2.6. If you have any please post it. -Andi --
At 12 threads per request it'd only take about 4200 outstanding requests. That is high, but I can see it happening. At 24 threads per request the number of outstanding requests it takes to reach that is cut in half, to about 2100. That number is more realistic. Since all outstanding requests aren't going to be at the extremes, let us assume that it's a mid-point between the two for the number of outstanding requests - say somewhere around 3150 outstanding requests. While that is a rather high number, if a company - a decently sized one - is using a piece of Java code internally for some reason they could easily have that level of requests coming in from the users. For a website with a decent load that routes a common request to the machine running the code it'd be even easier to hit that limit. So yes, 50K threads *IS* actually pretty easy Just makes you sound foolish. Run the numbers yourself and you'll see that it is easy for a machine running highly threaded code to easily hit 50K threads. Due to me screwing up the configuration of Apache (2) and MySQL I have seen a machine I own hit problems with memory fragmentation - and it's running a 2.6 series kernel (a distro 2.6.17) Because I was able to see that it was a problem I caused I didn't even *THINK* about posting information about it to LKML. I didn't keep the logs of that around - it happened more than three months ago and I clean the logs out every three months or so. DRH -- Dialup is like pissing through a pipette. Slow and excruciatingly painful. --
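The request counts quoted in the message above follow from dividing the 50k thread figure by the threads-per-request range ("a dozen or two") mentioned earlier in the thread. A minimal sketch of that division, with all names chosen here purely for illustration:

```python
# Thread count under discussion in the thread.
TARGET_THREADS = 50_000


def outstanding_requests(threads_per_request, target=TARGET_THREADS):
    """Outstanding requests needed to reach `target` threads."""
    return target / threads_per_request


high = outstanding_requests(12)   # a dozen threads per request
low = outstanding_requests(24)    # two dozen threads per request
midpoint = (low + high) / 2

print(f"12 threads/request: about {high:.0f} requests")
print(f"24 threads/request: about {low:.0f} requests")
print(f"midpoint:           about {midpoint:.0f} requests")
```

The message rounds these to roughly 4200, 2100 and 3150.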
I ran the numbers and the numbers showed that you need > 1.5GB of lowmem with a somewhat realistic scenario (32K per thread) at 50k threads. And subtracting 4k from that 32k number won't make any significant difference (still 1.3GB) If you claim that works on a 32bit system with typically 300-600MB lowmem available (which is also shared by other subsystem) I know who sounds foolish. -Andi --
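Andi's figures above can be reproduced with a few lines of arithmetic. This sketch assumes binary units (1K = 1024 bytes, 1GB = 1024^3 bytes), which is what makes the numbers line up; the 32K per-thread cost is his estimate, not a measured value, and the function name is made up here.

```python
KIB = 1024
GIB = 1024 ** 3


def lowmem_needed_gib(threads, per_thread_kib):
    """Total pinned lowmem, in GiB, for `threads` blocked threads."""
    return threads * per_thread_kib * KIB / GIB


with_8k_stacks = lowmem_needed_gib(50_000, 32)      # estimate from the message
with_4k_stacks = lowmem_needed_gib(50_000, 32 - 4)  # same, with 4K less stack

print(f"50k threads at 32K each: {with_8k_stacks:.2f} GiB")
print(f"50k threads at 28K each: {with_4k_stacks:.2f} GiB")
```

Both totals are well above the 300-600MB of lowmem he cites as typically available.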
A question along this line. Why is the userspace thread bound to a kernel-space stack at all? I could imagine a solution like stack pools, with a stack assigned only if a thread enters kernel space, or something like this? Gruss Bernd --
The vast majority of threads are sleeping (with a stack footprint in the kernel). If you have an N-way system, at most N threads can be in userspace at any given moment. You could multiplex several userspace threads on one kernel thread (the M:N model), but it gets fairly complex. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
No, it won't. Which is what I was pointing out. You're hitting a different Never said it worked on a 32bit system. I was pointing out that there can be workloads that do reach that 50K thread-count that you seem to be calling "stupid". As I pointed out later in the message, I *HAVE* run into lowmem starvation on a 32bit x86 system. You thoughtfully removed this, perhaps because you felt it damaged your argument. The machine in question is an old P3 box with less than 1G of memory in it. (Phys+Swap on that machine is only about 1.4G) So yes, on a 32bit machine you run into problems at much, much less of a workload and a much lower thread-count than the magic 50K you are so fond of talking about. If I had been running 4K stacks on that machine I probably would have survived the mis-configuration without the reboot it took to make the machine functional again - I probably would still have reconfigured Apache and MySQL, though - the machine still would have gone largely unresponsive. DRH -- Dialup is like pissing through a pipette. Slow and excruciatingly painful. --
Ah your point was that people might do this on 64bit systems? They could indeed. It would not be very efficient but it should work in theory at least with enough memory. Of course they don't need 4k stacks for it. They can also try it on 32bit and it will work to some extent too, just not scale very far. And 4k stack more or less won't make much difference for that because the stack is only a small part of the lowmem needed for a blocked thread with open sockets. Note I didn't come up with that number, it was quoted to me earlier (but one of its authors has distanced itself from it now, so it seems to becoming more and more irrelevant indeed now) Stupid in this case just refers to the general observation that it is quite inefficient to do one thread per request on servers who are expected to process lots of long running connections. Perhaps I could have put that better I will give you that. Please Now that is a very doubtful claim. You realize that a functional network server thread needs a lot more lowmem than just the stack? -Andi --
My point was that people might try to make such a system work on a 32bit system and fail. The fact that the limit does exist and changing the stack size doesn't really help things is a key there. My point is that you can get a few more threads out of a machine with 4K stacks, even on 32bit. Sure, the difference is basically negligible, but it does happen. That extra available space may be the difference between a poorly coded program triggering random crashes (and the OOM killer) and the system surviving it. While it's true that I feel that the job of the kernel isn't to protect the incompetent, it should protect the competent admins from the incompetent True. But having that tiny bit of extra memory might be the difference between I didn't say otherwise. I was pointing out that 50K threads isn't out of the question when looking at the workload provided (and ignoring all other memory concerns. However, I had hoped I wouldn't have to spell out the stuff I've had to point Yes, I know you didn't come up with it. But in seeing the original commit-log for it, I'm thinking that the '50K' number was initially meant as either a Remember, you're talking about people that write the code in Java. It's going to spawn all kinds of threads anyway. I, personally, would write the code in a language giving me better control over the available resources. However, I'm not employed by any major company because I will almost always refuse to There was nothing else running on the machine and it was reporting lowmem free in the logs, just none "usable". Since the two biggest hogs on that box are Apache2 and MySQL - and since repairing the Apache2 config damage has halted further OOM's on that machine, I'm pretty much certain that it was Apache2 at fault, though since there were reports of free lowmem, I'm pretty certain it was a combination of fragmentation and Apache2. DRH -- Dialup is like pissing through a pipette. Slow and excruciatingly ...
On Sun, 20 Apr 2008 22:01:46 +0200 it is you who keeps putting up the 50k argument. What I'm talking about is in the 10k to 20k range; and that is actual workloads it was in the commit message from me you quoted, and was rather widely discussed at the time. It's also basic math; the Linux VM gets to deal with both short and long lasting allocations; no matter how hard you try to get some degree of fragmentation; especially due to the 15:1 acceleration you get due to the lowmem issue. And before you say "you should use 64 bit on such machines"; I would love it if more people used 64 bit linux. just like you're posting the evidence that 4k stacks overflows? Google scores: 1-order allocation failed 54000 pages do_IRQ: stack overflow 4560 pages -- If you want to reach me at my work email, use arjan@linux.intel.com For development, discussion and tips for power savings, visit http://www.lesswatts.org --
See the links I posted and quoted in an earlier message up the thread if you don't remember what you wrote yourself. I originally only held up the fragmentation argument (or rather only argued against it), until I was corrected by both Ingo and you in the earlier thread and you both insisted that 50k threads were the real raison d'être for 4k stacks. You're saying that was wrong and the fragmentation issue was really the real reason for 4k stacks? If both you and Ingo can agree on that On a 32bit kernel? My estimate is that you need around 32k for a functional blocked thread in a network server (8k + 2*4k for poll with large fd table and wait queues + some pinned dentries and inodes + misc other stuff). With 20k you're 625MB into your lowmem which leaves about 200MB left on a 3:1 system with 16GB (and ~128MB mem_map). That might work for some time, but I expect it will fall over at some point because there is just too much pinned lowmem and not enough left for other stuff (like networking buffers etc.) 10k sounds more doable. But again do 4k more or less make a big difference with the other thread overhead? I don't think so. And trading reliability (and functionality -- you basically have to cut off XFS) just for 4k/thread doesn't seem like a good bargain to Well if it is that serious a problem surely it will have hit some public bugzillas or mailing lists? Arguing with something secret is also not very useful. Also I find it always important to reevaluate assumptions when new facts come up. In this case we should reevaluate a decision that made sense[1] in 2.4 with the new facts of 2.6 (e.g. new VM with much better reclaim) [1] referring to the fragmentation argument, not the 50k threads which were always unrealistic. -Andi --
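The arithmetic behind that estimate can be checked with a short C sketch. The 32k-per-thread figure and the 20k thread count are the numbers from the message above; the function names are made up for illustration, and treating "k" and "MB" as binary KiB and MiB is an assumption.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: lowmem pinned by blocked threads, in KiB.
 * per_thread_kib is the estimated footprint of one blocked thread
 * (stack, poll tables, pinned dentries and inodes, and so on). */
static unsigned long pinned_lowmem_kib(unsigned long threads,
                                       unsigned long per_thread_kib)
{
    /* refuse inputs whose product would overflow unsigned long */
    assert(per_thread_kib == 0 || threads <= ULONG_MAX / per_thread_kib);
    return threads * per_thread_kib;
}

/* Hypothetical helper: lowmem saved by shrinking every thread's stack
 * from old_stack_kib to new_stack_kib, in KiB. */
static unsigned long stack_saving_kib(unsigned long threads,
                                      unsigned long old_stack_kib,
                                      unsigned long new_stack_kib)
{
    assert(old_stack_kib >= new_stack_kib);
    return pinned_lowmem_kib(threads, old_stack_kib - new_stack_kib);
}
```

With these inputs, pinned_lowmem_kib(20000, 32) is 640000 KiB, i.e. 625 MiB, which matches the figure quoted in the message. stack_saving_kib(20000, 8, 4) is 80000 KiB, roughly 78 MiB, which is the part of that total a switch to 4k stacks would recover under the same assumptions.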
On Mon, 21 Apr 2008 01:16:22 +0200 no but the other ones are order 0.. -- If you want to reach me at my work email, use arjan@linux.intel.com For development, discussion and tips for power savings, visit http://www.lesswatts.org --
with quotes for exact matches: "1-order allocation failed" 790 pages "do_IRQ: stack overflow" 1,880 pages http://www.google.com/search?q=%221-order+allocation+failed%22 http://www.google.com/search?q=%22do_IRQ%3A+stack+overflow%22 -Eric --
Also if order 1 allocs were a significant problem on i386 we must have had lots of reports of EAGAIN on fork/clone with !4k stack kernels. I'm not aware of any significant number of such reports (there were a few occasionally, but that is probably normal and unavoidable and can be caused by other things too like simply running out of lowmem) -Andi --
How I would like you to be right... Atheros AR5008, AR5414 PHY, "not yet here". It's almost one year now since I bought this laptop, and till now it's the cable or ndiswrapper. But yes, it's going better. For my first wifi laptop I waited two and a half years, now it seems that in a bit more than one there will be an open source driver... I know all the trouble ndiswrapper signifies. But I see also that people around me with a laptop and linux use more ndiswrapper than a real driver, so... be gentle with it. Thanks, Romano -- Sorry for the disclaimer --- ¡I cannot stop it! -- This communication contains confidential information. It is for the exclusive use of the intended addressee. If you are not the intended addressee, please note that any form of distribution, copying or use of this communication or the information in it is strictly prohibited by law. If you have received this communication in error, please immediately notify the sender by reply e-mail and destroy this message. Thank you for your cooperation. --
Nobody knows how much potential development is not done because "you can make your wifi work with ndiswrapper". -- vda --
I've got to agree with that sentiment. Once a working solution is found, no matter how crappy, it seems that almost all development stops. DRH -- Dialup is like pissing through a pipette. Slow and excruciatingly painful. --
and nobody knows how many people are running linux instead of windows because they were able to use ndiswrapper to get things running. most of those people contributed nothing to the kernel, but they all contributed to Linux, if nothing else as examples that Linux is a reasonable option (and some percentage of those users have contributed to other opensource projects that they would probably never have bumped into if they were running windows instead) I know we will never convince each other, but we do need to recognise that there is another valid point of view. David Lang --
And who knows how many more people would be running Linux if they didn't need ndiswrapper at all? And how much better would it be if the drivers were native linux code and were fully supportable because of that? There are many, many reasons why it'd be better if ndiswrapper didn't exist as a solution or if development on native solutions continued on at the level it would without ndiswrapper. DRH -- Dialup is like pissing through a pipette. Slow and excruciatingly painful. --
[Trimmed, I hope I got the authors right...] I understand your position, but let me give my example. I have this laptop that is one year old. I'm helping in every way I can with the development of ath5k --- IOW, offering testing; I am not an expert on this. But the mere fact that ndiswrapper exists enabled me to use this laptop on a daily basis, and so I could test new kernels (and if you look at the logs you'll see I have at least helped to fix a nasty MMC bug, and to make sound work in this laptop) and help in other areas, like suspend/resume testing and bug chasing. There is not only wireless development. Without ndiswrapper, I wouldn't have been in any position to help other areas. I would have had a crippled laptop[1], a much higher Vista uptime (which now is 0), and a far more bitter Linux experience. And this is the point of view of someone who has been using Linux since 0.99pl9, so I have quite a bit of experience. 99% of normal users would simply say "don't work"[2]. Romano [1] yes, there's a madwifi version locked to a specific kernel that works with my card. But I do not think that this would be so much different. [2] a nice page with "_this_ laptop will fully work with linux" would be nice. Linux on laptop or similar is too complex to be a real help when you have to buy a laptop in 2 days.
Romano Giannetti wrote: If sites like tuxmobil.org, hardware4linux.info, and the hardware compatibility databases of Linux distributors don't work for you, then just ask the notebook vendors directly. -- Stefan Richter -=====-==--- -=-- =-=== http://arcgraph.de/sr/ --
Unfortunately, it is quite a complex thing to check. Mind you, I've bought this laptop after looking all over there, but: - tuxmobil & Co are very user-driven, and you have to swim among tens of "similar" computers; - it's not so easy to know what exactly is bundled with a laptop[1]; - vendors say "works" (and often it is listed as works in the aforementioned sites too) independently of whether it works with an open source driver or not. As an example, all the nvidia-based graphics are marked "works". Romano [1] In my case, I selected this toshiba over for example a HP or an Acer because it had "atheros wifi" (but guess that the PHY version is too new to be supported...), "intel hda sound" (but guess that the specific codec didn't work at all, and continues to have a lot of problems), "intel graphics" (and that at least was a good decision!). --
The nv driver does work for all the nvidia cards as far as I know. Sure you don't get 3D acceleration, but you do get working X. But yes it is quite annoying when companies like highpoint (and others) claim to support linux when all they have is binary blobs as part of their "driver". -- Len Sorensen --
.. That's exactly the worry. If anyone wants to take a crack at testing some of the more likely fail paths there, just introduce a media error onto a SATA disk that's buried at the bottom of a stacked RAID1 over RAID0 over LVM, with XFS and nfsd on top. Or something like that. And then experiment with corrupting meta data rather than simply file data. How to introduce a media error? hdparm --make-bad-sector nnnnnn /dev/sdX This catches the most likely (IMHO) failure scenarios, but still comes nowhere near 100% code coverage. :( Cheers --
Really, not one? https://bugzilla.redhat.com/show_bug.cgi?id=247158 https://bugzilla.redhat.com/show_bug.cgi?id=227331 https://bugzilla.redhat.com/show_bug.cgi?id=240077 (hehe, ok, xfs is a common component there...) and it's not always obvious that you've overflowed the stack. CONFIG_DEBUG_STACKOVERFLOW isn't very useful because the warning printk If Fedora is the common distro, ok. :) Fedora is a pretty narrow sample in terms of IO stacks at least. I have plenty of fondness for Fedora, but it's almost 100% ext3[1]. I spent a fair amount of time getting xfs+lvm to survive 4k on F8; gcc caused stack usage to grow in general from F7 to F8, and F9 seems to have gotten tight again but I haven't gotten to the bottom of it yet. Heck my ext3-root-on-sda1 pre-beta F9 box, no nfs or lvm or xfs or anything gets within 744 bytes of the end of the 4k stack simply by *booting* (it was a modprobe process... maybe some module needs help) How many other distros use 4K stacks on x86, really? -Eric [1] http://www.smolts.org/static/stats/stats.html shows 24588 ext3 filesystems, compared to 366 xfs, 248 reiserfs, 76 jfs ... --
That could be easily fixed by executing the printk on the interrupt
stack on i386. Currently it is before the stack switch, which is wrong,
agreed. On x86-64 it should already execute on the interrupt stack. Or
perhaps it would be better to just move the stack switch on i386 into
entry.S too, similar to 64bit.
That wouldn't help without interrupt stacks of course, but these
should be always on anyways even with 8k stacks.
Experimental patch appended to do this.
-Andi
---
i386: Execute stack overflow warning on interrupt stack
Previously it would run on the process stack, which risks overflowing
an already low stack. Instead execute it on the interrupt stack.
Based on an observation by Eric Sandeen.
Signed-off-by: Andi Kleen <andi@firstfloor.org>
Index: linux/arch/x86/kernel/irq_32.c
===================================================================
--- linux.orig/arch/x86/kernel/irq_32.c
+++ linux/arch/x86/kernel/irq_32.c
@@ -61,6 +61,26 @@ static union irq_ctx *hardirq_ctx[NR_CPU
 static union irq_ctx *softirq_ctx[NR_CPUS] __read_mostly;
 #endif
+static void stack_overflow(void)
+{
+	printk("low stack detected by irq handler\n");
+	dump_stack();
+}
+
+static inline void call_on_stack2(void *func, unsigned long stack,
+				  unsigned long arg1, unsigned long arg2)
+{
+	unsigned long bx;
+	asm volatile(
+		" xchgl %%ebx,%%esp \n"
+		" call *%%edi \n"
+		" movl %%ebx,%%esp \n"
+		: "=a" (arg1), "=d" (arg2), "=b" (bx)
+		: "0" (arg1), "1" (arg2), "2" (stack),
+		  "D" (func)
+		: "memory", "cc");
+}
+
 /*
  * do_IRQ handles all normal device IRQ's (the special
  * SMP cross-CPU interrupts have their own specific
@@ -76,6 +96,7 @@ unsigned int do_IRQ(struct pt_regs *regs
 	union irq_ctx *curctx, *irqctx;
 	u32 *isp;
 #endif
+	int overflow = 0;
 	if (unlikely((unsigned)irq >= NR_IRQS)) {
 		printk(KERN_EMERG "%s: cannot handle IRQ %d\n",
@@ -92,11 +113,8 @@ unsigned int do_IRQ(struct pt_regs ...
note that in -rt we have an ftrace plugin that measures _precise_ stack footprint, when it happens. so it's possible to measure exact stack footprint and save a stack trace when that happens. Ingo --
Does anyone still experience problems with 2.6.25?
We all know that there once were problems, but if there are any left
cu
Adrian
--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
--
There are always problems. You can always come up with something that will crash in 4k, IMHO. Rather than foisting this upon everyone, I'd rather see work put into making stack size a boot parameter or something, so that people can choose what's appropriate for their workload (or their IO stack, if you prefer). --
We are going from 6k to 4k.
Your "You can always come up with something that will crash in" point
would be invariant to this change (although it might be harder to
Why should users have to poke with such deeply internal things?
That doesn't sound right.
Excessive stack usage in the kernel is considered to be a bug.
cu
Adrian
--
I don't know whether this problem is present in 2.6.25, but in older kernels it is (2.6.22 or 2.6.23). -- Thanks, Oliver --
Do we routinely test nasty scenarii such as a GFP_KERNEL allocation deep in a call stack trying to swap something out to NFS ? Ben. --
I doubt it, because this is the place that a local XFS filesystem typically blows a 4k stack (direct memory reclaim triggering ->writepage). Boot testing does nothing to exercise the potential paths for stack overflows.... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --
On Thu, 24 Apr 2008 09:36:52 +1000 The good news is that direct reclaim is.. rare. And I also doubt XFS is unique here; imagine the whole stacking thing on x86-64 just the same ... I wonder if the direct reclaim path should avoid direct reclaim if the stack has only X bytes left. (where the value of X is... well we can figure that one out later) The rarity of direct reclaim during normal use ought to make this not a performance problem per se, and the benefits go further than just "XFS" or "4K stacks". -- If you want to reach me at my work email, use arjan@linux.intel.com For development, discussion and tips for power savings, visit http://www.lesswatts.org --
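The "only X bytes left" test suggested above could look roughly like the following user-space C sketch. It is not kernel code and not a proposed patch. It assumes stacks are a power-of-two size, aligned to that size, and grow downward, so that masking the stack pointer gives the offset from the bottom of the stack (to my understanding the same idea the i386 do_IRQ low-stack check relies on). The function names, the reserved-bytes parameter and the threshold are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes still free below the stack pointer sp.
 * thread_size: size of each stack (power of two, stacks aligned to it).
 * reserved:    bytes occupied at the bottom of the stack, for example by
 *              a per-thread bookkeeping structure (value is an assumption). */
static size_t stack_bytes_left(uintptr_t sp, size_t thread_size, size_t reserved)
{
    /* the masking trick only works for power-of-two sizes */
    assert(thread_size != 0 && (thread_size & (thread_size - 1)) == 0);

    size_t offset = (size_t)(sp & (uintptr_t)(thread_size - 1));
    return offset > reserved ? offset - reserved : 0;
}

/* Hypothetical policy: skip direct reclaim when fewer than min_free bytes
 * of stack remain. min_free plays the role of "X" in the message above. */
static bool should_skip_direct_reclaim(uintptr_t sp, size_t thread_size,
                                       size_t reserved, size_t min_free)
{
    return stack_bytes_left(sp, thread_size, reserved) < min_free;
}
```

A caller deep in an allocation path would evaluate should_skip_direct_reclaim() with its current stack pointer and fall back to a cheaper path (for instance waking background reclaim and waiting) when it returns true. What that fallback should be is exactly the part the message leaves open.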
It's bad news actually. Because it means the stack overflow happens totally randomly and is hard to reproduce. And no, XFS is not unique there, any filesystem with a complex enough writeback path (aka extents + delalloc + smart allocator) will have to use quite a lot here. I'll be Actually direct reclaim should be totally avoided for complex filesystems. It's horrible for the stack and for the filesystem writeout policy and ondisk allocation strategies. --
Just as a data point, XFS isn't alone. I run through once or twice a month and try to get rid of any new btrfs stack pigs, but keeping under the 4k stack barrier is a constant challenge. My storage configuration is fairly simple, if we spin the wheel of stacked IO devices...it won't be pretty. Does it make more sense to kill off some brain cells on finding ways to dynamically increase the stack as we run out? Or even give the robust stack users like xfs/btrfs a way to say: I'm pretty sure this call path is going to hurt, please make my stack bigger now. We have relatively few entry points between the rest of the kernel and the FS, there should be some ways to compromise here. -chris --
On Thu, 24 Apr 2008 11:41:30 -0400, "Chris Mason" Hi, (Rookie warning goes here.) To me, growing the stack at more or less random places in the kernel seems to be quite a complicated thing to do and it will be quite a maintenance burden to find the right spots to insert stack usage checks. So I'd say: lose the dynamic aspect. How about unconditionally switching stacks at some defined points within the core code of the kernel, just before calling into any driver code, for example? The 4k-option has separate irq stacks already, why not have driver stacks too? I think the most important consideration to keep the stack size small was that non-order-0 allocations are unreliable under/after memory pressure due to fragmentation and that this allocation has to be done for each thread. It is therefore preferable not to do any higher-order allocations at all, unless there is a fall-back mechanism if the allocation fails. For higher-order stacks there isn't such a fallback... Can the system get by (without deadlocks at least in practice) with a limited number of preallocated but 'large' stacks (in addition to a small per-thread stack)? It was discussed that stack space is needed for any sleeping process. Could it be arranged that this waiting happens on the smallish stack, at least for the most common cases, while non-waiting activity can use the big stacks? Greetings, -- Alexander van Heukelum heukelum@fastmail.fm -- http://www.fastmail.fm - A fast, anti-spam email service. --
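The "limited number of preallocated but large stacks" idea can be illustrated with a minimal fixed-pool sketch in C. Everything here is an assumption made for illustration: the pool size, the stack size, the names, and the absence of locking (a real implementation would need per-CPU pools or a lock, and would still have to switch the stack pointer, which this sketch does not attempt).

```c
#include <assert.h>
#include <stddef.h>

#define NR_BIG_STACKS  4              /* illustrative pool size */
#define BIG_STACK_SIZE (16 * 1024)    /* illustrative "large" stack size */

/* Stacks are reserved up front, so taking one never needs a
 * higher-order allocation at run time. */
struct big_stack_pool {
    unsigned char mem[NR_BIG_STACKS][BIG_STACK_SIZE];
    unsigned char in_use[NR_BIG_STACKS];
};

/* Hand out a free stack, or NULL when the pool is exhausted. The NULL
 * case is where the deadlock question from the message arises: the
 * caller must either wait or carry on using its small stack. */
static void *big_stack_get(struct big_stack_pool *pool)
{
    for (size_t i = 0; i < NR_BIG_STACKS; i++) {
        if (!pool->in_use[i]) {
            pool->in_use[i] = 1;
            return pool->mem[i];
        }
    }
    return NULL;
}

/* Return a stack previously obtained from big_stack_get(). */
static void big_stack_put(struct big_stack_pool *pool, void *stack)
{
    for (size_t i = 0; i < NR_BIG_STACKS; i++) {
        if (stack == (void *)pool->mem[i]) {
            assert(pool->in_use[i]);
            pool->in_use[i] = 0;
            return;
        }
    }
    assert(!"stack does not belong to this pool");
}
```

The sketch makes the trade-off visible: acquisition cannot fail because of fragmentation, but it can fail because the pool is empty, so the scheme only works if code holding a big stack never sleeps for long, which is the arrangement the message asks about.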
Yup, not even counting when the said NFS is on top of some fancy network stack with a driver on top of USB .... I mean, we do have potential for worst case scenario that I think -will- blow a 4k stack. Ben. --
