Re: x86: 4kstacks default

To: Ingo Molnar <mingo@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>
Date: Friday, April 18, 2008 - 5:29 pm

On Fri, 18 Apr 2008 17:37:36 GMT

This patch will cause kernels to crash.

It has no changelog which explains or justifies the alteration.

afaict the patch was not posted to the mailing list and was not
discussed or reviewed.
--

To: Andrew Morton <akpm@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Saturday, April 19, 2008 - 10:23 am

what mainline kernels crash and how will they crash? Fedora and other
distros have had 4K stacks enabled for years:

$ grep 4K /boot/config-2.6.24-9.fc9
CONFIG_4KSTACKS=y

and we've conducted tens of thousands of bootup tests with all sorts of
drivers and kernel options enabled and have yet to see a single crash
due to 4K stacks. So basically the kernel default just follows the
common distro default now. (distros and users can still disable it)

Ingo
--

To: Ingo Molnar <mingo@...>
Cc: Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Wednesday, April 23, 2008 - 1:27 am

Do we routinely test nasty scenarii such as a GFP_KERNEL allocation deep
in a call stack trying to swap something out to NFS ?

Ben.

--

To: Benjamin Herrenschmidt <benh@...>
Cc: Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Wednesday, April 23, 2008 - 7:36 pm

I doubt it, because this is the place that a local XFS filesystem
typically blows a 4k stack (direct memory reclaim triggering
->writepage). Boot testing does nothing to exercise the potential
paths for stack overflows....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
--

To: David Chinner <dgc@...>
Cc: Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Wednesday, April 23, 2008 - 8:56 pm

Yup, not even counting when the said NFS is on top of some fancy
network stack with a driver on top of USB .... I mean, we do have
potential for worst case scenario that I think -will- blow a 4k stack.

Ben.

--

To: T David Chinner <dgc@...>
Cc: Benjamin Herrenschmidt <benh@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Thomas Gleixner <tglx@...>
Date: Wednesday, April 23, 2008 - 8:45 pm

On Thu, 24 Apr 2008 09:36:52 +1000

The good news is that direct reclaim is.. rare.
And I also doubt XFS is unique here; imagine the whole stacking thing on x86-64 just the same ...

I wonder if the direct reclaim path should avoid direct reclaim if the stack has only X bytes left.
(where the value of X is... well we can figure that one out later)

The rarity of direct reclaim during normal use ought to make this not a performance problem per se,
and the benefits go further than just "XFS" or "4K stacks".

--
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
--
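
Arjan's idea above (skip direct reclaim when only X bytes of stack remain) can be illustrated with a small user-space sketch. This is not kernel code: the function names, the `reserved` area and the threshold are hypothetical, and the model simply assumes a power-of-two sized, size-aligned stack that grows downward.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model, not kernel code: the stack is a power-of-two sized,
 * size-aligned region that grows downward, with `reserved` bytes at the
 * low end that must never be touched (for example a per-thread header).
 * A stack pointer sitting exactly on the top edge aliases to offset 0;
 * the sketch ignores that corner case.
 */
uintptr_t stack_bytes_left(uintptr_t sp, uintptr_t stack_size,
                           uintptr_t reserved)
{
    /* stack_size must be a non-zero power of two for the mask to work. */
    assert(stack_size != 0 && (stack_size & (stack_size - 1)) == 0);

    uintptr_t offset = sp & (stack_size - 1); /* distance from the base */
    return offset > reserved ? offset - reserved : 0;
}

/* Allow the deep reclaim path only if at least min_left bytes remain. */
bool may_enter_direct_reclaim(uintptr_t sp, uintptr_t stack_size,
                              uintptr_t reserved, uintptr_t min_left)
{
    return stack_bytes_left(sp, stack_size, reserved) >= min_left;
}
```

With a 4096-byte stack, 64 reserved bytes and a 1024-byte threshold (all illustrative values), a stack pointer 3000 bytes above the base passes the check and one 600 bytes above the base does not. Choosing the real value of X is exactly the part the message leaves open.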

To: Arjan van de Ven <arjan@...>
Cc: T David Chinner <dgc@...>, Benjamin Herrenschmidt <benh@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Thomas Gleixner <tglx@...>
Date: Thursday, April 24, 2008 - 5:52 am

It's bad news actually. Because it means the stack overflow happens
totally at random and is hard to reproduce. And no, XFS is not unique there,
any filesystem with a complex enough writeback path (aka extents +
delalloc + smart allocator) will have to use quite a lot here. I'll be

Actually direct reclaim should be totally avoided for complex
filesystems. It's horrible for the stack and for the filesystem
writeout policy and ondisk allocation strategies.

--

To: Christoph Hellwig <hch@...>
Cc: Arjan van de Ven <arjan@...>, T David Chinner <dgc@...>, Benjamin Herrenschmidt <benh@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Thomas Gleixner <tglx@...>, <sandeen@...>
Date: Thursday, April 24, 2008 - 11:41 am

Just as a data point, XFS isn't alone. I run through once or twice a month
and try to get rid of any new btrfs stack pigs, but keeping under the 4k
stack barrier is a constant challenge.

My storage configuration is fairly simple, if we spin the wheel of stacked IO
devices...it won't be pretty.

Does it make more sense to kill off some brain cells on finding ways to
dynamically increase the stack as we run out? Or even give the robust stack
users like xfs/btrfs a way to say: I'm pretty sure this call path is going to
hurt, please make my stack bigger now.

We have relatively few entry points between the rest of the kernel and the FS,
there should be some ways to compromise here.

-chris
--

To: Chris Mason <chris.mason@...>, Christoph Hellwig <hch@...>
Cc: Arjan van de Ven <arjan@...>, T David Chinner <dgc@...>, Benjamin Herrenschmidt <benh@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, <Th@...>
Date: Thursday, April 24, 2008 - 2:30 pm

On Thu, 24 Apr 2008 11:41:30 -0400, "Chris Mason"

Hi,

(Rookie warning goes here.) To me, growing the stack at more or less
random places in the kernel seems to be quite a complicated thing to do
and it will be quite a maintenance burden to find the right spots to
insert stack usage checks. So I'd say: lose the dynamic aspect.

How about unconditionally switching stacks at some defined points within
the core code of the kernel, just before calling into any driver code,
for example? The 4k-option has separate irq stacks already, why not have
driver stacks too?

I think the most important consideration to keep the stack size small
was that non-order-0 allocations are unreliable under/after memory
pressure due to fragmentation and that this allocation has to be done
for each thread. It is therefore preferable not to do any higher-order
allocations at all, unless there is a fall-back mechanism if the
allocation fails. For higher-order stacks there isn't such a fallback...
Can the system get by (without deadlocks at least in practice) with a
limited number of preallocated but 'large' stacks (in addition to a
small per-thread stack)?

It was discussed that stack space is needed for any sleeping process.
Could it be arranged that this waiting happens on the smallish stack, at
least for the most common cases, while non-waiting activity can use the
big stacks?

Greetings,
--
Alexander van Heukelum
heukelum@fastmail.fm

--
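
The order-0 versus higher-order point above comes down to simple arithmetic, sketched here in user-space C. The helper name `alloc_order` is hypothetical (it is similar in spirit to, but is not, the kernel's `get_order()`), and a 4 KiB page size is assumed.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Smallest order such that (page_size << order) >= size.
 * An order-n allocation needs 2^n physically contiguous pages.
 */
unsigned int alloc_order(size_t size, size_t page_size)
{
    assert(size > 0 && page_size > 0);

    unsigned int order = 0;
    size_t span = page_size;

    while (span < size) {
        span <<= 1;
        order++;
    }
    return order;
}
```

Under that assumption a 4K stack is a single page (order 0), while an 8K stack is order 1 and needs two adjacent free pages for every thread, which is the allocation that becomes unreliable once memory is fragmented.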

To: Christoph Hellwig <hch@...>
Cc: Arjan van de Ven <arjan@...>, T David Chinner <dgc@...>, Benjamin Herrenschmidt <benh@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Thomas Gleixner <tglx@...>
Date: Thursday, April 24, 2008 - 8:25 am

That's basically any reclaim, even kswapd will ruin policy and block
allocation smarts.

--

To: Ingo Molnar <mingo@...>
Cc: Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>, Christoph Hellwig <hch@...>, David Chinner <dgc@...>, <xfs@...>
Date: Saturday, April 19, 2008 - 10:35 am

Hi Ingo!

With older kernels the typical failing combination is: xfs+nfs+4k stack (+lvm)

--
Thanks,
Oliver
--

To: Oliver Pinter <oliver.pntr@...>
Cc: Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>, Christoph Hellwig <hch@...>, David Chinner <dgc@...>, <xfs@...>
Date: Saturday, April 19, 2008 - 11:19 am

Does anyone still experience problems with 2.6.25?

We all know that there once were problems, but if there are any left

cu
Adrian

--

"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed

--

To: Adrian Bunk <bunk@...>
Cc: Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>, Christoph Hellwig <hch@...>, David Chinner <dgc@...>, <xfs@...>
Date: Saturday, April 19, 2008 - 11:42 am

I don't know whether this problem is present in 2.6.25, but in older
kernels, yes (2.6.22> or 2.6.23>).

--
Thanks,
Oliver
--

To: Adrian Bunk <bunk@...>
Cc: Oliver Pinter <oliver.pntr@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>, Christoph Hellwig <hch@...>, David Chinner <dgc@...>, <xfs@...>
Date: Saturday, April 19, 2008 - 9:56 pm

There are always problems. You can always come up with something that
will crash in 4k, IMHO.

Rather than foisting this upon everyone, I'd rather see work put into
making stack size a boot parameter or something, so that people can
choose what's appropriate for their workload (or their IO stack, if you
prefer).

--

To: Eric Sandeen <sandeen@...>
Cc: Oliver Pinter <oliver.pntr@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>, Christoph Hellwig <hch@...>, David Chinner <dgc@...>, <xfs@...>
Date: Sunday, April 20, 2008 - 3:42 am

We are going from 6k to 4k.

Your "You can always come up with something that will crash in" point
would be invariant to this change (although it might be harder to

Why should users have to poke with such deeply internal things?
That doesn't sound right.

Excessive stack usage in the kernel is considered to be a bug.

cu
Adrian

--

To: Adrian Bunk <bunk@...>
Cc: Eric Sandeen <sandeen@...>, Oliver Pinter <oliver.pntr@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>, Christoph Hellwig <hch@...>, David Chinner <dgc@...>, <xfs@...>
Date: Sunday, April 20, 2008 - 12:59 pm

let's see your patches then
--

To: Ingo Molnar <mingo@...>
Cc: Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Saturday, April 19, 2008 - 11:29 pm

Really, not one?

https://bugzilla.redhat.com/show_bug.cgi?id=247158
https://bugzilla.redhat.com/show_bug.cgi?id=227331
https://bugzilla.redhat.com/show_bug.cgi?id=240077

(hehe, ok, xfs is a common component there...)

and it's not always obvious that you've overflowed the stack.

CONFIG_DEBUG_STACKOVERFLOW isn't very useful because the warning printk

If Fedora is the common distro, ok. :)

Fedora is a pretty narrow sample in terms of IO stacks at least. I have
plenty of fondness for Fedora, but it's almost 100% ext3[1]. I spent a
fair amount of time getting xfs+lvm to survive 4k on F8; gcc caused
stack usage to grow in general from F7 to F8, and F9 seems to have
gotten tight again but I haven't gotten to the bottom of it yet.

Heck my ext3-root-on-sda1 pre-beta F9 box, no nfs or lvm or xfs or
anything gets within 744 bytes of the end of the 4k stack simply by
*booting* (it was a modprobe process... maybe some module needs help)

How many other distros use 4K stacks on x86, really?

-Eric

[1] http://www.smolts.org/static/stats/stats.html shows 24588 ext3
filesystems, compared to 366 xfs, 248 reiserfs, 76 jfs ...
--
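
A figure like "within 744 bytes of the end of the stack" is a high-water mark. One common way to obtain such a number, sketched below in user-space C, is to zero-fill the stack when it is created and later scan from the low end for the first word that was ever written. The function name is hypothetical and this is only an illustration of the technique, not the code Eric used.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count the bytes at the low end of a downward-growing stack that were
 * never written, assuming the whole stack was zero-filled at creation.
 * `words` points at the lowest address; `nwords` is the stack length.
 */
size_t stack_bytes_never_used(const unsigned long *words, size_t nwords)
{
    assert(words != NULL);

    size_t i = 0;

    /* Walk upward from the deep end until the first non-zero word. */
    while (i < nwords && words[i] == 0)
        i++;
    return i * sizeof(unsigned long);
}
```

The result is an estimate: a call chain whose deepest frames happened to store only zeroes is under-counted, and the scan shows how deep the stack once got, not which call path got it there.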

To: Eric Sandeen <sandeen@...>
Cc: Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Monday, April 21, 2008 - 10:31 am

note that in -rt we have an ftrace plugin that measures _precise_ stack
footprint, when it happens.

so it's possible to measure exact stack footprint and save a stack trace
when that happens.

Ingo
--

To: Eric Sandeen <sandeen@...>
Cc: Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 8:36 am

That could be easily fixed by executing the printk on the interrupt
stack on i386. Currently it is before the stack switch, which is wrong,
agreed. On x86-64 it should already execute on the interrupt stack. Or
perhaps it would be better to just move the stack switch on i386 into
entry.S too, similar to 64bit.

That wouldn't help without interrupt stacks of course, but these
should be always on anyways even with 8k stacks.

Experimental patch appended to do this.

-Andi

---

i386: Execute stack overflow warning on interrupt stack

Previously it would run on the process stack, which risks overflowing
an already low stack. Instead execute it on the interrupt stack.

Based on an observation by Eric Sandeen.

Signed-off-by: Andi Kleen <andi@firstfloor.org>

Index: linux/arch/x86/kernel/irq_32.c
===================================================================
--- linux.orig/arch/x86/kernel/irq_32.c
+++ linux/arch/x86/kernel/irq_32.c
@@ -61,6 +61,26 @@ static union irq_ctx *hardirq_ctx[NR_CPU
 static union irq_ctx *softirq_ctx[NR_CPUS] __read_mostly;
 #endif

+static void stack_overflow(void)
+{
+	printk("low stack detected by irq handler\n");
+	dump_stack();
+}
+
+static inline void call_on_stack2(void *func, unsigned long stack,
+				  unsigned long arg1, unsigned long arg2)
+{
+	unsigned long bx;
+	asm volatile(
+		" xchgl %%ebx,%%esp \n"
+		" call *%%edi \n"
+		" movl %%ebx,%%esp \n"
+		: "=a" (arg1), "=d" (arg2), "=b" (bx)
+		: "0" (arg1), "1" (arg2), "2" (stack),
+		  "D" (func)
+		: "memory", "cc");
+}
+
 /*
  * do_IRQ handles all normal device IRQ's (the special
  * SMP cross-CPU interrupts have their own specific
@@ -76,6 +96,7 @@ unsigned int do_IRQ(struct pt_regs *regs
 	union irq_ctx *curctx, *irqctx;
 	u32 *isp;
 #endif
+	int overflow = 0;

 	if (unlikely((unsigned)irq >= NR_IRQS)) {
 		printk(KERN_EMERG "%s: cannot handle IRQ %d\n",
@@ -92,11 +113,8 @@ unsigned int do_IRQ(struct pt_r...

To: Ingo Molnar <mingo@...>
Cc: Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Saturday, April 19, 2008 - 10:59 am

If by other distros you mean RHEL then yes. However, openSUSE,
Ubuntu, and Mandriva all still have 8K stacks. I know of no other
distributions that default to 4K.

--
Shawn
--

To: Shawn Bohrer <shawn.bohrer@...>
Cc: Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 4:09 am

MontaVista offers 4k stacks for arm (currently an external patch) and
markets that as a feature to customers, so many of them might use it.

In-kernel the sh and m68knommu ports also offer 4k stacks (for both
archs there's also a defconfig using it), and the mn10300 port contains
an #ifdef but no config option.

The stack problems in the kernel tend to not be in arch code, and if
we don't get i386 to always run with 4k stacks there's no chance that

cu
Adrian

--

To: Adrian Bunk <bunk@...>
Cc: Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 4:06 am

Not really the case - embedded tends not to use deep stacks of drivers.

Alan
--

To: Alan Cox <alan@...>
Cc: Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 4:51 am

Something like nfsd-over-xfs-over-raid is (or was) the most common

cu
Adrian

--

To: Adrian Bunk <bunk@...>
Cc: Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 5:36 am

On Sun, 20 Apr 2008 11:51:04 +0300

Specific cases yes, but such NAS devices have big processors and are not
little embedded CPUs. On an embedded box you know at build time what it
will be doing.
--

To: Alan Cox <alan@...>
Cc: Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 6:44 am

The code in the kernel that gets the least coverage of all is our
error paths, and some vendor might try 4k stacks, validate it works in
all use cases - and then it will blow up in some error condition he
didn't test.

6k is known to work, and there aren't many problems known with 4k.

And from a QA point of view the only way of getting 4k thoroughly tested
by users, and well also tested in -rc kernels for catching regressions
before they get into stable kernels, is if we get 4k stacks enabled
unconditionally on i386.

cu
Adrian

--

To: Adrian Bunk <bunk@...>
Cc: Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 9:22 am

..

That's exactly the worry.

If anyone wants to take a crack at testing some of the more likely
fail paths there, just introduce a media error onto a SATA disk
that's buried at the bottom of a stacked RAID1 over RAID0 over LVM,
with XFS and nfsd on top.

Or something like that.
And then experiment with corrupting meta data rather than simply file data.
How-to introduce a media error? hdparm --make-bad-sector nnnnnn /dev/sdX

This catches the most likely (IMHO) failure scenarios,
but still comes nowhere near 100% code coverage. :(

Cheers
--

To: Adrian Bunk <bunk@...>
Cc: Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 8:27 am

But you have to first ask why do you want 4k tested? Does it serve
any useful purpose in itself? I don't think so. Or you're saying
it's important to support 50k kernel threads on 32bit kernels?

-Andi
--

To: Andi Kleen <andi@...>
Cc: Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 11:44 am

Andi, you're the only one I've seen seriously pounding the "50k threads"
thing - I don't think anyone is really fooled by the straw-man, so I'd
suggest you drop it.

The real issue is that you think (and are correct in thinking) that people are
idiots. Yes, there will be breakages if the default is changed to 4k stacks -
but if people are running new kernels on boxes that'll hit stack use problems
(that *AREN'T* related to ndiswrapper) and haven't made sure that they've
configured the kernel properly, then they deserve the outcome. It isn't the
job of the Linux Kernel to protect the incompetent - nor is it the job of
linux kernel developers to do such.

If people are doing a "zcat /proc/config.gz > .config && make oldconfig" (or
similar) the problem shouldn't even appear, really. They'll get whatever
setting was in their old config for the stack size. And until the problems
with deep-stack setups - like nfs+xfs+raid - get resolved I'd think that the
option to configure the stack size would remain.

Since the second-most-common reason for stack overages is ndiswrapper... Well,
with there being so much more hardware now supported directly by the linux
kernel... I'm stunned every time someone tells me "I can't run Linux on my
laptop, there is hardware that isn't supported without me having to get
ndiswrapper". The last time someone said that to me I pointed to the fact
that their hardware is supported by the latest kernel and even offered to
build&install it for them.

DRH

--
Dialup is like pissing through a pipette. Slow and excruciatingly painful.
--

To: Daniel Hazelton <dhazelton@...>
Cc: linux-kernel <linux-kernel@...>
Date: Tuesday, April 22, 2008 - 2:20 pm

How I would like you to be right... Atheros AR5008, AR5414 PHY, "not yet
here". It's almost one year now since I bought this laptop, and till now
it's the cable or ndiswrapper. But yes, it's going better. For my first
wifi laptop I waited two and a half years, now it seems that in a bit
more than one there will be an open source driver...

I know all the trouble ndiswrapper signifies. But I also see that people
around me with a laptop and linux use more ndiswrapper than a real
driver, so... be gentle with it.

Thanks,
Romano

--
Sorry for the disclaimer --- ¡I cannot stop it!

--
This communication contains confidential information. It is for the exclusive use of the intended addressee. If you are not the intended addressee, please note that any form of distribution, copying or use of this communication or the information in it is strictly prohibited by law. If you have received this communication in error, please immediately notify the sender by reply e-mail and destroy this message. Thank you for your cooperation.
--

To: Romano Giannetti <romanol@...>
Cc: Daniel Hazelton <dhazelton@...>, linux-kernel <linux-kernel@...>
Date: Wednesday, April 23, 2008 - 1:03 am

Nobody knows how much potential development is not done because
"you can make your wifi work with ndiswrapper".
--
vda
--

To: Denys Vlasenko <vda.linux@...>
Cc: Romano Giannetti <romanol@...>, linux-kernel <linux-kernel@...>
Date: Wednesday, April 23, 2008 - 1:21 am

I've got to agree with that sentiment. Once a working solution is found, no
matter how crappy, it seems that almost all development stops.

DRH

--

To: Daniel Hazelton <dhazelton@...>
Cc: Denys Vlasenko <vda.linux@...>, Romano Giannetti <romanol@...>, linux-kernel <linux-kernel@...>
Date: Wednesday, April 23, 2008 - 1:25 am

and nobody knows how many people are running linux instead of windows
because they were able to use ndiswrapper to get things running. most of
those people contributed nothing to the kernel, but they all contributed
to Linux, if nothing else as examples that Linux is a reasonable option
(and some percentage of those users have contributed to other opensource
projects that they would probably never have bumped into if they were
running windows instead)

I know we will never convince each other, but we do need to recognise that
there is another valid point of view.

David Lang
--

To: <david@...>
Cc: Denys Vlasenko <vda.linux@...>, Romano Giannetti <romanol@...>, linux-kernel <linux-kernel@...>
Date: Wednesday, April 23, 2008 - 1:41 am

And who knows how many more people would be running Linux if they didn't need
ndiswrapper at all?

And how much better would it be if the drivers were native linux code and were
fully supportable because of that?

There are many, many reasons why it'd be better if ndiswrapper didn't exist as
a solution or if development on native solutions continued on at the level it
would without ndiswrapper.

DRH

--

To: Daniel Hazelton <dhazelton@...>
Cc: <david@...>, Denys Vlasenko <vda.linux@...>, linux-kernel <linux-kernel@...>
Date: Wednesday, April 23, 2008 - 3:46 am

[Trimmed, I hope I got the authors right...]

I understand your position, but let me give my example. I have this
laptop that is one year old. I'm helping as much as I can with the
development of ath5k --- IOW, offering testing, I am not an expert on
this.

But the mere fact that ndiswrapper exists enabled me to use this laptop
on a daily basis, and so I could test new kernels (and if you look at the
logs you'll see I have at least helped to fix a nasty MMC bug, and to
make sound work in this laptop) and help in other areas, like
suspend/resume testing and bug chasing.

There is not only wireless development. Without ndiswrapper, I wouldn't
have been in any position to help in other areas. I would have had a
crippled laptop[1], a much higher Vista uptime (which now is 0), and a
far more bitter Linux experience.

And this is the point of view of someone who has been using Linux since
0.99pl9, so I have quite a bit of experience. 99% of normal users would
simply say "doesn't work"[2].

Romano

[1] yes, there's a madwifi version locked to a specific kernel that
works with my card. But I do not think that this would be so much
different.

[2] a nice page with "_this_ laptop will fully work with linux" would be
nice. Linux on laptop or similar is too complex to be a real help when
you have to buy a laptop in 2 days.

--

To: Romano Giannetti <romanol@...>
Cc: Daniel Hazelton <dhazelton@...>, <david@...>, Denys Vlasenko <vda.linux@...>, linux-kernel <linux-kernel@...>
Date: Wednesday, April 23, 2008 - 7:24 am

Romano Giannetti wrote:

If sites like tuxmobil.org, hardware4linux.info, and the hardware
compatibility databases of Linux distributors don't work for you, then
just ask the notebook vendors directly.
--
Stefan Richter
-=====-==--- -=-- =-===
http://arcgraph.de/sr/
--

To: Stefan Richter <stefanr@...>
Cc: Daniel Hazelton <dhazelton@...>, <david@...>, Denys Vlasenko <vda.linux@...>, linux-kernel <linux-kernel@...>
Date: Wednesday, April 23, 2008 - 8:15 am

Unfortunately, it is quite a complex thing to check. Mind you, I've
bought this laptop after looking all over there, but:

- tuxmobil & Co are very user-driven, and you have to swim among tens
of "similar" computers;

- it's not so easy to know what exactly is bundled with a laptop[1];

- vendors say "works" (and it is often listed as working on the aforementioned
sites too) regardless of whether it works with an open source driver or not.
As an example, all the nvidia-based graphics are marked "works".

Romano

[1] In my case, I selected this toshiba over for example a HP or an Acer
because it had "atheros wifi" (but guess that the PHY version is too new
to be supported...), "intel hda sound" (but guess that the specific
codec didn't work at all, and continues to have a lot of problems),
"intel graphics" (and that at least was a good decision!).

--

To: Romano Giannetti <romanol@...>
Cc: Stefan Richter <stefanr@...>, Daniel Hazelton <dhazelton@...>, <david@...>, Denys Vlasenko <vda.linux@...>, linux-kernel <linux-kernel@...>
Date: Wednesday, April 23, 2008 - 11:59 am

The nv driver does work for all the nvidia cards as far as I know. Sure
you don't get 3D acceleration, but you do get working X.

But yes it is quite annoying when companies like highpoint (and others)
claim to support linux when all they have is binary blobs as part of their
"driver".

--
Len Sorensen
--

To: Daniel Hazelton <dhazelton@...>
Cc: Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 1:26 pm

Ok, perhaps we can settle this properly. Like historians. We study the
original sources.

The primary resource is the original commit adding the 4k stack code.
You cannot find this in latest git because it predates 2.6.12, but it is
available in one of the historic trees imported from BitKeeper like
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git

commit 95f238eac82907c4ccbc301cd5788e67db0715ce
Author: Andrew Morton <akpm@osdl.org>
Date: Sun Apr 11 23:18:43 2004 -0700

[PATCH] ia32: 4Kb stacks (and irqstacks) patch

From: Arjan van de Ven <arjanv@redhat.com>

Below is a patch to enable 4Kb stacks for x86. The goal of this is to

1) Reduce footprint per thread so that systems can run many more threads
(for the java people)

2) Reduce the pressure on the VM for order > 0 allocations. We see real life
workloads (granted with 2.4 but the fundamental fragmentation issue isn't
solved in 2.6 and isn't solvable in theory) where this can be a problem.
In addition order > 0 allocations can make the VM "stutter" and give more
latency due to having to do much much more work trying to defragment

...
<<

This gives us two reasons as you can see, one of them many threads
and another mostly only relevant to 2.4

Now I was also assuming that nobody took (1) really seriously and
attacked (2) in an earlier thread; in particular in

Actually the real reason the 4K stacks were introduced IIRC was that
the VM is not very good at allocation of order > 0 pages and that only
using order 0 and not order 1 in normal operation prevented some stalls.

This rationale also goes back to 2.4 (especially some of the early 2.4
VMs were not very good) and the 2.6 VM is generally better and on
x86-64 I don't see much evidence that these stalls are a big problem
(but then x86-64 also has more lowmem).
<<

This was corrected by Ingo who was one of the primary authors of the patch:

no, the primary ...

To: Andi Kleen <andi@...>
Cc: Daniel Hazelton <dhazelton@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 2:48 pm

On Sun, 20 Apr 2008 19:26:10 +0200

I'm sorry but I really hope nobody shares your assumption here.
These are real customer workloads; java based "many things going on" at a time
showed several thousands of threads in the system (a dozen or two per request, multiplied
by the number of outstanding connections) for *real customers*.

yes you did attack. But let's please use more friendly conversation here than words like
"attack". This is not a war, and we really shouldn't be hostile in this forum, neither

What you didn't atta^Waddress was the observation that fragmentation is fundamentally unsolvable.
Yes 2.4 sucked a lot more than 2.6 does. But even 2.6 will (and does) have fragmentation issues.

I'm sorry but I fail to entirely understand where your "So" or the rest of your
conclusion comes from in terms of "both the authors". Which part of "fewer threads" and
"8kb versus fragmentation" did you misunderstand to get to your conclusion?

--
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
--

To: Arjan van de Ven <arjan@...>
Cc: <andi@...>, <dhazelton@...>, <bunk@...>, <alan@...>, <shawn.bohrer@...>, <mingo@...>, <linux-kernel@...>, <tglx@...>
Date: Sunday, April 20, 2008 - 5:45 pm

Lumpy reclaim is supposed to be exactly that.
--

To: Andrew Morton <akpm@...>
Cc: Arjan van de Ven <arjan@...>, <dhazelton@...>, <bunk@...>, <alan@...>, <shawn.bohrer@...>, <mingo@...>, <linux-kernel@...>, <tglx@...>
Date: Sunday, April 20, 2008 - 5:51 pm

Also if order 1 allocs were a significant problem on i386 we must have
had lots of reports of EAGAIN on fork/clone with !4k stack kernels. I'm
not aware of a significant number of such reports (there were a few
occasionally, but that is probably normal and unavoidable and can
be caused by other things too like simply running out of lowmem)

-Andi

--

To: Arjan van de Ven <arjan@...>
Cc: Daniel Hazelton <dhazelton@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 4:01 pm

Several thousands or 50k? Several thousands sounds large, but not entirely unreasonable,

No I don't take 50k threads on 32bit seriously. And I hope you do not
either.

Why I don't take it seriously: on 32bit 50k threads will lead
to lowmem exhaustion if the threads are actually doing something
(like keeping select pages around or similar and having some thread
local data). You'll easily be at 16-32K/thread and that is already
far beyond the lowmem available on any 3:1 split 32bit kernel, likely
even beyond 2:2. Even with 3:1 it could be tight.

So you can say about customer workloads what you want, but you'll
have a hard time convincing me they really run 50k threads
doing something on 32bit.

Now if we take the realistic overhead of a thread into
account, 4k more or less doesn't really matter all that much
and the decreased safety from the 4k stack starts to look

Ok what word would you prefer?

There is no war involved right, just a technical argument. I previously
always assumed that "attacking" was a standard term in discussions, but
if you don't like it I can switch to another one.

Regarding war like terminology: I used to think that people who commonly
talk about "nuking code" went a little too far, but at some point

I don't see any evidence that there are serious order 1 fragmentation
issues on 2.6. If you have any please post it.

-Andi
--

To: Andi Kleen <andi@...>
Cc: Daniel Hazelton <dhazelton@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 6:33 pm

On Sun, 20 Apr 2008 22:01:46 +0200

it is you who keeps putting up the 50k argument.
What I'm talking about is in the 10k to 20k range; and that is actual workloads

it was in the commit message from me you quoted, and was rather widely discussed at the time.
It's also basic math; the Linux VM gets to deal with both short and long lasting allocations;
no matter how hard you try, you get some degree of fragmentation; especially due to the
15:1 acceleration you get due to the lowmem issue.

And before you say "you should use 64 bit on such machines"; I would love it if more people used 64 bit linux.

just like you're posting the evidence that 4k stacks overflow?

Google scores:

1-order allocation failed 54000 pages
do_IRQ: stack overflow 4560 pages

--
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
--

To: Arjan van de Ven <arjan@...>
Cc: Andi Kleen <andi@...>, Daniel Hazelton <dhazelton@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 11:06 pm

with quotes for exact matches:

"1-order allocation failed" 790 pages
"do_IRQ: stack overflow" 1,880 pages

http://www.google.com/search?q=%221-order+allocation+failed%22
http://www.google.com/search?q=%22do_IRQ%3A+stack+overflow%22

-Eric
--

To: Arjan van de Ven <arjan@...>
Cc: Daniel Hazelton <dhazelton@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 7:16 pm

See the links I posted and quote in an earlier message up the thread if you
don't remember what you wrote yourself.

I originally only held up the fragmentation argument (or rather only
argued against it), until I was corrected by both Ingo and you in the
earlier thread and you both insisted that 50k threads were the real
raison d'être for 4k stacks.

You're saying that was wrong and the fragmentation issue was really the
real reason for 4k stacks? If both you and Ingo can agree on that

On a 32bit kernel?

My estimate is that you need around 32k for a functional blocked thread
in a network server (8k + 2*4k for poll with large fd table and wait queues +
some pinned dentries and inodes + misc other stuff). With 20k you're 625MB into
your lowmem which leaves about 200MB left on a 3:1 system with 16GB
(and ~128MB mem_map). That might work for some time, but I expect it will fall
over at some point because there is just too much pinned lowmem
and not enough left for other stuff (like networking buffers etc.)
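
The estimate above can be rechecked with a few lines of arithmetic. This is only a sketch under the figures given in the message (about 32k of lowmem per blocked thread, 20k threads); the 28k figure is an assumption that simply subtracts the 4k saved by going from an 8k to a 4k stack, and the function name is made up for illustration.

```python
KIB = 1024
MIB = 1024 * KIB

def pinned_lowmem_mib(threads, per_thread_kib):
    """Lowmem pinned by blocked threads, in MiB."""
    return threads * per_thread_kib * KIB / MIB

# Figures quoted in the message: 20k threads at roughly 32k each.
with_8k_stacks = pinned_lowmem_mib(20_000, 32)

# Assumption: the same thread with a 4k stack pins 4k less (28k in total).
with_4k_stacks = pinned_lowmem_mib(20_000, 28)

print(with_8k_stacks)                   # 625.0
print(with_4k_stacks)                   # 546.875
print(with_8k_stacks - with_4k_stacks)  # 78.125
```

Under these assumptions the smaller stack saves about 78MB out of 625MB, which is the "4k more or less" the message is weighing against the other per-thread overhead.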

10k sounds more doable. But again does 4k more or less make
a big difference with the other thread overhead? I don't think so.

And trading reliability (and functionality -- you basically have to
cut off XFS) just for 4k/thread doesn't seem like a good bargain to

Well if it is that serious a problem surely it will have hit some public
bugzillas or mailing lists? Arguing with something secret is also not
very useful.

Also I find it always important to reevaluate assumptions when new
facts come up. In this case we should reevaluate a decision that made
sense[1] in 2.4 with the new facts of 2.6 (e.g. new VM with much better
reclaim)

[1] referring to the fragmentation argument, not the 50k threads which
were always unrealistic.

-Andi
--

To: Andi Kleen <andi@...>
Cc: Daniel Hazelton <dhazelton@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Thomas Gleixner <tglx@...>
Date: Monday, April 21, 2008 - 1:53 am

On Mon, 21 Apr 2008 01:16:22 +0200

no but the other ones are order 0..

--
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
--

To: Andi Kleen <andi@...>
Cc: Arjan van de Ven <arjan@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 4:43 pm

At 12 threads per request it'd only take about 4200 outstanding requests. That
is high, but I can see it happening. At 24 threads per request the number of
outstanding requests it takes to reach that is cut in half, to about 2100.
That number is more realistic. Since all outstanding requests aren't going to
be at the extremes, let us assume that it's a mid-point between the two for
the number of outstanding requests - say somewhere around 3150 outstanding
requests.
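
The request counts above follow from dividing the 50K thread figure by the threads spawned per request. A minimal sketch of that division, using only the numbers from this message (the names are illustrative):

```python
TARGET_THREADS = 50_000

def requests_needed(threads_per_request):
    """Outstanding requests required to reach TARGET_THREADS."""
    return TARGET_THREADS / threads_per_request

high = requests_needed(12)   # a dozen threads per request
low = requests_needed(24)    # two dozen threads per request
mid = (low + high) / 2       # mid-point between the two extremes

print(round(high), round(low), round(mid))
```

The exact values come out slightly below the rounded figures in the text (roughly 4167, 2083 and 3125 rather than 4200, 2100 and 3150), which does not change the argument.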

While that is a rather high number, if a company - a decently sized one - is
using a piece of Java code internally for some reason they could easily have
that level of requests coming in from the users. For a website with a decent
load that routes a common request to the machine running the code it'd be
even easier to hit that limit. So yes, 50K threads *IS* actually pretty easy

Just makes you sound foolish. Run the numbers yourself and you'll see that it
is easy for a machine running highly threaded code to hit 50K threads.

Due to me screwing up the configuration of Apache (2) and MySQL I have seen a
machine I own hit problems with memory fragmentation - and it's running a 2.6
series kernel (a distro 2.6.17)

Because I was able to see that it was a problem I caused I didn't even *THINK*
about posting information about it to LKML. I didn't keep the logs of that
around - it happened more than three months ago and I clean the logs out
every three months or so.

DRH

--
Dialup is like pissing through a pipette. Slow and excruciatingly painful.
--

To: Daniel Hazelton <dhazelton@...>
Cc: Arjan van de Ven <arjan@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 5:40 pm

I ran the numbers and the numbers showed that you need > 1.5GB of lowmem
with a somewhat realistic scenario (32K per thread) at 50k threads. And
subtracting 4k from that 32k number won't make any significant
difference (still 1.3GB)
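
Turning the same numbers around gives the thread count that fits into a given lowmem budget. The sketch below assumes, unrealistically, that all of the 300-600MB of lowmem mentioned in this message is available to threads, and uses the 32K and 28K per-thread figures from above; the helper names are hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

def lowmem_needed_gib(threads, per_thread_kib):
    """Lowmem needed for the given number of threads, in GiB."""
    return threads * per_thread_kib * KIB / GIB

def max_threads(lowmem_mib, per_thread_kib):
    """Upper bound on threads if lowmem were used for nothing else."""
    return (lowmem_mib * MIB) // (per_thread_kib * KIB)

print(lowmem_needed_gib(50_000, 32))   # a bit over 1.5
print(lowmem_needed_gib(50_000, 28))   # a bit over 1.3

for lowmem_mib in (300, 600):
    print(lowmem_mib, max_threads(lowmem_mib, 32), max_threads(lowmem_mib, 28))
```

With these inputs the ceiling is 9600 to 19200 threads at 32K each and 10971 to 21942 at 28K each, before any other lowmem user is accounted for.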

If you claim that works on a 32bit system with typically 300-600MB
lowmem available (which is also shared by other subsystems) I know who
sounds foolish.

-Andi
--

To: Andi Kleen <andi@...>
Cc: Arjan van de Ven <arjan@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 9:45 pm

No, it won't. Which is what I was pointing out. You're hitting a different

Never said it worked on a 32bit system. I was pointing out that there can be
workloads that do reach that 50K thread-count that you seem to be
calling "stupid".

As I pointed out later in the message, I *HAVE* run into lowmem starvation on
a 32bit x86 system. You thoughtfully removed this, perhaps because you felt
it damaged your argument. The machine in question is an old P3 box with less
than 1G of memory in it. (Phys+Swap on that machine is only about 1.4G)

So yes, on a 32bit machine you run into problems at much, much less of a
workload and a much lower thread-count than the magic 50K you are so fond of
talking about. If I had been running 4K stacks on that machine I probably
would have survived the mis-configuration without the reboot it took to make
the machine functional again - I probably would still have reconfigured
Apache and MySQL, though - the machine still would have gone largely
unresponsive.

DRH

--
Dialup is like pissing through a pipette. Slow and excruciatingly painful.
--

To: Daniel Hazelton <dhazelton@...>
Cc: Andi Kleen <andi@...>, Arjan van de Ven <arjan@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Thomas Gleixner <tglx@...>
Date: Monday, April 21, 2008 - 3:51 am

Ah your point was that people might do this on 64bit systems?

They could indeed. It would not be very efficient but it should work
in theory at least with enough memory. Of course they don't need 4k
stacks for it. They can also try it on 32bit and it will work
to some extent too, just not scale very far. And 4k stack more or less
won't make much difference for that because the stack is only
a small part of the lowmem needed for a blocked thread with
open sockets.

Note I didn't come up with that number, it was quoted to me earlier
(but one of its authors has distanced himself from it now, so it
seems to be becoming more and more irrelevant now)

Stupid in this case just refers to the general observation that
it is quite inefficient to do one thread per request on servers
that are expected to process lots of long running connections.

Perhaps I could have put that better I will give you that. Please

Now that is a very doubtful claim. You realize that a functional network
server thread needs a lot more lowmem than just the stack?

-Andi

--

To: Andi Kleen <andi@...>
Cc: Arjan van de Ven <arjan@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Thomas Gleixner <tglx@...>
Date: Monday, April 21, 2008 - 1:34 pm

My point was that people might try to make such a system work on a 32bit
system and fail. The fact that the limit does exist and changing the stack
size doesn't really help things is a key there.

My point is that you can get a few more threads out of a machine with 4K
stacks, even on 32bit. Sure, the difference is basically negligible, but it
does happen. That extra available space may be the difference between a
poorly coded program triggering random crashes (and the OOM killer) and the
system surviving it.

While it's true that I feel that the job of the kernel isn't to protect the
incompetent, it should protect the competent admins from the incompetent

True. But having that tiny bit of extra memory might be the difference between

I didn't say otherwise. I was pointing out that 50K threads isn't out of the
question when looking at the workload provided (and ignoring all other memory
concerns).

However, I had hoped I wouldn't have to spell out the stuff I've had to point

Yes, I know you didn't come up with it. But in seeing the original commit-log
for it, I'm thinking that the '50K' number was initially meant as either a

Remember, you're talking about people that write the code in Java. It's going
to spawn all kinds of threads anyway. I, personally, would write the code in
a language giving me better control over the available resources. However,
I'm not employed by any major company because I will almost always refuse to

There was nothing else running on the machine and it was reporting lowmem free
in the logs, just none "usable". Since the two biggest hogs on that box are
Apache2 and MySQL - and since repairing the Apache2 config damage has halted
further OOM's on that machine, I'm pretty much certain that it was Apache2 at
fault, though since there were reports of free lowmem, I'm pretty certain it
was a combination of fragmentation and Apache2.
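
Memory that is "free" but not "usable" is what fragmentation looks like to an order-1 (two contiguous pages) request such as an 8k stack. The toy model below is not the kernel's allocator; it is a hypothetical 16-page zone in which every other page is pinned, and it only checks whether an aligned run of 2**order free pages exists.

```python
def can_alloc(free, order):
    """True if an aligned run of 2**order free pages exists."""
    size = 1 << order
    return any(
        all(free[i:i + size])
        for i in range(0, len(free) - size + 1, size)
    )

# Hypothetical zone: even-numbered pages free, odd-numbered pages pinned.
checkerboard = [i % 2 == 0 for i in range(16)]

print(sum(checkerboard))            # 8 pages are free
print(can_alloc(checkerboard, 0))   # True: single pages can be handed out
print(can_alloc(checkerboard, 1))   # False: no two adjacent pages are free
```

Half of the zone is free, yet an order-1 request fails while order-0 requests still succeed, which is the case the 4k stack option was meant to sidestep.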

DRH

--
Dialup is like pissing through a pipette. Slow and excruciatingly painful.
--
...

To: <linux-kernel@...>
Date: Sunday, April 20, 2008 - 6:17 pm

A question along this line. Why is the Userspace Thread bound to a
Kernel-Space Stack at all? I could imagine a solution like Stack Pools
assigned only if a Thread enters kernel space, or something like this?

Gruss
Bernd
--

To: Bernd Eckenfels <ecki@...>
Cc: <linux-kernel@...>
Date: Sunday, April 20, 2008 - 7:48 pm

The vast majority of threads are sleeping (with a stack footprint in the
kernel). If you have an N-way system, at most N threads can be in
userspace at any given moment.

You could multiplex several userspace threads on one kernel thread (the
M:N model), but it gets fairly complex.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

To: Andi Kleen <andi@...>
Cc: Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 8:47 am

Clearly if I have the choice between a kernel which can run 50k threads
and a kernel which does not crash under me during an I/O error, I choose
the latter! I don't even imagine what purpose 50k kernel threads may serve.
I certainly can understand that reducing memory footprint is useful, but
if we want wider testing of 4k stacks, considering they may fail in error
path in complex I/O environment, it's not likely during -rc kernels that
we'll detect problems, and if we push them down the throat of users in a
stable release, of course they will thank us very much for crashing their
NFS servers in production during peak hours.

I have nothing against changing the default setting to 4k provided that
it is easy to get back to the safe setting (ie changing a config option,
or better, a cmdline parameter). I just don't agree with the idea of
forcing users to swim in the sh*t, it only brings bad reputation to
Linux.

What would really help would be to have 8k stacks with the lower page
causing a fault and print a stack trace upon first access. That way,
the safe setting would still report us useful information without
putting users into trouble.

Willy

--

To: Willy Tarreau <w@...>
Cc: Andi Kleen <andi@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 9:27 am

..

That's the best suggestion from this thread, by far!
Can you produce a patch for 2.6.26 for this?
Or perhaps someone else here, with the right code familiarity, could?

Some sort of CONFIG option would likely be wanted to
either enable/disable this feature, of course.

Cheers
--

To: Mark Lord <lkml@...>
Cc: Willy Tarreau <w@...>, Andi Kleen <andi@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 10:09 am

Changing the default warning threshold is easy, it's just a #define.
Although setting it too low would spam syslogs on some setups.

When I was trying to cram stuff into 4k in the past, I had a patch which
added a sysctl to dynamically change the warning threshold, and
optionally BUG() when I hit it for crash analysis. It was good for
debugging, at least. If something along those lines is desired, I could
resurrect it.

-Eric
--

To: Eric Sandeen <sandeen@...>
Cc: Mark Lord <lkml@...>, Andi Kleen <andi@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 10:20 am

I thought it was checked only at a few places (eg: during irqs). If so,

we should set it slightly below the 4k limit if we want users to switch

While it's good for debugging, having users tweak the limit to eliminate
the warning is the opposite of what we're looking for. We just want to
have them report the warning without their service being disrupted.

Willy

--

To: Willy Tarreau <w@...>
Cc: Mark Lord <lkml@...>, Andi Kleen <andi@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 10:40 am

Ah, ok I skimmed your first suggestion too quickly. 100% coverage
reports on the initial access to the 2nd 4k that way would be nice.
Well, it would be nice if we all really wanted 4k stacks some day... :)

-Eric
--

To: Mark Lord <lkml@...>
Cc: Andi Kleen <andi@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 9:38 am

If we want to migrate to 4k sooner or later, this behaviour would not
need a config option, maybe just a /proc or /sys tunable to disable
the warning. Config would be either (4k + risk of crash) or (8k + warning).

The *real* issue is to decide whether we need/want 4k or not, because
I think we're still discussing the subject for no reason, as usual...

Willy

--

To: Willy Tarreau <w@...>
Cc: Mark Lord <lkml@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 10:19 am

Only if you believe that 4K stack pages are a worthy goal.
As far as I can figure out they are not. They might have been
a worthy goal on crappy 2.4 VMs, but these times are long gone.

The "saving memory on embedded" argument also does not
quite convince me, it is unclear if that is really
a significant amount of memory on these systems and if that
couldn't be addressed better (e.g. in running generally
fewer kernel threads). I don't have numbers on this,
but then the people who made this argument didn't have any
either :)

If anybody has concrete statistics on this
(including other kernel memory users in realistic situations)

The problem with his suggestion is that the lower 4K of the stack page
is accessed in normal operation too because it contains the thread_struct.
That could be changed, but it would be a relatively large change
because you would need to audit/change a lot of code that assumes
thread_struct and stack are contiguous.
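
The dependency described above comes from how 32-bit x86 kernels of that era locate the per-thread structure (struct thread_info in the source tree): they mask the stack pointer down to the start of the stack allocation. The sketch below models only that address arithmetic; the stack pointer value is made up and the helper names are illustrative.

```python
PAGE_SIZE = 4096
THREAD_SIZE_8K = 2 * PAGE_SIZE

def stack_base(sp, thread_size):
    """Start of the aligned stack area, where the per-thread struct lives."""
    return sp & ~(thread_size - 1)

def in_lower_page(addr, base):
    """True if addr falls into the lowest 4K page of the stack area."""
    return base <= addr < base + PAGE_SIZE

sp = 0xC1235F00                        # hypothetical kernel stack pointer
base = stack_base(sp, THREAD_SIZE_8K)

print(hex(base))                       # 0xc1234000
print(in_lower_page(sp, base))         # False: the stack is still in the upper page
print(in_lower_page(base, base))       # True: the per-thread struct is in the lower page
```

Even while the stack itself has not grown past the upper 4K, every lookup of the per-thread structure touches the lower page, so a fault-on-first-access guard there would trigger immediately unless the structure is moved.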

If that was changed implementing Willy's suggestion would not be that
difficult using cpa() at the cost of some general slowdown in
increased TLB misses and much higher thread creation/tear down cost etc.
Using the alternative vmalloc way has also other issues.

But still the fundamental problem is that it would likely only
hit the interesting cases in real production setups and I don't
think the production users would be very happy to slow down
their kernels and handle strange backtraces just to act as guinea pigs
for something dubious

-Andi

--

To: Andi Kleen <andi@...>
Cc: Willy Tarreau <w@...>, Mark Lord <lkml@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 12:41 pm

It is not uncommon for embedded systems to be designed around 16MiB.
Some may even have less, although I haven't encountered any of those
lately.

When dealing in those dimensions, savings of 100k are substantial. In
some cases they may be the difference between 16MiB or 32MiB, which
translates to manufacturing costs. In others it simply means that the
system can cache a bit more and run faster, or it can have a little more
functionality.

In most cases it simply allows userspace programmers to avoid looking
harder to save those 100k, as they are already saved in kernel space.
Therefore we made life hard for us in order to make life easier for
someone else, saving them time and money.

Whether that is worth it depends on your personal point of view. Many
embedded people will claim "Hell yes!" Of those that don't, most are
simply ignoring current mainline kernels and will regret the
development later. They care, they just don't tend to care enough to
engage in these discussions or even know about them. :(

Jörn

--
Eighty percent of success is showing up.
-- Woody Allen
--

To: Jörn Engel <joern@...>
Cc: Willy Tarreau <w@...>, Mark Lord <lkml@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 1:19 pm

But these are SoC systems. Do they really run x86?
(note we're talking about an x86 default option here)

Also I suspect in a true 16MB system you have to strip down
everything kernel side so much that you're pretty much outside

If you need the stack you don't have any less cache foot print.
If you don't need it you don't have any either.

-Andi
--

To: Andi Kleen <andi@...>
Cc: Willy Tarreau <w@...>, Mark Lord <lkml@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 1:43 pm

Maybe. I merely showed that embedded people (not me) have good reasons
to care about small stacks. Whether they care enough to actually spend

This part I don't understand.

Jörn

--
You ain't got no problem, Jules. I'm on the motherfucker. Go back in
there, chill them niggers out and wait for the Wolf, who should be
coming directly.
-- Marsellus Wallace
--

To: Jörn Engel <joern@...>
Cc: Willy Tarreau <w@...>, Mark Lord <lkml@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 2:19 pm

Sure but I don't think they're x86 embedded people. Right now there
are very few x86 SOCs if any (iirc there is only some obscure rise
core) and future SOCs will likely have more RAM.

Anyways I don't have a problem to give these people any special options
they need to do whatever they want. I just object to changing the
default options on important architectures to force people in completely
different setups to do part of their testing.

I was just objecting to your claim that small stack implies smaller
cache foot print. Smaller stacks rarely give you smaller cache foot
print in my kernel coding experience:

First some stack is always safety and in practice unused. It won't
be in cache.

Then typically standard kernel stack pigs are just too large
buffers on the stack which are not fully used. These also
don't have much cache foot print.

Or if you have a complicated call stack the typical fix
is to move parts of it into another thread. But that doesn't
give you less cache footprint because the cache foot print
is just in someone else's stack. In fact you'll likely
have slightly more cache foot print from that due to the
context of the other thread.

In theory if you e.g. convert a recursive algorithm
to iterative you might save some cache foot print, but I don't
think that really happens in kernel code.

-Andi
--

To: Andi Kleen <andi@...>
Cc: Willy Tarreau <w@...>, Mark Lord <lkml@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 4:35 pm

Ah, ok. The question whether 4k stacks should become the default I
prefer not touching with an 80' pole.

Jörn

--
Why do musicians compose symphonies and poets write poems?
They do it because life wouldn't have any meaning for them if they didn't.
That's why I draw cartoons. It's my life.
-- Charles Shultz
--

To: Andi Kleen <andi@...>
Cc: Willy Tarreau <w@...>, Mark Lord <lkml@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 4:32 pm

The cache I referred to is called DRAM, not L1.

Jörn

--
Don't worry about people stealing your ideas. If your ideas are any good,
you'll have to ram them down people's throats.
-- Howard Aiken quoted by Ken Iverson quoted by Jim Horning quoted by
Raph Levien, 1979
--

To: Andi Kleen <andi@...>
Cc: Jörn <joern@...>, Willy Tarreau <w@...>, Mark Lord <lkml@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 2:50 pm

On Sun, 20 Apr 2008 20:19:30 +0200

this is what Al did for the symlink recursion thing, and Jens did for the block layer...
so yes this conversion does happen for real.

--
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
--

To: Arjan van de Ven <arjan@...>
Cc: <andi@...>, <joern@...>, <w@...>, <lkml@...>, <bunk@...>, <alan@...>, <shawn.bohrer@...>, <mingo@...>, <linux-kernel@...>, <tglx@...>
Date: Sunday, April 20, 2008 - 5:50 pm

md got mostly-fixed too, via Neil's patch which sat in -mm for nearly two
years.

--

To: Andrew Morton <akpm@...>
Cc: Arjan van de Ven <arjan@...>, <andi@...>, <joern@...>, <w@...>, <lkml@...>, <bunk@...>, <alan@...>, <shawn.bohrer@...>, <linux-kernel@...>, <tglx@...>
Date: Monday, April 21, 2008 - 10:29 am

had we done the de-obfuscate-4K-stacks Kconfig change earlier it might
have gotten upstream faster.

Ingo
--

To: Andrew Morton <akpm@...>
Cc: Arjan van de Ven <arjan@...>, <joern@...>, <w@...>, <lkml@...>, <bunk@...>, <alan@...>, <shawn.bohrer@...>, <mingo@...>, <linux-kernel@...>, <tglx@...>
Date: Sunday, April 20, 2008 - 5:55 pm

Congratulations, you found three examples in 8.4MLOC.

Ok ok I should have said it only happens very rarely (I still stand by
that :)

Anyways it is moot because it was a miscommunication between me and Joerg.

-Andi

--

To: Arjan van de Ven <arjan@...>
Cc: Jörn Engel <joern@...>, Willy Tarreau <w@...>, Mark Lord <lkml@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 4:09 pm

AFAIK most symlink lookups are still recursive.

-Andi
--

To: Willy Tarreau <w@...>
Cc: Andi Kleen <andi@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 9:21 am

I've seen many bugs in error paths in the kernel and fixed quite a
few of them - and stack problems were not a significant part of them.

There are so many possible bugs (that also occur in practice) that

What actually brings bad reputation is shipping a 4k option that is
known to break under some circumstances.

And history has shown that as long as 8k stacks are available on i386
some problems will not get fixed. 4k stacks have been available as an
option on i386 for more than 4 years, and for about as long we have
known that there are some setups (AFAIK all that might still be present
seem to include XFS) that do not work reliably with 4k stacks.

If we go after stability and reputation, we have to make a decision
whether we want to get 4k stacks on 32bit architectures with 4k page
size unconditionally or not at all. That's the way that gets the maximal
number of bugs shaken out [1] for all supported configurations before

cu
Adrian

[1] obviously not all, but that's true for all classes of bugs

--

"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed

--

To: Adrian Bunk <bunk@...>
Cc: Willy Tarreau <w@...>, Andi Kleen <andi@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Monday, April 28, 2008 - 2:38 pm

A good argument for keeping the default 8k and letting people who know
what they are doing, or think they do, test their system for 4k
operation. Embedded systems typically have far better defined loads than
servers or desktops, and are less likely to have different behavior
change the stack requirements. That doesn't mean they do less, just that
the load is usually better characterized.

Vendors shipping a 4k stack kernel are probably not going to be happy if
someone nfs exports an xfs filesystem on lvm, running on md raid0
composed of raid5 arrays, containing multipath, iSCSI, SATA and nbd
devices. No, I didn't make that up, someone asked me what I thought
their problem was with that setup.

The kernel is getting more complex, and I don't think that anyone but
you is interested in making 4k stacks mandatory, or in eliminating them,
either.

You frequently take the attitude that something you don't like (like all
the old but WORKING network drivers) should be removed from the kernel,
so that people will be forced to use the new whatever and find bugs so
they can be fixed. Unfortunately in some cases the bugs are never fixed
and Linux loses a capability it once had.

The arbitrary 4k limit requires a lot of work on dropping stack usage
even more than has already been done, and is mostly an effort you want
other people to make so you can be happy (I assume that if you were
offering to do it all yourself you already would have), and most
importantly it would waste a lot of developer effort on a low return
goal, which could be used on useful new features or fixing corner case
bugs. Or drinking beer...

Hell, it wastes your time arguing about it, and you do lots of useful
things when you're not trying to force your minimalist philosophy on people.

--
Bill Davidsen <davidsen@tmr.com>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot
--

To: Adrian Bunk <bunk@...>
Cc: Willy Tarreau <w@...>, Andi Kleen <andi@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Wednesday, April 23, 2008 - 5:13 am

How about making 4k stacks incompatible with those circumstances then?
I.e. if you select 4k stacks, then you can't select XFS because we know
that _may_ fail. Similar for ndiswrapper networking, and other
stuff where problems have been noticed.

Some people don't need any of these, and can then use
safe 4k stacks. Well, at least as safe as the 8k stacks are, there is no
mathematical proof for their safety in all cases either.

Helge Hafting
--

To: Helge Hafting <helge.hafting@...>
Cc: Adrian Bunk <bunk@...>, Willy Tarreau <w@...>, Andi Kleen <andi@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Thursday, April 24, 2008 - 11:46 am

Problem is, it's the storage configuration (at administration time, not
kernel build time) that matters, too.

I have XFS on Fedora with 4k stacks on SATA /dev/sdb1 on my x86 mythbox,
and it's perfectly fine. But that's a nice, simple setup. If I stacked
more things over/under it, I'd be more likely to have trouble.

-Eric
--

To: Helge Hafting <helge.hafting@...>
Cc: Adrian Bunk <bunk@...>, Willy Tarreau <w@...>, Andi Kleen <andi@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Wednesday, April 23, 2008 - 7:29 pm

Yeah, that means every distro that supports XFS (i.e. pretty much
all of them including Fedora) will be forced to disable 4k stacks on
x86. I'd be happy with this solution.

FWIW, this would make 4k stacks pretty much unused outside of custom
kernels. At which point I'd suggest a default of 4k is wrong....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
--

To: Willy Tarreau <w@...>
Cc: Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 9:06 am

I don't know either but it was quoted to me earlier as the primary

So you're saying that only advanced users who understand all their
CONFIG options should have the safe settings? And everyone else
the "only explodes once a week" mode?

For me that is exactly the wrong way around.

If someone is sure they know what they're doing they can set whatever
crazy settings they want (given there is a quick way to check
for the crazy settings in oops reports so that I can ignore those), but
the default should be always safe and optimized for reliability.

-Andi

--

To: Andi Kleen <andi@...>
Cc: Willy Tarreau <w@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 9:30 am

That means we'll have nearly zero testing of the "crazy setting" and
when someone tries it he'll have a high probability of running into some
problems.

Such a "crazy setting" shouldn't be offered to users at all.

We should either aim at 4k stacks unconditionally for all 32bit
architectures with 4k page size or don't allow any architecture

cu
Adrian

--

To: <linux-kernel@...>
Cc: Andi Kleen <andi@...>, Willy Tarreau <w@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Monday, April 28, 2008 - 1:56 pm

I have suggested before that the solution is to allocate memory in
"stack size" units (obviously must be a multiple of the hardware page
size). The reason allocation fails is more often fragmentation than
actual lack of memory, or so it has been reported.
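
A toy model of the fragmentation point, assuming a buddy-style allocator in which an order-n block is 2**n contiguous pages aligned to its own size. The page frame numbers and the helper name are hypothetical:

```python
def can_allocate(free_pages, order):
    """True if some aligned run of 2**order free page frames exists.

    free_pages is a set of free page frame numbers.
    """
    span = 1 << order
    return any(
        pfn % span == 0 and all(pfn + i in free_pages for i in range(span))
        for pfn in free_pages
    )


if __name__ == "__main__":
    # Six free 4k pages (24k in total), none forming an aligned pair.
    free = {1, 2, 5, 8, 11, 14}
    print("4k stack (order 0):", can_allocate(free, 0))
    print("8k stack (order 1):", can_allocate(free, 1))
```

In this model the single-page request succeeds while the two-page request fails, although far more than 8k is free in total.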

--
Bill Davidsen <davidsen@tmr.com>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot

--

To: Adrian Bunk <bunk@...>
Cc: Andi Kleen <andi@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 9:34 am

I agree you make a valid point here. Then wouldn't it be easier to
simply remove 4k and agree it was a wet dream ?

Willy

--

To: Willy Tarreau <w@...>
Cc: Andi Kleen <andi@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 10:04 am

If the sh maintainer and the m68knommu maintainer (and perhaps

cu
Adrian

--

To: Andi Kleen <andi@...>
Cc: Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 8:32 am

cu
Adrian

--

To: Adrian Bunk <bunk@...>
Cc: Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 7:02 am

Which you won't fix by changing the x86 defaults. More of a problem in
embedded small devices is the 8K allocation failing in the first place -

At which point some distros will simply patch it back no doubt.
--

To: Alan Cox <alan@...>
Cc: Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 7:54 am

Stuff like nfsd, xfs and raid is covered by the x86 defaults.

Red Hat seems to get usable kernels with 4k for some years?

If we get whatever is still missing for 4k working once and then the
coverage of all i386 -rc testers for noticing new issues immediately
there should be no stability reason for distros to patch it back in.

cu
Adrian

--

To: Adrian Bunk <bunk@...>
Cc: Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 8:37 am

One way they do that is by marking significant parts of the kernel
unsupported. I don't think that's an option for mainline.

-Andi
--

To: Adrian Bunk <bunk@...>
Cc: Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 7:37 am

On Sun, 20 Apr 2008 14:54:55 +0300

You don't get to dictate to people however.

Alan
--
"If we become a great evil avaricious hegemony, I wanna cool uniform"
-- robk
--

To: Alan Cox <alan@...>
Cc: Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 8:18 am

Everyone is free to patch whatever stacksize he wants into his kernel.

But the more users get 4k stacks, the more testing we have, and the
better both existing and new bugs get shaken out.

And if there were only 4k stacks in the vanilla kernel, and therefore
all people on i386 testing -rc kernels would get it, that would give a
better chance of finding stack regressions before they get into a
stable kernel.

If a distribution or user then wants to increase it that's his choice

cu
Adrian

--

To: Adrian Bunk <bunk@...>
Cc: Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 10:05 am

Heck, maybe you should make it 2k by default in all -rc kernels; that
way when people run -final with the 4k it'll be 100% bulletproof, right?
'cause all those piggy drivers that blow a 2k stack will finally have
to get fixed? Or leave it at 2k and find a way to share pages for
stacks, think how much memory you could save and how many java threads
you could run!

4K just happens to be the page size; other than that it's really just
some random/magic number picked, and now it's dictated that if you (and
everything around you) don't fit, you're broken.

That bugs me.

-Eric

(yes, I know there are advantages to only allocating a single page for a
new thread, but from an "all callchains after that must fit in that
space" perspective, it's just a randomly picked number)
--

To: Eric Sandeen <sandeen@...>
Cc: Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Monday, April 21, 2008 - 3:45 am

Some number has to be picked. Why is fitting in 4k "bad" and fitting
in 8k "not bad"?

Look what happens when this number is too big: Windows is "generous",
and as a result Windows drivers routinely need 12k, sometimes 16k of stack.
We know it from ndiswrapper. We don't want to go that way, right?

Forget about 50k threads. 4k of waste per process is a waste nevertheless.
It's not at all unusual to have 250+ processes, and 250 processes with 8k
stacks, each wasting 4k, waste about 1M. Do you think an extra 1M won't be
useful to have?
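
A quick check of that arithmetic, as a minimal sketch. The function name is made up, and the assumption that each task needs only 4k of its 8k stack is taken from the argument above, not from a measurement:

```python
KIB = 1024


def wasted_bytes(n_tasks, stack_bytes, needed_bytes):
    """Stack memory allocated beyond what each task actually needs."""
    return n_tasks * max(stack_bytes - needed_bytes, 0)


if __name__ == "__main__":
    waste = wasted_bytes(250, 8 * KIB, 4 * KIB)
    print(f"{waste} bytes = {waste // KIB} KiB")
```

For 250 tasks this comes to 1,024,000 bytes, i.e. 1000 KiB, which is the roughly 1M quoted above.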

It seems that 4k works for everybody sans xfs. Making it work took some effort,
but it is already done. Why not use it after all?

And since i386 is such a common architecture, other 32-bit arches will be
relieved from the burden of hunting down stack overflows which happen
only on those arches. (For example, different ABI or different gcc behavior
may make $OTHER_ARCH slightly more stack-greedy). God knows non-mainstream
arches have enough problems already.
--
vda
--

To: Denys Vlasenko <vda.linux@...>
Cc: Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Monday, April 21, 2008 - 9:29 am

Because well-written code in several subsystems, used in combination in
common configurations, does not always fit, that is why.

Show me the "bug" in an nfs+xfs+md+scsi writeback stack oops and I'm
sure it'll get "fixed." But if it's simply complex code that happens to
need >4k, I will continue to argue that the limited stack size selection
is the problem, not the code running in it.

Perhaps not surprisingly, ext4, which is significantly more complex than
ext3, has many more individual functions > 100 bytes than ext3 has. As
others have said, there is no trend towards smaller, simpler, less
interesting, and less functional code which fits in a smaller and
smaller footprint in the general case.

If someone has a workload and configuration which happens to fit in 4k
then turn it on, test the heck out of it, and have fun. I've not seen
what I consider to be a convincing argument for making it the default
for everyone.

-Eric
--

To: Eric Sandeen <sandeen@...>
Cc: Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Monday, April 21, 2008 - 3:51 pm

Why does nfs+xfs+md+ide work? Does scsi intrinsically require more stack
than ide?

Why is xfs code said to be 5 times bigger than e.g. reiserfs?

8k stack is limited too. Other Operating System, no doubt in the name
of better stability, has even larger stack (16k or more).

For what it's worth, I do realize that there is a point of diminishing
returns and increased pain when one tries to reduce stack usage.

Conversely:

"If someone is strongly concerned about possibility of stack overflow,
then turn on 8k option, and enjoy the benefits of wide testing which
is provided by millions of people who run 4k stacks. If _that_ works
ok in practice, 8k _ought_ to be 100.00% safe versus stack overflow".

These threads about 4k stacks seem to degenerate into ping-ponging
of these arguments again and again.
--
vda
--

To: Denys Vlasenko <vda.linux@...>
Cc: Eric Sandeen <sandeen@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Monday, April 21, 2008 - 9:28 pm

Luck?

With 4k stacks, you really don't need NFS at all - you just have to
enter memory reclaim at the wrong time (i.e. when something else

If we cut the bulkstat code out, the handle interface, the
preallocation, the journalled quota, the delayed allocation, all the
runtime validation, the shutdown code, the debug code, the tracing

Writeback is done under ENOMEM pressure, and XFS can't provide the
guarantees mempools need to work. That leaves the stack as the only
place we can put the things we need. e.g. the args structures that
tell the allocator what to do and retain state between subsequent
low level allocation calls use ~250 bytes of stack just by
themselves....

We've already chopped off the low hanging fruit, added noinline to
every function definition to prevent compiler heuristics from
blowing out stack usage by 25% and reduced use of temporary
variables as much as possible. There's very little fat left to trim,
and still we can't reliably fit in 4k stacks.

Patches are welcome - I'd be over the moon if any of the known 4k
stack advocates sent a stack reduction patch for XFS, but it seems
that actually trying to fix the problems is much harder than
resending a one line patch every few months....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
--

To: David Chinner <dgc@...>
Cc: Denys Vlasenko <vda.linux@...>, Eric Sandeen <sandeen@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 27, 2008 - 3:27 pm

Just noticed this bit of FUD. Last time I did some static analysis on
stack usage, reiserfs alone would blow away 3k, while xfs was somewhere
below. Reiserfs was improved afaik, but I'd still expect it to be worse
than xfs until shown otherwise.

Maybe reiserfs simply isn't used that much in nfs+*fs+md+whatnot+scsi
setups?

Jörn

--
Courage is not the absence of fear, but rather the judgement that
something else is more important than fear.
-- Ambrose Redmoon
--

To: Jörn Engel <joern@...>
Cc: David Chinner <dgc@...>, Eric Sandeen <sandeen@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 27, 2008 - 7:02 pm

I'm sorry, but it's not what I said.
I didn't say reiserfs eats less stack. I don't know.
I said it is smaller.

reiserfs/* 821474 bytes
xfs/* 3019689 bytes
--
vda
--

To: Denys Vlasenko <vda.linux@...>
Cc: Jörn Engel <joern@...>, David Chinner <dgc@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 27, 2008 - 7:08 pm

FWIW, the reason for that is in large part all the features Dave listed
above, and probably more.

And, while certainly not yet tiny, the recent trend actually is that xfs
is getting a bit smaller:

http://oss.sgi.com/~sandeen/xfs-linedata.png

(note, though - the Y axis does not start at 0) :)

-Eric
--

To: Eric Sandeen <sandeen@...>
Cc: Jörn Engel <joern@...>, David Chinner <dgc@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 27, 2008 - 8:00 pm

~30% line count reduction? Impressive, especially in this age
of creeping bloat. Thanks.
--
vda
--

To: David Chinner <dgc@...>
Cc: Eric Sandeen <sandeen@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Tuesday, April 22, 2008 - 8:48 am

And yet, I got four screenfuls of

fs/xfs/XXXXX.c: warning: unused parameter 'foo'

when I added -Wunused-parameter to the Makefile.

Sent a few.
I would like to ask you to ACK/NAK every individual patch
in some reasonable period of time, say, 1-3 days. If you NAK a patch,
please let me know what is wrong with it.

I am not at all eager to experience a repeat of the aic7xxx
patch saga, when I was not getting any meaningful reply
for months.

Best regards, Denys.
--
vda
--

To: Denys Vlasenko <vda.linux@...>
Cc: David Chinner <dgc@...>, Eric Sandeen <sandeen@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Tuesday, April 22, 2008 - 9:01 am

I know the feeling of resending patches again and again without any
reaction quite well, but that's not David's fault and not true for XFS
patches, so when you try to put pressure on him you hit the wrong

cu
Adrian

--

To: Adrian Bunk <bunk@...>
Cc: David Chinner <dgc@...>, Eric Sandeen <sandeen@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Tuesday, April 22, 2008 - 9:51 am

Yeah, sorry about that. I was not implying that XFS people were
not responsive.
--
vda
--

To: David Chinner <dgc@...>
Cc: Eric Sandeen <sandeen@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Monday, April 21, 2008 - 10:33 pm

The kmem_free() function takes (ptr, size) arguments but doesn't
actually use the second one.

This patch removes the size argument from all callsites.

Code size difference on 32-bit x86:

# size */fs/xfs/xfs.o
text data bss dec hex filename
391271 2748 1708 395727 609cf linux-2.6-xfs0-TEST/fs/xfs/xfs.o
390739 2748 1708 395195 607bb linux-2.6-xfs1-TEST/fs/xfs/xfs.o
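
The saving can be read straight off the two rows. A small sketch, assuming the six-column layout printed by size above; the helper names are made up:

```python
def parse_size_row(row):
    """Parse one data row of `size` output: text data bss dec hex filename."""
    text, data, bss, dec, hex_total, filename = row.split(None, 5)
    return {
        "text": int(text),
        "data": int(data),
        "bss": int(bss),
        "dec": int(dec),
        "hex": int(hex_total, 16),
        "filename": filename,
    }


def text_delta(before_row, after_row):
    """Change in text-segment size, in bytes (negative means smaller)."""
    return parse_size_row(after_row)["text"] - parse_size_row(before_row)["text"]


# The two rows quoted above.
BEFORE = "391271 2748 1708 395727 609cf linux-2.6-xfs0-TEST/fs/xfs/xfs.o"
AFTER = "390739 2748 1708 395195 607bb linux-2.6-xfs1-TEST/fs/xfs/xfs.o"

if __name__ == "__main__":
    print(text_delta(BEFORE, AFTER))
```

On these rows the text segment shrinks by 532 bytes, while data and bss are unchanged.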

Compile-tested only.

Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
--
vda

To: Denys Vlasenko <vda.linux@...>
Cc: David Chinner <dgc@...>, Eric Sandeen <sandeen@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Tuesday, April 22, 2008 - 6:02 pm

Ack. Pulled into my qa tree.

FWIW, can you send patches inline next time? It makes it easier to
quote them on review....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
--

To: Denys Vlasenko <vda.linux@...>
Cc: David Chinner <dgc@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Monday, April 21, 2008 - 11:09 pm

I didn't expect it to but this does reduce a few things slightly.

On x86_64:

-xfs_attr_leaf_list_int 200
+xfs_attr_leaf_list_int 184

-xfs_dir2_sf_to_block 136
+xfs_dir2_sf_to_block 120

-xfs_ifree_cluster 136
+xfs_ifree_cluster 120

-xfs_inumbers 184
+xfs_inumbers 168

-xfs_mount_free 24

Thanks,
-Eric

--

To: Denys Vlasenko <vda.linux@...>
Cc: David Chinner <dgc@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Monday, April 21, 2008 - 11:35 pm

And on x86, just for the record (fedora 9 config in both cases...)

-xfs_attr_leaf_inactive 36
+xfs_attr_leaf_inactive 32

-xfs_attr_shortform_list 40
+xfs_attr_shortform_list 36

-xfs_da_grow_inode 96
+xfs_da_grow_inode 92

-xfs_dir2_grow_inode 116
+xfs_dir2_grow_inode 104

-xfs_dir2_leaf_getdents 176
+xfs_dir2_leaf_getdents 172

-xfs_dir2_sf_to_block 92
+xfs_dir2_sf_to_block 88

-xfs_ifree_cluster 108
+xfs_ifree_cluster 104

-xfs_inumbers 88
+xfs_inumbers 84

-xfs_lock_inodes 24
+xfs_lock_inodes 28
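
To put one number on the list above, here is a minimal sketch that pairs the "-" and "+" lines and sums the per-function changes. The parsing helper is made up; the figures are the ones quoted:

```python
# Per-function frame sizes copied from the x86 listing above:
# "-" lines are before the patch, "+" lines are after it.
CHECKSTACK_DIFF = """\
-xfs_attr_leaf_inactive 36
+xfs_attr_leaf_inactive 32
-xfs_attr_shortform_list 40
+xfs_attr_shortform_list 36
-xfs_da_grow_inode 96
+xfs_da_grow_inode 92
-xfs_dir2_grow_inode 116
+xfs_dir2_grow_inode 104
-xfs_dir2_leaf_getdents 176
+xfs_dir2_leaf_getdents 172
-xfs_dir2_sf_to_block 92
+xfs_dir2_sf_to_block 88
-xfs_ifree_cluster 108
+xfs_ifree_cluster 104
-xfs_inumbers 88
+xfs_inumbers 84
-xfs_lock_inodes 24
+xfs_lock_inodes 28
"""


def frame_deltas(diff_text):
    """Map function name -> change in frame size (after minus before)."""
    before, after = {}, {}
    for line in diff_text.splitlines():
        line = line.strip()
        if not line:
            continue
        sign, rest = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"unexpected line: {line!r}")
        name, size = rest.split()
        (before if sign == "-" else after)[name] = int(size)
    # Only functions listed on both sides can be compared.
    return {name: after[name] - before[name] for name in before if name in after}


if __name__ == "__main__":
    deltas = frame_deltas(CHECKSTACK_DIFF)
    print(sum(deltas.values()), "bytes net across", len(deltas), "functions")
```

The net change is 36 bytes saved across nine functions. Since these functions are not all on one call chain, the sum is only indicative of the saving on any given path.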

-Eric
--

To: David Chinner <dgc@...>
Cc: Eric Sandeen <sandeen@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Monday, April 21, 2008 - 11:03 pm

Hi David,

xfs_flush_pages() does not use some of its parameters, namely:
first, last and fiopt.

This patch removes these parameters from all callsites.

Code size difference on 32-bit x86:

text data bss dec hex filename
390739 2748 1708 395195 607bb linux-2.6-xfs1-TEST/fs/xfs/xfs.o
390567 2748 1708 395023 6070f linux-2.6-xfs2-TEST/fs/xfs/xfs.o

Compile-tested only.

Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
--
vda

To: Denys Vlasenko <vda.linux@...>
Cc: David Chinner <dgc@...>, Eric Sandeen <sandeen@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Tuesday, April 22, 2008 - 6:07 pm

These were never removed because they are place holders for
stuff that Linux didn't support when the original port was done.
Now that Linux supports range flushes, these functions should be changed
to do that, and hence the first/last parameters will be used.

But the fiopt flag can probably be killed....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
--

To: Denys Vlasenko <vda.linux@...>
Cc: David Chinner <dgc@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Monday, April 21, 2008 - 11:15 pm

FWIW this one actually does not seem to reduce stack usage anywhere.

-Eric
--

To: Eric Sandeen <sandeen@...>
Cc: David Chinner <dgc@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Tuesday, April 22, 2008 - 4:57 am

I hope this will not deteriorate into a contest whether
every particular patch reduces stack usage or not, but:

You do not see reduced stack usage in "make checkstack",
because "make checkstack" shows only stack usage caused by
local variables (it analyses sub %esp,NN instructions which
make room for them). Parameters also take up stack, but
they are pushed onto the stack with push instructions,
and so are invisible in "make checkstack" output.
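
A simplified sketch of that kind of scan, to show what it can and cannot see. This is not the kernel's checkstack script, which is more careful; the disassembly listing is hand-written and hypothetical, and every push is assumed to move the stack pointer by 4 bytes as on 32-bit x86:

```python
import re

FUNC_RE = re.compile(r"^[0-9a-f]+ <(\w+)>:$")
SUB_ESP_RE = re.compile(r"\bsub\s+\$0x([0-9a-f]+),%esp\b")
PUSH_RE = re.compile(r"\bpushl?\s")

# Hand-written, simplified listing in the style of objdump output.
SAMPLE = """\
00000000 <caller>:
   0:   sub    $0x1c,%esp
   3:   push   $0x0
   5:   push   $0x0
   7:   pushl  0x28(%esp)
   b:   call   callee
  10:   add    $0x28,%esp
  13:   ret
"""


def scan(disassembly):
    """Return (bytes reserved by sub ...,%esp, bytes pushed) per function."""
    local_bytes, push_bytes = {}, {}
    current = None
    for raw in disassembly.splitlines():
        line = raw.strip()
        header = FUNC_RE.match(line)
        if header:
            current = header.group(1)
            local_bytes[current] = 0
            push_bytes[current] = 0
            continue
        if current is None:
            continue
        sub = SUB_ESP_RE.search(line)
        if sub:
            # This is the only thing a sub-based scan reports.
            local_bytes[current] = max(local_bytes[current], int(sub.group(1), 16))
        elif PUSH_RE.search(line):
            # Invisible to a sub-based scan, but still stack.
            push_bytes[current] += 4
    return local_bytes, push_bytes


if __name__ == "__main__":
    print(scan(SAMPLE))
```

In the sample, a scan that looks only at sub instructions reports 28 bytes for caller and misses the 12 bytes pushed as arguments.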
--
vda
--

To: Denys Vlasenko <vda.linux@...>
Cc: David Chinner <dgc@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Tuesday, April 22, 2008 - 8:51 am

Sorry if you took it that way; since the patch was in response to Dave's
mention of accepting stack-reducing patches, I thought it was worth
checking and highlighting whether it seemed to help. It wasn't supposed

Hm, I had assumed that the %esp subtraction also made room for the
arguments pushed onto the stack. Is there no way to analyze that part?

Thanks,
-Eric
--

To: Denys Vlasenko <vda.linux@...>
Cc: Eric Sandeen <sandeen@...>, David Chinner <dgc@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Tuesday, April 22, 2008 - 5:56 am

That on i?86 actually depends on whether -maccumulate-outgoing-args
is on or off (the default is off for -Os and most pre-i686 tunings,
and on for i686 and most post-i686 tunings when not -Os).

Jakub
--

To: Jakub Jelinek <jakub@...>
Cc: Eric Sandeen <sandeen@...>, David Chinner <dgc@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Tuesday, April 22, 2008 - 6:33 am

I trust you know it better than I.

I removed a few parameters of a non-static, non-inline function.
Since at the call site gcc has no way of knowing that these parameters
will not be used by the callee, and the function is not regparm
(explicitly or implicitly by being static), I am fairly sure
gcc is putting these parameters on the stack.

"make checkstack" doesn't see any difference. It can only
mean that "make checkstack" does not account for stack space
taken by parameters, not that there is no difference
in stack usage after this change. That is simply not possible IMO.
--
vda
--

To: David Chinner <dgc@...>
Cc: Eric Sandeen <sandeen@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Monday, April 21, 2008 - 11:14 pm

Hi David,

xfs_flush_pages() flags parameter is declared as uint64_t, but
the code never passes values which do not fit into 32 bits.
All callsites but one pass zero, and the last one passes
XFS_B_DELWRI, XFS_B_ASYNC or zero.
These values are defined in enum xfs_buf_flags_t and they
all fit in 32 bits.

This patch changes the type of the parameter, and of one variable
which is used to pass it, to unsigned int.
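
As a rough sketch of why the narrower type helps on 32-bit x86: an argument
passed on the stack occupies whole 4-byte slots, so a 64-bit flags word
takes two slots where a 32-bit one takes one. The flag values and helper
names below are made up for illustration; the real values are the ones in
enum xfs_buf_flags_t.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag values, standing in for the real enum members. */
enum example_buf_flags {
    EX_B_ASYNC  = 1u << 4,
    EX_B_DELWRI = 1u << 5,
};

/*
 * Number of 4-byte stack slots an argument of the given size occupies
 * when it is passed on the stack on 32-bit x86.
 */
static size_t stack_slots_i386(size_t arg_size)
{
    return (arg_size + 3) / 4;
}

/* Nonzero when a 64-bit flags word loses nothing if narrowed to 32 bits. */
static int fits_in_32_bits(uint64_t flags)
{
    return (flags >> 32) == 0;
}
```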

Code size difference on 32-bit x86:

# size */fs/xfs/xfs.o
   text    data     bss     dec     hex filename
 390567    2748    1708  395023   6070f linux-2.6-xfs2-TEST/fs/xfs/xfs.o
 390507    2748    1708  394963   606d3 linux-2.6-xfs3-TEST/fs/xfs/xfs.o

Compile-tested only.

Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
--
vda

To: Denys Vlasenko <vda.linux@...>
Cc: David Chinner <dgc@...>, Eric Sandeen <sandeen@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Tuesday, April 22, 2008 - 6:08 pm

Can you fold this into the previous patch that kills fiopt to
this function?

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
--

To: David Chinner <dgc@...>
Cc: Eric Sandeen <sandeen@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Tuesday, April 22, 2008 - 5:42 am

Hi David,

xfs_qm_dqpurge() does not use flags parameter.
This patch removes it.

Code size difference on 32-bit x86:

# size */fs/xfs/xfs.o
   text    data     bss     dec     hex filename
 390507    2748    1708  394963   606d3 linux-2.6-xfs3-TEST/fs/xfs/xfs.o
 390491    2748    1708  394947   606c3 linux-2.6-xfs4-TEST/fs/xfs/xfs.o

Compile-tested only.

Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
--
vda

To: Denys Vlasenko <vda.linux@...>
Cc: David Chinner <dgc@...>, Eric Sandeen <sandeen@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Wednesday, April 23, 2008 - 4:18 am

FYI: if you want to submit xfs patches it makes a lot of sense to send
them to the xfs list..

--

To: Denys Vlasenko <vda.linux@...>
Cc: David Chinner <dgc@...>, Eric Sandeen <sandeen@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Tuesday, April 22, 2008 - 6:11 pm

Ok. Will test.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
--

To: David Chinner <dgc@...>
Cc: Eric Sandeen <sandeen@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Tuesday, April 22, 2008 - 6:16 am

Hi David,

xfs_iomap_write_allocate() does not use count parameter.
This patch removes it.

Code size difference on 32-bit x86:

# size */fs/xfs/xfs.o
393457 2904 2952 399313 617d1 linux-2.6-xfs4-TEST/fs/xfs/xfs.o
393441 2904 2952 399297 617c1 linux-2.6-xfs5-TEST/fs/xfs/xfs.o

Compile tested only.

Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
--
vda

To: Denys Vlasenko <vda.linux@...>
Cc: David Chinner <dgc@...>, Eric Sandeen <sandeen@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Tuesday, April 22, 2008 - 6:33 pm

Hmmm - I'm wondering if that is actually a bug. Certainly the
code is in conflict with the comment for the function, and
it points out that I could have fixed a recent bug in a better
way.

I'm going to hold off this one until I've had time to look at this
in more detail....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
--

To: David Chinner <dgc@...>
Cc: Eric Sandeen <sandeen@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Tuesday, April 22, 2008 - 7:20 am

Hi David,

xfs_bmap_add_free and xfs_btree_read_bufl functions
use some of their parameters only in some cases
(e.g. if DEBUG is defined, or on non-Linux OS :)

This patch removes these parameters using #define hack
which makes them "disappear" without the need of uglifying
every callsite with #ifdefs.
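
The patch itself is not reproduced in this archive, so the following is only
a sketch of the general shape of such a #define trick, with made-up names
(add_free, add_free_impl). Callers always write both arguments; in
non-debug builds the macro drops the second one before the real call.

```c
#include <stdio.h>

#ifdef DEBUG
/* Debug build: the extra argument is really passed and used. */
static int add_free_impl(int len, const char *caller)
{
    printf("add_free(%d) from %s\n", len, caller);
    return len;
}
#define add_free(len, caller) add_free_impl((len), (caller))
#else
/* Non-debug build: the macro discards the argument at the call site. */
static int add_free_impl(int len)
{
    return len;
}
#define add_free(len, caller) add_free_impl((len))
#endif

/* The call site looks the same in both configurations. */
static int example_caller(void)
{
    return add_free(8, "example_caller");
}
```

Note that in the non-debug case the dropped argument is not evaluated or
type-checked at all.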

Code size difference on 32-bit x86:
393457 2904 2952 399313 617d1 linux-2.6-xfs6-TEST/fs/xfs/xfs.o
393441 2904 2952 399297 617c1 linux-2.6-xfs7-TEST/fs/xfs/xfs.o

Compile tested only.

Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
--
vda

To: Denys Vlasenko <vda.linux@...>
Cc: David Chinner <dgc@...>, Eric Sandeen <sandeen@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Tuesday, April 22, 2008 - 6:43 pm

We don't use pre-processor hacks to hide function variables for different
config options. The XFS header files are messy enough without adding
additional redefinitions of function types to them.

w.r.t xfs_bmap_add_free(), the correct thing to do is to factor the
debug code out into a different function that is only compiled
on debug kernels and remove all the debug checks from xfs_bmap_add_free().

As it is, I don't think that the change is worth the maintenance
cost for a few bytes of stack space in non-critical paths.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
--

To: Denys Vlasenko <vda.linux@...>
Cc: David Chinner <dgc@...>, Eric Sandeen <sandeen@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Tuesday, April 22, 2008 - 10:28 am

Elimination of completely unused parameters makes sense, but IMHO using
such #define hacks for minuscule code size and stack usage advantages is
not worth it.

cu
Adrian

--

"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed

--

To: Adrian Bunk <bunk@...>
Cc: David Chinner <dgc@...>, Eric Sandeen <sandeen@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Tuesday, April 22, 2008 - 12:17 pm

In busybox this trick is used extensively.

I don't know how to eliminate these unused parameters with less
intervention, but I also don't want to leave it unfixed.

I want to eventually reach the state with no warnings
about unused parameters.
--
vda
--

To: Denys Vlasenko <vda.linux@...>
Cc: David Chinner <dgc@...>, Eric Sandeen <sandeen@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Tuesday, April 22, 2008 - 1:21 pm

Busybox does not have more than one million lines changed in
one release.

In the Linux kernel maintainability is much more important than in

The standard kernel pattern is using empty static inline functions (that
allow type checking).
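
A minimal sketch of that pattern, using hypothetical names (struct item,
item_check, item_process): the debug-only helper becomes an empty static
inline in non-debug builds, so callers need no #ifdefs and the compiler
still checks the argument types.

```c
#include <assert.h>

struct item {
    int id;
};

#ifdef DEBUG
/* Debug build: a real consistency check. */
static inline void item_check(const struct item *it)
{
    assert(it->id >= 0);
}
#else
/* Non-debug build: empty body, but the prototype is still type-checked. */
static inline void item_check(const struct item *it)
{
    (void)it;
}
#endif

/* The caller is identical in both configurations. */
static int item_process(const struct item *it)
{
    item_check(it);
    return it->id * 2;
}
```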

And I'm not sure whether the number of functions you'd have to change

cu
Adrian

--

To: Adrian Bunk <bunk@...>
Cc: Denys Vlasenko <vda.linux@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Tuesday, April 22, 2008 - 1:26 pm

It would be a huge undertaking.

Just building xfs w/ the warning in place exposes tons of unused
parameter warnings from outside xfs as well.

But, if it was deemed important enough, you could go annotate them as
unused, I suppose, and hack away at it... Does marking as unused just
shut up the warning or does it let gcc do further optimizations?

-Eric
--

To: Eric Sandeen <sandeen@...>
Cc: Adrian Bunk <bunk@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Tuesday, April 22, 2008 - 4:46 pm

It just shuts up the warning. It is still useful - suppresses
false positives.

I didn't check whether gcc is clever enough to reuse stack space
occupied by unused parameter(s) as a free space for automatic
variables. In theory it is allowed to do that and reduce stack usage
that way.
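
For reference, a small sketch of what such an annotation looks like with
the GCC attribute. The macro and function names are invented for the
example. The typical case is a callback whose signature is fixed by a
function pointer type, so the parameter cannot simply be deleted.

```c
/* Local shorthand for the GCC attribute; the macro name is made up here. */
#define EX_UNUSED __attribute__((unused))

/* A fixed callback signature, as found in an ops vector. */
typedef int (*ex_op_fn)(void *ctx, int flags);

/* This implementation has no use for 'flags' but must still accept it. */
static int ex_op_ignore_flags(void *ctx, int flags EX_UNUSED)
{
    return *(int *)ctx + 1;
}

/* Calls through the pointer; the argument is still passed by the caller. */
static int ex_call(ex_op_fn fn, void *ctx)
{
    return fn(ctx, 0);
}
```

The attribute only silences -Wunused-parameter for that parameter; the
caller still passes the argument.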
--
vda
--

To: Eric Sandeen <sandeen@...>
Cc: Adrian Bunk <bunk@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Tuesday, April 22, 2008 - 1:50 pm

Eh... I meant "no warnings about unused parameters" for fs/xfs/* only,
not for the entire kernel. I filter out other warnings.

I want to do it not as an exercise in perfectionism,
but as a means of making sure we do not waste stack
passing useless parameters, which is important for xfs.
--
vda
--

To: Denys Vlasenko <vda.linux@...>
Cc: Eric Sandeen <sandeen@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Tuesday, April 22, 2008 - 2:28 pm

That's not really maintainable, and the stack gains are too small for

cu
Adrian

--

To: Adrian Bunk <bunk@...>
Cc: Eric Sandeen <sandeen@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Tuesday, April 22, 2008 - 3:32 pm

Why? Adding -Wunused -Wunused-parameter in fs/xfs/Makefile:

EXTRA_CFLAGS += -I$(src) -I$(src)/linux-2.6 -funsigned-char
#EXTRA_CFLAGS += -Wunused -Wunused-parameter

and making a test build with it uncommented once in a while
will reveal a bit of fallout, which is then fixed.
busybox source is thrice as big as xfs source
and from the experience I'd say it's not difficult

I promise to take a look at the critical (wrt stack use) path next.
--
vda
--

To: Denys Vlasenko <vda.linux@...>
Cc: Eric Sandeen <sandeen@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Tuesday, April 22, 2008 - 7:53 pm

The problem isn't in the Makefile, the problem is the ugly #ifdefs in
the code.

And for getting the stack problems fixed the effect is anyway by two

cu
Adrian

--

To: David Chinner <dgc@...>
Cc: Eric Sandeen <sandeen@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Tuesday, April 22, 2008 - 7:51 am

[ resend: now with patch attached! :) ]

Hi David,

Seven xfs_trans_XXX functions declared in xfs_trans.h
are not using "tp" parameter in non-debug builds,
but it still takes stack space since these functions
are not static and gcc cannot optimize it out.

This patch removes these parameters using #define hack
which makes them "disappear" without the need of uglifying
every callsite with #ifdefs.

Code size difference on 32-bit x86:
393441 2904 2952 399297 617c1 linux-2.6-xfs7-TEST/fs/xfs/xfs.o
393289 2904 2952 399145 61729 linux-2.6-xfs8-TEST/fs/xfs/xfs.o

Compile tested only.

Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
--
vda

To: Denys Vlasenko <vda.linux@...>
Cc: David Chinner <dgc@...>, Eric Sandeen <sandeen@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Tuesday, April 22, 2008 - 6:47 pm

Same as my last comments - I don't think the savings are
worth the additional clutter it introduces.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
--

To: David Chinner <dgc@...>
Cc: Eric Sandeen <sandeen@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Tuesday, April 22, 2008 - 9:32 am

Hi David,

Inline functions xfs_dir2_dataptr_to_byte and xfs_dir2_byte_to_dataptr
are not using their 1st argument. gcc is able to optimize that out.

I still want to delete these parameters, as they serve no useful purpose
and by removing them I can make gcc notice some additional
unused variables in the callers of these inlines, and warn me
about that.

There is no object code size difference from this change.

Compile tested only.

Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
--
vda

To: David Chinner <dgc@...>
Cc: Eric Sandeen <sandeen@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Tuesday, April 22, 2008 - 9:40 am

Hi David,

This patch deals with remaining cases of unused parameters in fs/xfs/quota/*
as far as I can see so far. The rest of unused parameters
in fs/xfs/quota/* cannot be easily eliminated due to addresses
of functions being taken.

Code size difference on 32-bit x86:
393289 2904 2952 399145 61729 linux-2.6-xfs8-TEST/fs/xfs/xfs.o
393236 2904 2952 399092 616f4 linux-2.6-xfs9-TEST/fs/xfs/xfs.o

Compile tested only.

Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
--
vda

To: Denys Vlasenko <vda.linux@...>
Cc: David Chinner <dgc@...>, Eric Sandeen <sandeen@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Tuesday, April 22, 2008 - 7:08 pm

I'd just kill the parameters to xfs_qm_hold_quotafs_ref and
xfs_qm_rele_quotafs_ref and I wouldn't worry about removing the debug-only
id parameter to xfs_qm_dqread as it's not in a stack critical path.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
--

To: David Chinner <dgc@...>
Cc: Eric Sandeen <sandeen@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Tuesday, April 22, 2008 - 9:46 am

Hi David,

Inline function xfs_put_perag() in fs/xfs/xfs_mount.h is a no-op.

This patch converts it to no-op macro.

As a result, gcc will emit warning about unused variables,
parameters and so on not in this function, but in its callers,
which is more useful.

This patch, together with previous ones, has already resulted
in more unused params discovered and warned about by gcc.

There is no object code size difference from this change.

Compile tested only.

Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
--
vda

To: Denys Vlasenko <vda.linux@...>
Cc: David Chinner <dgc@...>, Eric Sandeen <sandeen@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Tuesday, April 22, 2008 - 7:16 pm

xfs_put_perag() is paired with xfs_get_perag() and should never be
called by itself. It is a stub for AG reference counting the
in-memory per-ag structures and, in future, locking to allow us to
avoid certain deadlocks that can occur (rarely) when growing and
shrinking the filesystem.

Also, I've got patches that put stuff in this function, so I'd
prefer to leave it as it is right now...

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
--

To: Denys Vlasenko <vda.linux@...>
Cc: David Chinner <dgc@...>, Adrian Bunk <bunk@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Tuesday, April 22, 2008 - 10:08 am

Denys, thanks for going through all this; I didn't mean to discount the
work with the stackcheck reports. I've done a lot of similar xfs
pruning in the past, and every little bit helps. It is still hard to
find significant reductions in the critical callchains though!

If the xfs codebase gets to the point where things are fairly well
cleaned up it might be nice to add the gcc warning to the makefiles, add
unused attributes to the vfs ops vectors as needed, and keep it clean
from this point on...

Thanks,

-Eric
--

To: David Chinner <dgc@...>
Cc: Eric Sandeen <sandeen@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Tuesday, April 22, 2008 - 7:48 am

Hi David,

Seven xfs_trans_XXX functions declared in xfs_trans.h
are not using "tp" parameter in non-debug builds,
but it still takes stack space since these functions
are not static and gcc cannot optimize it out.

This patch removes these parameters using #define hack
which makes them "disappear" without the need of uglifying
every callsite with #ifdefs.

Code size difference on 32-bit x86:
393441 2904 2952 399297 617c1 linux-2.6-xfs7-TEST/fs/xfs/xfs.o
393289 2904 2952 399145 61729 linux-2.6-xfs8-TEST/fs/xfs/xfs.o

Compile tested only.

Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
--
vda
--

To: Denys Vlasenko <vda.linux@...>
Cc: David Chinner <dgc@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Monday, April 21, 2008 - 11:18 pm

FWIW this one also seems to make no stack difference, at least on x86_64.

Not complaining; just checking it out. :)

If you can shrink xfs_bmapi, let me know. :)

Thanks,
-Eric
--

To: Eric Sandeen <sandeen@...>
Cc: Denys Vlasenko <vda.linux@...>, David Chinner <dgc@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Tuesday, April 22, 2008 - 12:10 am

FWIW, the path we care about is this path through ->writepage:

(submit_bio)
_xfs_buf_ioapply 32
xfs_buf_iorequest 0
xfs_buf_iostart 0
xfs_buf_read_flags 0
xfs_trans_read_buf 4
xfs_btree_read_bufs 16
xfs_alloc_lookup 56
xfs_alloc_lookup_eq 16
xfs_alloc_fixup_trees 20
xfs_alloc_ag_vextent_near 76
xfs_alloc_ag_vextent 0
xfs_alloc_vextent 48
xfs_bmap_btalloc 164
xfs_bmap_alloc 0
xfs_bmapi 228
xfs_iomap_write_allocate 116
xfs_iomap 20
xfs_map_blocks 16
xfs_page_state_convert 124
xfs_vm_writepage 12
-------------------------------------
checkstack total: 948

Realistically, the only things we can trim anything off are xfs_bmapi,
xfs_bmap_btalloc, xfs_iomap_write_allocate, and xfs_page_state_convert.
It's going to take a lot of work to get any significant change into
those functions given their complexity....

FWIW, if we've come through a syscall, the rest of the trace looks
like:

__writepage 0
write_cache_pages 100
generic_writepages 0
xfs_vm_writepages 12
do_writepages 0
__writeback_single_inode 36
sync_sb_inodes 40
writeback_inodes 0
balance_dirty_pages_ratelimited_nr 76
generic_file_buffered_write 96
xfs_write 80
xfs_file_aio_write 12
do_sync_write 140
vfs_write 12
--------------------------------------------
total 604

So the normal case uses 604 bytes prior to entering ->writepage.

It's when we are already using >2k of the stack when we enter
->writepage that we get into trouble....
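
The two totals above can be re-derived with a few lines of C. The arrays
just copy the per-function numbers from the two listings; the helper names
are made up. Keep in mind these are checkstack figures, so they cover only
the local-variable part of each frame.

```c
#include <stddef.h>

/* Frame sizes in bytes for the ->writepage path, in the order listed. */
static const unsigned writepage_frames[] = {
    32, 0, 0, 0, 4, 16, 56, 16, 20, 76,
    0, 48, 164, 0, 228, 116, 20, 16, 124, 12
};

/* Frame sizes in bytes for the syscall path leading into ->writepage. */
static const unsigned syscall_frames[] = {
    0, 100, 0, 12, 0, 36, 40, 0, 76, 96, 80, 12, 140, 12
};

#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* Add up n frame sizes. */
static unsigned sum_frames(const unsigned *frames, size_t n)
{
    unsigned total = 0;

    for (size_t i = 0; i < n; i++)
        total += frames[i];
    return total;
}
```

Summing both arrays gives 948 and 604 bytes, 1552 bytes together, which is
why the plain write path fits in a 4k stack while a deeper entry into
->writepage may not.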

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
--

To: Eric Sandeen <sandeen@...>
Cc: Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Monday, April 21, 2008 - 4:28 pm

s/timed bigged/times bigger/
--

To: Denys Vlasenko <vda.linux@...>
Cc: Eric Sandeen <sandeen@...>, Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Monday, April 21, 2008 - 5:55 am

If the 1M gives you more reliability (and I think it does) I don't
think it is "wasted". Would you trade occasional crashes for 1MB?
I wouldn't.

Also a typical process uses much more memory than just 4K. If it's
not a thread it needs its own page tables and from those alone you're
easily into 10+ pages even for a quite small process. But even threads
in practice have other overheads too if they actually do something.
The 4K won't save or break you.

[BTW if you're really interested in saving memory there are lots
of other subsystems where you could very likely save more. A common
example is the standard hash tables, which are still too big]

The trends are also against it: kernel code is getting more and more
complex all the time with more and more complicated stacks of
different subsystems on top of each other. It wouldn't surprise me if
at some point 8KB isn't even enough anymore. Going into the
other direction is definitely the wrong way.

-Andi
--

To: Eric Sandeen <sandeen@...>
Cc: Adrian Bunk <bunk@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 11:41 am

On Sun, 20 Apr 2008 09:05:40 -0500

it wasn't randomly picked; it was based on 2.4 kernels
(where we had 8kb, but that was roughly 2.5Kb or so for the task struct,

yes. Adrian is waay off in the weeds on this one. Nobody but him is suggesting to remove
8Kb stacks. I think everyone else agrees that having both options is valuable; and there
are better ways to find+fix stack bloat than removing this config option.

--
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
--

To: Arjan van de Ven <arjan@...>
Cc: Eric Sandeen <sandeen@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 12:03 pm

I'm not arguing for removing the option immediately, but long-term we
shouldn't need it.

This comes from my experience of removing obsolete drivers for hardware
for which also a more recent driver exists:
As long as there is some workaround (e.g. using an older driver or
8k stacks) the workaround will be used instead of getting proper
bug reports and fixes.

As far as I know all problems that are known with 4k stacks are some
nested things with XFS in the trace.

If this class of issues would get fixed one day, why would it be
valuable to also offer 8k stacks long-term? Especially weighted
against the fact that with only 4k stacks we will have more people
running into stack problems in -rc kernels if any new ones pop up,
resulting in getting more such problems fixed during -rc.

cu
Adrian

--

To: Adrian Bunk <bunk@...>
Cc: Arjan van de Ven <arjan@...>, Eric Sandeen <sandeen@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 11:30 pm

This "as far as I know" is a problem itself. Is it possible to implement (e.g.,
using some form of memory protection in hardware, but I am not an expert here)
an option with 8k stacks that, however, spams the log if the actual usage goes
above 4k, and have this as a default for some time? If 4k stacks are the goal
that is almost achieved, then this debugging option should have zero impact on
performance.

--
Alexander E. Patrakov
--

To: Alexander E. Patrakov <patrakov@...>
Cc: Adrian Bunk <bunk@...>, Arjan van de Ven <arjan@...>, Eric Sandeen <sandeen@...>, Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Thomas Gleixner <tglx@...>
Date: Wednesday, April 23, 2008 - 4:57 am

Shouldn't be hard. Use the 8k stack, and have the system mark the second
page as "not present". If it ever gets used you get a page fault. The
page fault handler then has to mark the page present before returning,
as well as queue up some spam (the call chain perhaps) for the log.

A less intrusive way is to use 8k stacks as-is, but put a signature in
the second page. When the process quits, examine the second stack page
to see if the signature got overwritten. This approach will only show
that a problem exists; it won't pinpoint exactly what does it.
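
A userspace sketch of the signature idea, assuming a downward-growing 8k
stack simulated by a plain byte array. The names and the fill byte are
invented for the example, and this is not kernel code. The area is painted
with a known byte, and later the lowest overwritten byte tells how deep the
stack ever got.

```c
#include <stddef.h>
#include <string.h>

#define EX_STACK_SIZE 8192u
#define EX_SIGNATURE  0xA5   /* arbitrary fill byte for this sketch */

/* Fill the whole simulated stack area with the signature byte. */
static void stack_paint(unsigned char *stack)
{
    memset(stack, EX_SIGNATURE, EX_STACK_SIZE);
}

/*
 * The stack grows down from stack + EX_STACK_SIZE toward stack, so the
 * lowest byte that no longer holds the signature marks the deepest use.
 * Returns the number of bytes that were ever used.
 */
static size_t stack_high_water(const unsigned char *stack)
{
    size_t i = 0;

    while (i < EX_STACK_SIZE && stack[i] == EX_SIGNATURE)
        i++;
    return EX_STACK_SIZE - i;
}

/* Nonzero if usage ever reached into the second (lower) 4k page. */
static int stack_exceeded_4k(const unsigned char *stack)
{
    return stack_high_water(stack) > 4096;
}
```

As noted above, this only says that the limit was crossed, not which call
chain did it. It can also undercount slightly if the deepest bytes written
happen to equal the fill byte.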

Helge Hafting

--

To: Eric Sandeen <sandeen@...>
Cc: Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 10:21 am

I'm arguing for aiming at having all 32bit architectures with 4k page
size using the same stack size. Not for having -rc kernels differ from

The only architecture that already defaults to 4k stacks is m68knommu,

cu
Adrian

--

To: Adrian Bunk <bunk@...>
Cc: Alan Cox <alan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 10:56 am

Oh, I know. I'm just saying that 4k seems chosen out of convenience for
memory management, without any real correlation to what you might
actually need to run a thread. They do happen to be roughly equivalent
for many cases, but not all. Setting a default which is not safe for
several common use cases does not seem wise...

I guess what I'm saying is, I don't agree that any callchain which needs
more than 4k of stack indicates brokenness that must be fixed, as
various posts in this thread seem to suggest.

Sure, 1k char buffers on the stack and massive structs and unlimited
recursion we can agree on as things to fix, but complex/deep/stacked
callchains which don't fit in 4k are much more of a grey area.

-Eric
--

To: Shawn Bohrer <shawn.bohrer@...>
Cc: Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Thomas Gleixner <tglx@...>
Date: Saturday, April 19, 2008 - 2:00 pm

On Sat, 19 Apr 2008 09:59:48 -0500

centos, oracle and redflag tend to follow the RHEL/fedora settings.

To be honest, at this point we're at a situation where
* Several very popular distributions have this enabled for 5+ years,
apparently without any real issues (otherwise the enterprise releases
would have turned this off)
* The early "hot known issues" have been resolved afaik, things like
block device stacking, and symlink recursion lookups are either no longer
recursive, or a lot less recursive than they used to be.

There are clear benefits to 4K stacks (no need to reiterate the flamewar,
but worth mentioning)
* Less memory consumption in the lowmem zone (critical for enterprise use,
also good for general performance)
* Kernel stacks at 8K are one of the most prominent order-1 allocations in the
kernel; again with big-memory systems the fragmentation of the lowmem zone
is a problem (and the distros that ship 4K stacks went there because of customer
complaints)
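
To spell out the "order-1" remark: the page allocator hands out blocks of
2^order contiguous pages, so with 4k pages a 4k stack is an order-0
allocation and an 8k stack is order-1. A small sketch, with a made-up
helper name and a page size fixed at 4k for the example:

```c
#define EX_PAGE_SIZE 4096ul   /* 4k pages, as on 32-bit x86 */

/* Smallest order such that (EX_PAGE_SIZE << order) >= size. */
static unsigned int ex_alloc_order(unsigned long size)
{
    unsigned int order = 0;

    while ((EX_PAGE_SIZE << order) < size)
        order++;
    return order;
}
```

By the same arithmetic a 12k stack would need an order-2 block, i.e. four
contiguous pages.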

On the flipside the arguments tend to be
1) certain stackings of components still run the risk of overflowing
2) I want to run ndiswrapper
3) general, unspecified uneasiness.

For 1), we need to know which they are, and then solve them, because even on x86-64 with 8k stacks
they can be a problem (just because the stack frames are bigger, although not quite double, there).
I've not seen any recent reports, I'll try to extend the kerneloops.org client to collect the
"stack is getting low" warning to be able to see how much this really happens.

for 2), the real answer there is "ndiswrapper needs 12kb not 8kb"

for 3), this is hard to deal with but also generally unfounded... you can use this argument against any change in the kernel.

--

To: Arjan van de Ven <arjan@...>
Cc: Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Thomas Gleixner <tglx@...>
Date: Saturday, April 19, 2008 - 10:36 pm

Except, apparently, not, at least in my experience.

Ask the xfs guys if they see stack overflows on x86_64, or on x86.

I've personally never seen common stack problems with xfs on x86_64, but
it's very common on x86. I don't have a great answer for why, but

That sounds like a very good thing to collect, and maybe if I re-send a
"clearly state stack overflows at oops time" patch you can easily keep tabs.

Thanks,

-Eric
--

To: Eric Sandeen <sandeen@...>
Cc: Arjan van de Ven <arjan@...>, Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 6:53 pm

We see them regularly enough on x86 to know that the first question
to any strange crash is "are you using 4k stacks?". In comparison,

Why? Because XFS makes extensive use of 64 bit types and so stack
usage in the critical paths changes by a relatively small amount
between 32 bit and 64 bit machines. IIRC, x86_64 only uses about
30% more stack than x86. So given that the stack doubles on x86_64
and we only increase usage (in XFS) from about 1500 bytes to 2000
bytes of stack usage, we have *lots* more stack space to spare on
x86_64 compared to 4k stacks on x86....
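
The arithmetic behind this comparison can be sketched as follows. This is only a rough illustration using the approximate figures quoted above (1500 and 2000 bytes of XFS usage, 4k and 8k stacks); the helper name `headroom` is hypothetical, and the calculation ignores everything else that shares the stack with XFS.

```python
# Approximate figures quoted in the message above; headroom() is a
# hypothetical helper used only for this illustration.
def headroom(stack_bytes, used_bytes):
    """Return the bytes left on a stack after the given usage."""
    return stack_bytes - used_bytes

x86_free = headroom(4096, 1500)      # 4k stack on x86
x86_64_free = headroom(8192, 2000)   # 8k stack on x86_64
growth = (2000 - 1500) / 1500        # relative increase in XFS stack usage

print(f"x86:    {x86_free} bytes free")
print(f"x86_64: {x86_64_free} bytes free")
print(f"usage growth: {growth:.0%}")
```

With these inputs the usage grows by roughly a third while the space left over more than doubles, which is the point being made about x86_64.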

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
--

To: Eric Sandeen <sandeen@...>
Cc: Shawn Bohrer <shawn.bohrer@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Thomas Gleixner <tglx@...>
Date: Sunday, April 20, 2008 - 2:11 am

On Sat, 19 Apr 2008 21:36:16 -0500

If you actually go over on x86, it's not unlikely that you're getting close to the edge on 64 bit.

One thing I've learned with the kerneloops.org work is that people don't read

... which makes me think we need to strengthen this part of the kernel.
(and then have kerneloops.org collect the issues)

If there's a clear pattern in the backtraces we will find it.
And then we can fix it... which is absolutely the right thing,
I don't think anyone disagrees with that.

So yes, if you can dig up your patch, please do!

--
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
--

To: Arjan van de Ven <arjan@...>
Cc: Shawn Bohrer <shawn.bohrer@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Thomas Gleixner <tglx@...>
Date: Saturday, April 19, 2008 - 2:33 pm

And let's observe that 8K stacks are of course still offered, so if
anyone disables 4K stacks in the .config, it will stay disabled.

Ingo
--

To: Ingo Molnar <mingo@...>
Cc: Arjan van de Ven <arjan@...>, Shawn Bohrer <shawn.bohrer@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, Thomas Gleixner <tglx@...>
Date: Saturday, April 19, 2008 - 3:10 pm

While you are changing the default, maybe also move it from the "Kernel
hacking" menu into the "General setup" menu? An option with default=y
is probably not an option that is targeted at kernel hackers only.
--
Stefan Richter
-=====-==--- -=-- =--==
http://arcgraph.de/sr/
--

To: Ingo Molnar <mingo@...>
Cc: <linux-kernel@...>, <arjan@...>, <tglx@...>
Date: Saturday, April 19, 2008 - 1:49 pm

There has been a dribble of reports - I don't have the links handy, nor did

I doubt if you're testing things like nfsd-on-xfs-on-md-on-porky-scsi-driver.

Enable CONFIG_DEBUG_STACK_USAGE. Monitor the results. It's so scary that
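
One way to "monitor the results" is to scan the kernel log for the stack-depth messages and keep the lowest value seen per command. The sketch below is not part of any kernel tooling. It assumes log lines of the form "<comm> used greatest stack depth: <N> bytes left" (optionally with a "(pid)" after the command name), which should be checked against the kernel actually in use. The function names, the 512-byte threshold and the sample lines are all hypothetical.

```python
import re

# Assumed message format (verify against your kernel's log output):
#   "<comm> used greatest stack depth: <N> bytes left"
# optionally with "(pid)" after the command name.
PATTERN = re.compile(
    r"(?P<comm>\S+)(?: \(\d+\))? used greatest stack depth: (?P<free>\d+) bytes left"
)

def lowest_free(lines):
    """Map each command name to the smallest 'bytes left' value seen for it."""
    lowest = {}
    for line in lines:
        m = PATTERN.search(line)
        if m is None:
            continue  # not a stack-depth message
        comm, free = m.group("comm"), int(m.group("free"))
        lowest[comm] = min(free, lowest.get(comm, free))
    return lowest

def below_threshold(lines, threshold=512):
    """Return sorted command names whose lowest free stack is under threshold."""
    return sorted(c for c, f in lowest_free(lines).items() if f < threshold)

# Invented sample lines, for illustration only.
sample = [
    "nfsd used greatest stack depth: 1800 bytes left",
    "nfsd used greatest stack depth: 412 bytes left",
    "mount (1234) used greatest stack depth: 2900 bytes left",
    "some unrelated log line",
]
print(lowest_free(sample))
print(below_threshold(sample))
```

In practice the lines would come from the saved kernel log rather than a hard-coded list, and the threshold would be chosen to suit the workload.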

Apparently not. I wouldn't enable it if I had a distro.

Anyway. We should be having this sort of discussion _before_ a patch
gets merged, no?
--
