Re: [PATCH][RFC] 4K stacks default, not a debug thing any more...?

To: Matt Mackall <mpm@...>, Rene Herman <rene.herman@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Ray Lee <ray-lk@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>
Date: Sunday, July 15, 2007 - 1:17 pm

If there are exactly two free pages in a system, the odds of starting any
program are not very good. You'll have to swap, and if you do, you can swap
two more pages in order to free enough RAM for the stack.
-- 
The secret of the universe is #@*%! NO CARRIER 

Friß, Spammer: TqB@mxlqnP.7eggert.dyndns.org -kplBVgN@7eggert.dyndns.org
 BpRVhx@1Q.7eggert.dyndns.org ozy0XBmkB@m1Sh.7eggert.dyndns.org
-
To: <7eggert@...>
Cc: Matt Mackall <mpm@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Ray Lee <ray-lk@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>
Date: Sunday, July 15, 2007 - 1:46 pm

A thread's kernel stack is a kernel allocation. If you failed to allocate it, 
you'd supposedly _already_ have swapped out everything that could be swapped out.

Moreover -- literally two pages free was hardly his point. The point is 
just that (with a page being the allocation unit) single page allocations 
are guaranteed to succeed if _any_ memory is free, while two adjacent (yes, 
and stacksize aligned) pages will be pretty hard to come by once the system 
has been up and running for some time.
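
The difference between a single-page and a two-page request can be illustrated with a toy model. This is a Python sketch, not kernel code and not the kernel's buddy allocator: memory is just a list of free/used flags, all helper names are hypothetical, and an 8K stack is assumed to need an even-aligned pair of free 4K pages.

```python
import random


def has_free_order0(free):
    """True if at least one single page is free."""
    return any(free)


def aligned_free_pairs(free):
    """Count even-aligned pairs (2k, 2k+1) where both pages are free."""
    return sum(1 for i in range(0, len(free) - 1, 2) if free[i] and free[i + 1])


def has_free_order1(free):
    """True if a two-page, size-aligned block could be handed out."""
    return aligned_free_pairs(free) > 0


def scattered_memory(num_pages, num_free, seed=0):
    """Toy fragmented memory: num_free free pages at random positions."""
    rng = random.Random(seed)
    free = [False] * num_pages
    for index in rng.sample(range(num_pages), num_free):
        free[index] = True
    return free


if __name__ == "__main__":
    memory = scattered_memory(num_pages=1000, num_free=20)
    print("free pages:", sum(memory))
    print("single page available:", has_free_order0(memory))
    print("aligned free pairs:", aligned_free_pairs(memory))
```

In this model any free page satisfies a single-page request, while a two-page request also depends on where the free pages happen to sit. Two free neighbours that straddle an alignment boundary do not count.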

Rene.

-
To: Rene Herman <rene.herman@...>
Cc: <7eggert@...>, Matt Mackall <mpm@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Ray Lee <ray-lk@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>
Date: Monday, July 16, 2007 - 9:43 am

So we are in a desperate situation, we can almost make no progress, adding 
another task is going to push the system into an unrecoverable situation, 

That never happened on my servers, therefore I'd opt for the little extra 
security of having spare 4k on the stack. (I made a patch which would 
printk a message if allocating a stack would ever fail).

I'm not at all opposed to letting the guys with zillions of threads 
benefit from having less unused kernel stack, but unless it's secure for
all users, it should not be default=y.

-- 
A beggar walked up to a well-dressed woman shopping on Rodeo
Drive and said, "I haven't eaten anything for days."
She looked at him and said, "God, I wish I had your willpower."
-
To: Bodo Eggert <7eggert@...>
Cc: Matt Mackall <mpm@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Ray Lee <ray-lk@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Monday, July 16, 2007 - 6:28 pm

Hnng. He was just saying that odds of two pages being buddies is quite 

Given that as Arjan stated Fedora and even RHEL have been using 4K stacks 
for some time now, and certainly the latter being a distribution which I 
would expect to both host a relatively large number of lvm/md/xfs and what 
stackeaters have you users and to be fairly conservative with respect to the 
chances of scribbling over kernel memory (I'm a trusting person...) it seems 
there might at this stage only be very few offenders left.

Seeing as how single-page stacks are much easier on the VM so that creating 
those zillion threads should also be faster, at _some_ percentage we get to 
say "and now to hell with the rest".

Do also note that with interrupts off the process stack, available stack is 
definitely not halved. I don't have data (if anyone reading does, please 
say) but I expect that on the kinds of busy networked systems that want 
many-thread creation to be fastest, their many concurrent interrupt sources 
might mean they are not actually experiencing less stack at all. That is, 
that "little extra security" you speak of might very well be none at all in 
practice and perhaps even negative.

Getting interrupts onto their own stack(s) certainly made for better (more 
deterministic that is) behaviour as well -- you're then independent of how 
deep into the stack you already are when the interrupt comes in which is 
otherwise anyone's guess. Now I must say I'm not particularly sure why you 
couldn't still also have those even if you don't pick 4K stacks, but as far 
as I'm aware they're a package deal at least today.

Single page stacks are much nicer on anyone. For all I care, that single 
page might be a larger soft-page. No idea how the hell you'd investigate the 
optimum page-size for any given system, but I quite fully expect it's larger 
than 4K these days _anyway_ even on x86 and for modern loads.

Since Linux doesn't yet have those that's also not very important currently 
though. 4K is...
To: Rene Herman <rene.herman@...>
Cc: Bodo Eggert <7eggert@...>, Matt Mackall <mpm@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Ray Lee <ray-lk@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>
Date: Monday, July 16, 2007 - 7:26 pm

I have to recompile the fedora kernel rpms (fc6, f7) with 8k stacks on
my i686 server. It's using NFS -> XFS -> DM -> MD (raid1) -> IDE disks.
With 4k stacks it crashes (hangs) within minutes after using NFS.
With 8k stacks it's rock solid. No crashes within months.

utz


-
To: utz lehmann <lkml123@...>
Cc: Rene Herman <rene.herman@...>, Bodo Eggert <7eggert@...>, Matt Mackall <mpm@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Ray Lee <ray-lk@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>
Date: Tuesday, July 17, 2007 - 10:45 am

utz> I have to recompile the fedora kernel rpms (fc6, f7) with 8k
utz> stacks on my i686 server. It's using NFS -> XFS -> DM -> MD
utz> (raid1) -> IDE disks.  With 4k stacks it crashes (hangs) within
utz> minutes after using NFS.  With 8k stacks it's rock solid. No
utz> crashes within months.

Does it give any useful information when it does crash?  Can you make
a simple test case using ram disks instead of IDE disks and then
building upon that?  

I think I should try to do this myself at some point...

John
-
To: John Stoffel <john@...>
Cc: Rene Herman <rene.herman@...>, Bodo Eggert <7eggert@...>, Matt Mackall <mpm@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Ray Lee <ray-lk@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>
Date: Tuesday, July 17, 2007 - 7:46 pm

No, sorry. Nearly always it locked up so hard that even sysrq didn't work
anymore. Most times the console was blanked. If not, there was a line
with "do_irq" or something like that (if I remember correctly).
A few times it was continuously oopsing (scrolling like mad).

I think it's just a stack overflow, given that XFS + a long IO stack is
known to have problems with 4k stacks. And I have zero crashes with the
recompiled 8k stack kernels. (All kernels are the fedora ones).

Btw: In the past the server ran on slightly different hardware and
without raid1 (NFS -> XFS -> DM -> IDE disk). It ran with 4k stacks. I
had a few crashes, but I blamed the hardware for them.

I don't want to make tests with the server. It's my main data storage

Sorry, I don't think I can do this. My other computer, which I can use
for tests, is x86_64 based.
And AFAIK the problem on the XFS side has something to do with looking
for free space on many AGs. So maybe a bigger and filled filesystem is
needed. And 50GB ram disks are out of the question.

utz


-
To: utz lehmann <lkml123@...>
Cc: Bodo Eggert <7eggert@...>, Matt Mackall <mpm@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Ray Lee <ray-lk@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>
Date: Monday, July 16, 2007 - 9:00 pm

Okay, thanks. That's the usual offender. And the only one I've heard of...

Rene.

-
To: Rene Herman <rene.herman@...>
Cc: Bodo Eggert <7eggert@...>, Matt Mackall <mpm@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Monday, July 16, 2007 - 6:37 pm

This is the core dispute here. Stated differently, I hope you never
design a bridge that I have to drive over.

Correctness first, optimization second. Introducing random and
difficult to trace crashes upon an unsuspecting audience of sysadmins
and users is not a viable option.

If at some point one of the pro-4k stacks crowd can prove that all
code paths are safe, or introduce another viable alternative (such as
Matt's idea for extending the stack dynamically), then removing the 8k
stacks option makes sense.
-
To: Ray Lee <ray-lk@...>
Cc: Bodo Eggert <7eggert@...>, Matt Mackall <mpm@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Monday, July 16, 2007 - 6:55 pm

Quite. But unfortunately you didn't actually go into the bit on how, given 
separate interrupt stacks, available stackspace might not actually _be_ less 

I'll do that the minute you prove the current shared 8K stacks are safe. Do 

I'm still waiting for larger soft-pages... does anyone in this thread have a 
clue on their status?

Rene.
-
To: Rene Herman <rene.herman@...>
Cc: Ray Lee <ray-lk@...>, Bodo Eggert <7eggert@...>, Matt Mackall <mpm@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Monday, July 16, 2007 - 7:45 pm

You claim 4k+4k is safe, therefore 8k must be safe, too. But if 8k is 
safe, this does not yet prove that you can store 5k+3k in 4k+4k.
-- 
Funny quotes:
38. Last night I played a blank tape at full blast. The mime next door went
    nuts.
-
To: Bodo Eggert <7eggert@...>
Cc: Ray Lee <ray-lk@...>, Matt Mackall <mpm@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Monday, July 16, 2007 - 8:28 pm

No, I most certainly do not. I claim that proving 4K and separate (per cpu) 
interrupt stacks are safe is exactly the same as proving unshared 8K stacks 
are safe. That is, you don't, no such proof exists other than in the eating 
of the pudding. Ray (and you) in considering !CONFIG_4KSTACKS to be "safer" 
than CONFIG_4KSTACKS suggest that _inevitably_ CONFIG_4KSTACKS would leave 
you with less available stack and I pointed out this isn't the case.

And in fact, I shouldn't have said "exactly" the same. Unshared interrupt 
stacks make for more deterministic behaviour, so you'd have a harder time 
proving anything to some set limit of uncertainty with the shared 8K stacks 

I really have not made any claim of the kind. The argument is that with 
CONFIG_4KSTACKS, available stack space isn't inevitably less at any point in 
time.

Rene.
-
To: Rene Herman <rene.herman@...>
Cc: Bodo Eggert <7eggert@...>, Ray Lee <ray-lk@...>, Matt Mackall <mpm@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Tuesday, July 17, 2007 - 6:06 am

And yet you have a stricter claim than I do. If you are right, I'll be
right, too, because two times less-than-4K is less than 8K. If I'm wrong
and 8K is not enough, you must be wrong, too, because you cannot possibly

Why do you insist on 4Kstacks being good as long as there is _one_ use case 

, which are a completely different thing which was bundled to 4K-stacks

I don't want my stack to overflow in order to be theoretically able to
prove it does not overflow. I'd rather go for 8K+4K-stacks, and if _you_
have done the proof _you_ wanted to make, we can talk again about
4K-stacks. Then I'll just add up the maximum stack usages and have the

I claim, you can store 5k + 3k on the 8k stack, where 5k is something like
the current worst case for non-interrupt stack and 3k is plenty for
interrupts. Thousands of stable systems with 8K stacks support my claim.

You claimed with 4k + 4k, there is not less available stack space.
(At least for use cases you are interested in, but I'll assume you don't 
 want other use cases to crash.)

If you were right, I'd have enough space on 4k + 4k to store that 5k.
Obviously, thousands of systems disagree by crashing with 4K-stacks.
That's most simple logic.
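
The 5k + 3k argument can be written down as a small check. This is a Python sketch of the reasoning only: the 5k and 3k figures are the ones quoted in this message, the function names are hypothetical, and the model has just two areas (process and interrupt), so it ignores any further split of the interrupt stack.

```python
KIB = 1024


def fits_shared(process_use, irq_use, stack_size=8 * KIB):
    """Shared stack: process and interrupt frames compete for one area."""
    return process_use + irq_use <= stack_size


def fits_split(process_use, irq_use, process_stack=4 * KIB, irq_stack=4 * KIB):
    """Separate stacks: each kind of use must fit in its own area."""
    return process_use <= process_stack and irq_use <= irq_stack


if __name__ == "__main__":
    process_use, irq_use = 5 * KIB, 3 * KIB
    print("shared 8K:", fits_shared(process_use, irq_use))
    print("4K + 4K:  ", fits_split(process_use, irq_use))
```

Within this two-area model, every load that fits 4K + 4K also fits a shared 8K stack, but not the other way round.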

Of course I may be wrong and the kernels don't crash because of 4K stacks, 
but because of bad karma ... But even then, you'd first have to get rid of
that bad karma before defaulting to 4K stacks.

-- 
Top 100 things you don't want the sysadmin to say:
41. OH, SH*T! (as they scrabble at the keyboard for ^c).
-
To: Bodo Eggert <7eggert@...>
Cc: Ray Lee <ray-lk@...>, Matt Mackall <mpm@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Tuesday, July 17, 2007 - 10:38 am

Firstly, it's not two times 4K but 4K + (4K + 4K) * NR_CPUS. Secondly, _you_ 
are the one making claims -- specifically that !CONFIG_4KSTACKS is "safer", 
happily ignoring the fact that generally speaking available process stack 
can be _better_ with CONFIG_4KSTACKS and there seems to exist but _one_ 
(één, ein, une) known situation where it's problematic.
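
One way to read the "4K + (4K + 4K) * NR_CPUS" expression is as a memory footprint: one 4K stack per thread plus two 4K interrupt stacks per CPU, against one 8K stack per thread. The Python sketch below works through that arithmetic. The footprint reading, the function names, and the thread and CPU counts are assumptions made for illustration, not figures from the thread.

```python
KIB = 1024


def footprint_8k(threads):
    """Total bytes of kernel stack with one shared 8K stack per thread."""
    return threads * 8 * KIB


def footprint_4k(threads, nr_cpus):
    """Total bytes with a 4K stack per thread plus two 4K interrupt
    stacks per CPU, following the 4K + (4K + 4K) * NR_CPUS expression."""
    return threads * 4 * KIB + nr_cpus * (4 * KIB + 4 * KIB)


if __name__ == "__main__":
    threads, nr_cpus = 10_000, 4  # hypothetical workload
    print("8K stacks:", footprint_8k(threads), "bytes")
    print("4K stacks:", footprint_4k(threads, nr_cpus), "bytes")
```

The per-CPU term does not grow with the number of threads, so for large thread counts the 4K configuration uses close to half the stack memory.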

Must there be none rather than one? In some senses maybe, if the problem is 
more than bad, fixable code but I doubt you know this. CONFIG_4KSTACKS is 
much better on the VM (and hence faster) and as such, any user not using the 
one nicely isolated and identified problem case benefits from it. This means 
it's either very close or already _at_ the point of being the best default 
for the kernel. Changing options is for users with special needs, as you 
believe you are.

I truly apologise for taking it into this direction but you're wearing me 
down rapidly. Every single time you insert some uninformed crap comment that 
shows that you both don't understand the issue and didn't understand what 
the other person was saying and then after being made aware of such, ignore 
that and follow up with the next uninformed crap comment. That is, you seem 
to care less about the issue than about the discussion and since for me it's 
quite the other way around I'm leaving it at that.

RedHat is the one with the actual data available, and they've been enabling 
4KSTACKS for quite some time now (with some of their users apparently 
unhappy about it but not many it would seem).

Jesper also already posted how he's going to proceed: lift 4K from debug 
status and submit it as default for -mm. As to the latter bit, unless I 
remember wrong, it already _was_ default in -mm for some time a while ago so 
Andrew no doubt has an informed opinion on how to proceed with that.

Rene.
-
To: Rene Herman <rene.herman@...>
Cc: Bodo Eggert <7eggert@...>, Ray Lee <ray-lk@...>, Matt Mackall <mpm@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Tuesday, July 17, 2007 - 7:19 pm

It can be better, but the worst case stays 4K + 4K - unless one CPU will 
walk over to the next and nicely ask for a cup of stack.

Therefore you can discuss 4K + 4K or 4K + 4K + 4K, or 4K + 4K * \inf. It 
won't change a thing:

1) It all can be reduced to 4K + 4K by assuming all IRQs happen on one CPU.
2) Even if the interrupts decide not to happen on one CPU, you still can't 
   fit that possible 5K into 4K.

Having a local stack per CPU helps locality, and it's good, but that's 

One case is reason enough not to enable 4K-stacks by default, and this 

"Look how fast I crashed!" doesn't buy you anything. In order to finish 


If you designed a car, you would also go for brakes with a well-known 
problem just because they weigh less and all the people not crossing 
mountains would be happy about the weight benefit - that is if they'd 
notice, wouldn't you?


I put the facts onto the ground. If you're getting down, you may stumble 

So what did you say about the worst case stack size being bigger than 4K?
That's correct, you choose to put it aside as a minor use case. Yea, it's
just the combination you'd choose for a reliable server setup, the users
won't have a problem when their systems crash ...

Was your claim about each CPU having a separate stack helping your cause?
No, everybody can see it's not. That is, except for you, your CPU will
just borrow some, since their neighbours have some free stack.


But let's not stop here: You claimed: "Unshared interrupt 
stacks make for more deterministic behaviour, so you'd have a harder time 
proving anything to some set limit of uncertainty with the shared 8K stacks 
than with the unshared 4K stacks."

So you want to tell me I can't prove 8K stacks are safe - you are right.
But can you prove 4K stacks are safe? You can't either. But you want to be
able to prove it. I told you to stick to your words - go and prove 4K+4K
to be safe. What did you do? You chose to ignore that.

I bet you don't even consider proving 4K stacks ...
To: Bodo Eggert <7eggert@...>
Cc: Ray Lee <ray-lk@...>, Matt Mackall <mpm@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Tuesday, July 17, 2007 - 7:57 pm

- The ability to read
- The ability to understand

You're doing a hell of a job already.

Rene.
-
To: Rene Herman <rene.herman@...>
Cc: Bodo Eggert <7eggert@...>, Ray Lee <ray-lk@...>, Matt Mackall <mpm@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Thursday, July 19, 2007 - 1:05 pm

If you designed them like you design secure systems, that explains a lot.

-- 
Top 100 things you don't want the sysadmin to say:
83. Damn, and I just bought that pop...
-
To: Bodo Eggert <7eggert@...>
Cc: Rene Herman <rene.herman@...>, Ray Lee <ray-lk@...>, Matt Mackall <mpm@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>
Date: Tuesday, July 17, 2007 - 7:29 pm

no it's separate stacks for soft and hard irqs, so it's really 4+4+4


another angle is that while correctness rules, userspace correctness
rules as well. If you can't fork enough threads for what you need the
machine for, why have the machine in the first place?

-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org

-
To: Arjan van de Ven <arjan@...>
Cc: Bodo Eggert <7eggert@...>, Rene Herman <rene.herman@...>, Ray Lee <ray-lk@...>, Matt Mackall <mpm@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>
Date: Thursday, July 19, 2007 - 1:20 pm

Thanks, I missed that information. Unfortunately this change still does 

Userspace can't work correctly after the kernel crashed, but it can fail 
gracefully if it can't create enough threads.

I'd really like to be able to select 4K stacks, but as long as that stack
would overflow, I can't, and it can't be the default, either.
-- 
Top 100 things you don't want the sysadmin to say:
8. ...and after I patched the microcode...
-
To: Rene Herman <rene.herman@...>
Cc: Bodo Eggert <7eggert@...>, Matt Mackall <mpm@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Monday, July 16, 2007 - 7:13 pm

Language barrier, I think, or perhaps I was unclear. Please read that
as "4k stacks introduce no new stack bugs." And if we put wli's
unconditional interrupt stacks into the kernel, it's pretty obvious
that 8k stacks are at least as safe in that case as 4k stacks.

Given that there's actual, y'know, reports of people who can easily
crash a 4k+interrupt stacks kernel, and not an 8k one, I think the
current evidence speaks for itself.

The point remains that the burden of proof of the safety of the 4k
only option is upon those people who want to remove the 8k option.
-
To: Ray Lee <ray-lk@...>
Cc: Bodo Eggert <7eggert@...>, Matt Mackall <mpm@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Monday, July 16, 2007 - 7:22 pm

Removing any such option was not the objective of this thread, just lifting 
4K stacks from debug and making it the default. People fortunate enough to 
use workloads where some piece of crap code by accident works more often 
with the current shared 8K stacks than it does with the unshared 4K stacks 
can then still nicely not select it (or fix the code if possible).

Rene.



-
To: Rene Herman <rene.herman@...>
Cc: Bodo Eggert <7eggert@...>, Matt Mackall <mpm@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Monday, July 16, 2007 - 7:40 pm

The second message in this thread, according to my reader, from Zan

True, but your messages are reading as advocacy for removing the 8k
option. I'm saying that's a bad idea. If I misunderstood your

If they even realize that it's the cause of the problem. In the
meantime, we're generating more bug reports to lkml. As the general
opinion is that the ones getting received now aren't getting enough
attention (see regression tracking threads), setting a default that is
known to break setups in hard to debug ways seems counterproductive.
-
To: Ray Lee <ray-lk@...>
Cc: Bodo Eggert <7eggert@...>, Matt Mackall <mpm@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Monday, July 16, 2007 - 8:57 pm

True enough. I'm rather wondering though why RHEL is shipping with it if 
it's a _real_ problem. Scribbling junk all over kernel memory would be the 
kind of thing I'd imagine you'd mightily piss off enterprise customers with. 

I personally believe that CONFIG_4KSTACKS is not better only for very few 
users but well, no, if even some users exist, then I wouldn't want to 
suggest they'd be disallowed (shared or unshared) 8K stacks.

I _would_ in fact suggest there are few enough left that rather than 4K, 8K 
should really be the option that only those few would select, but of course, 
given config defaults that's mostly a matter of semantics, so who cares in 

Well, no. "oldconfig" works fine, and other than that, all failure modes 
I've heard about also in this thread are MD/LVM/XFS. This is extremely 
widely tested stuff in at least Fedora and RHEL. "hard to debug" is simply 
not the case -- everyone will immediately start yelling 4KSTACKS when that 
software-stack appears anywhere.

Don't try and hang this off generic development unease... ;-)

Rene.

-
To: Rene Herman <rene.herman@...>
Cc: Ray Lee <ray-lk@...>, Bodo Eggert <7eggert@...>, Matt Mackall <mpm@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Tuesday, July 17, 2007 - 12:14 pm

I can't speak for Fedora, but RHEL disables XFS in their kernel likely

Again don't assume that because Fedora and RHEL have 4K stacks means
that MD/LVM/XFS is widely tested.

Additionally I think I should point out that the problems pointed out so
far are not the only problem areas with 4K stacks.  There are out of
tree drivers to consider as well, and use cases like ndiswrapper.
-
To: Shawn Bohrer <shawn.bohrer@...>
Cc: Rene Herman <rene.herman@...>, Ray Lee <ray-lk@...>, Bodo Eggert <7eggert@...>, Matt Mackall <mpm@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Tuesday, July 17, 2007 - 8:16 pm

-was- - the SGI folks submitted patches to deal with some gcc problems
with stack usage.
-
To: Shawn Bohrer <shawn.bohrer@...>
Cc: Ray Lee <ray-lk@...>, Bodo Eggert <7eggert@...>, Matt Mackall <mpm@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Tuesday, July 17, 2007 - 12:52 pm

Okay. So is it fair to say it's largely XFS that's the problem? No problems 
with LVM/MD and say plain ext? If that's the case, I believe it could be 
concluded that it's not something in any sense fundamentally unfixable and 

No, quite, that specific combination was reported in this thread alone 3 
times again, so that one's clear, but _other_ than that, I've heard of no 

Except these. Good to have pointed out, thanks, but as far as I'm concerned 
both these cases do not get a say in what's default configuration for the 
kernel.org kernel. They might get a say in what's removed or not removed 
from that kernel but that's not under discussion at the moment (nor would I 
expect it to be anytime soon if ever).

Rene.

-
To: Rene Herman <rene.herman@...>
Cc: Shawn Bohrer <shawn.bohrer@...>, Ray Lee <ray-lk@...>, Bodo Eggert <7eggert@...>, Matt Mackall <mpm@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Tuesday, July 17, 2007 - 1:45 pm

There *are* crashes from LVM and ext3.  I had to change kernels to avoid
them.

I had crashes with ext3 on LVM snapshot on DM mirror on SATA.
-- 
Zan Lynx <zlynx@acm.org>
To: Rene Herman <rene.herman@...>
Cc: Ray Lee <ray-lk@...>, Bodo Eggert <7eggert@...>, Matt Mackall <mpm@...>, Jeremy Fitzhardinge <jeremy@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Monday, July 16, 2007 - 7:35 pm

I may have worded the initial email a bit too heavily towards 8K
removal, but that was mainly to provoke some discussion.  Nothing
should get removed without a fair (and long) warning in
feature-removal-schedule.txt and of course not before we know that the
new option is at least as safe as what we intend to remove, so the
patch really was just intended as a fairly harmless nudge towards
getting 4K stacks into a shape where we can eventually start
considering 8K removal.

-- 
Jesper Juhl <jesper.juhl@gmail.com>
Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please      http://www.expita.com/nomime.html
-
To: Rene Herman <rene.herman@...>
Cc: Ray Lee <ray-lk@...>, Bodo Eggert <7eggert@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Monday, July 16, 2007 - 7:07 pm

Given that most x86 users won't want anything to do with them, it's
not going to help us at all here.

-- 
Mathematics is the supreme nostalgia of our time.
-
To: Matt Mackall <mpm@...>
Cc: Ray Lee <ray-lk@...>, Bodo Eggert <7eggert@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Monday, July 16, 2007 - 7:12 pm

No idea why not? Is this something you expect, or know, or... ? (and who are 
users in this context?)

Rene.


-
To: Rene Herman <rene.herman@...>
Cc: Ray Lee <ray-lk@...>, Bodo Eggert <7eggert@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Monday, July 16, 2007 - 7:27 pm

Larger soft pages waste tremendous amounts of memory (mostly in page
cache) for minimal benefit on, say, the typical desktop. While there
are workloads where it's a win, it's probably on a small percentage of
machines.

So it's absolutely no help in fixing our order-1 allocation problem
because we don't want to force large pages on people.

-- 
Mathematics is the supreme nostalgia of our time.
-
To: Matt Mackall <mpm@...>
Cc: Rene Herman <rene.herman@...>, Ray Lee <ray-lk@...>, Bodo Eggert <7eggert@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Wednesday, July 18, 2007 - 8:15 pm

Using kmalloc(8k) instead of alloc_page() doesn't sound like too big a deal
and that will solve the problem. The whole idea is to avoid the memcpy
+ pte mangling of defrag while hopefully lowering cpu utilization in
allocations at the same time.

About 4k stacks, I was generally against them: much better to fail in
fork than to risk corruption. The per-irq stack part is a great feature,
though (too bad it wasn't enabled for the safer 8k stacks).

Failing in a do_no_page with variable order page size allocation is a
fatal event (the task will be killed), failing in fork is graceful,
userland can retry etc... Fork can fail for different reasons, ulimit
itself is the most likely source of fork failures. I don't think the
8k stacks have ever been a problem, yes you will run out of stack
sooner (sooner also because the 4k stacks take less memory) but
nothing is terribly wrong if the 8k allocation fails.
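
What "userland can retry" might look like is sketched below in Python. The helper name, the retry count and the delay are hypothetical. The sketch assumes the failure surfaces as an OSError carrying EAGAIN or ENOMEM, which are the errno values fork(2) documents for resource exhaustion.

```python
import errno
import time

RETRYABLE = (errno.EAGAIN, errno.ENOMEM)


def spawn_with_retry(spawn, attempts=3, delay=0.1):
    """Call spawn() and return its result.

    If spawn() raises OSError with EAGAIN or ENOMEM, sleep and try again,
    up to `attempts` calls in total. Any other error, or a failure on the
    last attempt, is re-raised to the caller.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return spawn()
        except OSError as exc:
            last_try = attempt == attempts - 1
            if exc.errno not in RETRYABLE or last_try:
                raise
            time.sleep(delay)
```

On a Unix system the helper could wrap `os.fork`, for example `pid = spawn_with_retry(os.fork)`. The caller still has to decide what to do when every attempt fails, which is the graceful-failure path described above.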
-
To: Andrea Arcangeli <andrea@...>
Cc: Rene Herman <rene.herman@...>, Ray Lee <ray-lk@...>, Bodo Eggert <7eggert@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Wednesday, July 18, 2007 - 8:41 pm

How do you figure?

If you're saying that soft pages helps our 8k stack allocations, it
doesn't. The memory overhead of soft pages will be higher (5-15%,
mostly due to file tails in pagecache) than the level at which 8k
stacks currently run into trouble (1-2% free?).
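
The file-tail overhead can be estimated with a few lines of Python. This is a sketch under stated assumptions: each cached file is rounded up to whole pages, the list of file sizes is made up for illustration, and the function names are hypothetical. It does not reproduce the 5-15% estimate above, which depends on the real mix of files in the page cache.

```python
def cached_bytes(file_size, page_size):
    """Page cache bytes needed to hold a whole file, rounded up to pages."""
    pages = -(-file_size // page_size)  # ceiling division
    return pages * page_size


def tail_overhead(file_sizes, page_size):
    """Fraction of cached bytes that is padding past the end of files."""
    used = sum(file_sizes)
    cached = sum(cached_bytes(size, page_size) for size in file_sizes)
    if cached == 0:
        return 0.0
    return (cached - used) / cached


if __name__ == "__main__":
    sizes = [1000, 5000, 70000, 300000]  # hypothetical file sizes in bytes
    for page_size in (4096, 65536):
        share = tail_overhead(sizes, page_size)
        print(f"page size {page_size}: {share:.1%} of cache is tail padding")
```

Small files dominate the waste: a 1000-byte file occupies one full page whatever the page size, so the padding grows with the page.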

Not helpful.

-- 
Mathematics is the supreme nostalgia of our time.
-
To: Matt Mackall <mpm@...>
Cc: Andrea Arcangeli <andrea@...>, Ray Lee <ray-lk@...>, Bodo Eggert <7eggert@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Wednesday, July 18, 2007 - 8:48 pm

With tail-packing it is.

Rene.

-
To: Rene Herman <rene.herman@...>
Cc: Andrea Arcangeli <andrea@...>, Ray Lee <ray-lk@...>, Bodo Eggert <7eggert@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Wednesday, July 18, 2007 - 9:28 pm

Tail packing is a whole new can of worms. Especially as it's very
likely to make performance suffer on small files (the common case).

On the other hand, if someone can demonstrate that tail-packed page
cache doesn't suck, we should put it in mainline pronto. The poor
architectures that are stuck with real 64k pages are sure to
appreciate it.

-- 
Mathematics is the supreme nostalgia of our time.
-
To: Andrea Arcangeli <andrea@...>
Cc: Matt Mackall <mpm@...>, Rene Herman <rene.herman@...>, Ray Lee <ray-lk@...>, Bodo Eggert <7eggert@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Wednesday, July 18, 2007 - 8:39 pm

8K stacks without IRQ stacks are not "safer" so I don't understand your
comment ?
-
To: Alan Cox <alan@...>
Cc: Andrea Arcangeli <andrea@...>, Matt Mackall <mpm@...>, Rene Herman <rene.herman@...>, Ray Lee <ray-lk@...>, Bodo Eggert <7eggert@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Friday, July 27, 2007 - 9:03 am

Hmm was it SuSE or RH kernels (or mainline?) I saw which had a test to
defer soft IRQs if they occurred too deep in the stack for the current
thread.

-Eric
-
To: Eric Sandeen <sandeen@...>
Cc: Alan Cox <alan@...>, Andrea Arcangeli <andrea@...>, Matt Mackall <mpm@...>, Rene Herman <rene.herman@...>, Ray Lee <ray-lk@...>, Bodo Eggert <7eggert@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Friday, July 27, 2007 - 1:18 pm

Perhaps the "8 KB softpage" should be an option instead
of 8 KB stack size?

Not sure about ABI compatibility.
-- 
Krzysztof Halasa
-
To: Alan Cox <alan@...>
Cc: Matt Mackall <mpm@...>, Rene Herman <rene.herman@...>, Ray Lee <ray-lk@...>, Bodo Eggert <7eggert@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Wednesday, July 18, 2007 - 9:33 pm

Ouch, see the reports about 4k stack crashes. I agree they're not
safe w/o irq stacks (like on x86-64), but they're generally safer.
-
To: Andrea Arcangeli <andrea@...>
Cc: Matt Mackall <mpm@...>, Rene Herman <rene.herman@...>, Ray Lee <ray-lk@...>, Bodo Eggert <7eggert@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Thursday, July 19, 2007 - 5:23 am

On Thu, 19 Jul 2007 03:33:58 +0200

Still don't follow. How is "exceeds stack space but less likely to be
noticed" safer.

Alan
-
To: Alan Cox <alan@...>
Cc: Matt Mackall <mpm@...>, Rene Herman <rene.herman@...>, Ray Lee <ray-lk@...>, Bodo Eggert <7eggert@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Thursday, July 19, 2007 - 6:52 am

Statistically speaking it clearly is. The reason is probably that the
irq theoretical issue happens only on large boxes with lots of
reentrant irqs. Not all irqs are reentrant, not all systems run lots
of irqs at the same time etc.
-
To: Andrea Arcangeli <andrea@...>
Cc: Alan Cox <alan@...>, Rene Herman <rene.herman@...>, Ray Lee <ray-lk@...>, Bodo Eggert <7eggert@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Wednesday, July 18, 2007 - 9:37 pm

Here's a way to make forward progress on this whole thing:

Turn on irqstacks when using 8k stacks
Detect when usage with 8k stacks would overrun a 4k stack when doing
 our stack switch and do a WARN_ONCE
Fix up the damn bugs
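
The detection step above can be modelled in user space. This is a minimal sketch, not kernel code: it assumes a downward-growing stack aligned to THREAD_SIZE, a stack pointer strictly inside that area, and it ignores the thread_info at the bottom of the stack. The names stack_bytes_used and OverrunDetector are made up for illustration.

```python
THREAD_SIZE = 8192   # 8k stacks; must be a power of two
LIMIT_4K = 4096      # the smaller stack size we are testing against


def stack_bytes_used(sp, thread_size=THREAD_SIZE):
    """Bytes in use on a downward-growing stack aligned to thread_size.

    Assumes sp lies strictly inside the stack area, so the masked value
    is the distance from the stack base up to sp (the free space).
    """
    free = sp & (thread_size - 1)
    return thread_size - free


class OverrunDetector:
    """Report, once only, the first usage that would not fit in `limit`."""

    def __init__(self, limit=LIMIT_4K):
        self.limit = limit      # settable threshold, in bytes
        self.warned = False

    def check(self, sp):
        used = stack_bytes_used(sp)
        if used > self.limit and not self.warned:
            self.warned = True
            return (f"WARN: {used} bytes of stack in use, "
                    f"over the {self.limit}-byte limit")
        return None
```

Keeping the limit a constructor argument means the same check can sit near the real stack size on a production box or just under 4k when hunting for heavy users.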

-- 
Mathematics is the supreme nostalgia of our time.
-
To: Matt Mackall <mpm@...>
Cc: Alan Cox <alan@...>, Rene Herman <rene.herman@...>, Ray Lee <ray-lk@...>, Bodo Eggert <7eggert@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Thursday, July 19, 2007 - 7:24 am

I don't think they're necessarily bugs. IMHO the WARN_ON is better off
at 7k level like it is today with the current STACK_WARN. 4k for a
stack for common code really is small. I doubt you're going to find
obvious culprits that way, more likely you'll have to mangle the code
to call kmalloc for fairly small structures which isn't necessarily a
good thing in the long term. The folio ptes array that Hugh allocated
on the stack in his large PAGE_SIZE patch of Jul 2001 comes to mind:
that thing, like any other local array, would need to be
kmalloced with a 4k stack. With 4k I'm afraid you'd better not use the
stack for anything but pointers, especially if you run in common code
that may invoke I/O like that.
-
To: Andrea Arcangeli <andrea@...>
Cc: Matt Mackall <mpm@...>, Rene Herman <rene.herman@...>, Ray Lee <ray-lk@...>, Bodo Eggert <7eggert@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Thursday, July 19, 2007 - 7:44 am

You want the limit settable. On a production system you want to set the
limit to somewhere appropriate for the stack size used. When debugging
(eg to remove any last few bogus users of 8K stack space) you want to be
able to set it to just under 4K

Alan
-
To: Alan Cox <alan@...>
Cc: Andrea Arcangeli <andrea@...>, Matt Mackall <mpm@...>, Rene Herman <rene.herman@...>, Ray Lee <ray-lk@...>, Bodo Eggert <7eggert@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Friday, July 27, 2007 - 9:02 am

Hm, when cramming cxfs into 4k at sgi, I had a patch that did just that
for debugging (warn about encroaching on 4k without actually tipping
over, with a settable threshold...)

Maybe I should resurrect it & send it out...

(FWIW I think I recall that the warning itself sometimes tipped the
scales enough on 4k stacks to bring the box down)

-eric
-
To: Eric Sandeen <sandeen@...>
Cc: Andrea Arcangeli <andrea@...>, Matt Mackall <mpm@...>, Rene Herman <rene.herman@...>, Ray Lee <ray-lk@...>, Bodo Eggert <7eggert@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Friday, July 27, 2007 - 1:38 pm

You can always switch stack for the printk and it probably should panic
at that point and give a trace then die as that is what we are trying to
prove does not occur
-
To: Alan Cox <alan@...>
Cc: Eric Sandeen <sandeen@...>, Andrea Arcangeli <andrea@...>, Matt Mackall <mpm@...>, Rene Herman <rene.herman@...>, Ray Lee <ray-lk@...>, Bodo Eggert <7eggert@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Friday, July 27, 2007 - 2:31 pm

Hmm, something that hooks in not only at do_IRQ time (as the present

Yes, only yesterday I saw exactly this happening with DEBUG_STACKOVERFLOW
when doing a udf -> pktcdvd -> cdrom -> ide_cd thing. It's one of those
reproducible will-crash-4k-stacks tests, especially if you have debug stuff
enabled in your build that would make on-stack structures (where such
exist on the codepath) a bit heavier.

Admittedly, what seems to have happened is a bit pathological:

[  481.836378] cdrom: entering cdrom_count_tracks
[  481.844266] BUG: sleeping function called from invalid context at
include/asm/semaphore.h:98
[  481.844434] do_IRQ: stack overflow: 164
[  481.844540]  [<c0405cfe>] show_trace_log_lvl+0x19/0x2e
[  481.844707]  [<c0405dfe>] show_trace+0x12/0x14
[  481.844867]  [<c0405e14>] dump_stack+0x14/0x16
[  481.845027]  [<c0406ff6>] do_IRQ+0x7b/0xe1
[  481.845186]  [<c040583e>] common_interrupt+0x2e/0x34
[  481.845348]  [<c042b8e7>] printk+0x1b/0x1d
[  481.845507]  [<c0422c05>] __might_sleep+0x81/0xdc
[  481.845668]  [<c066d869>] __reacquire_kernel_lock+0x2d/0x4f
[  481.845833]  [<c066b09b>] schedule+0x78a/0x7a4
[  481.845996]  [<c066b538>] wait_for_completion+0x72/0x97
[  481.846160]  [<c05937a6>] ide_do_drive_cmd+0xeb/0x109
[  481.846324]  [<f89172a2>] cdrom_queue_packet_command+0x40/0xc5 [ide_cd]
[  481.846503]  [<f89175b7>] ide_cdrom_packet+0x86/0xa4 [ide_cd]
[  481.846669]  [<f8854dc1>] cdrom_get_disc_info+0x48/0x87 [cdrom]
[  481.846839]  [<f8854ec6>] cdrom_get_last_written+0x2a/0xfe [cdrom]
[  481.847009]  [<f891831b>] cdrom_read_toc+0x39d/0x3f3 [ide_cd]
[  481.847231]  [<f8918e7e>] ide_cdrom_audio_ioctl+0x130/0x1ce [ide_cd]
[  481.847414]  [<f8854123>] cdrom_count_tracks+0x5c/0x126 [cdrom]
[  481.847583]  [<f8855688>] cdrom_open+0x147/0x79c [cdrom]
[  481.847748]  [<f891799a>] idecd_open+0x75/0x8a [ide_cd]
[  481.847912]  [<c04aac0e>] do_open+0x1...
To: Satyam Sharma <satyam.sharma@...>
Cc: Alan Cox <alan@...>, Andrea Arcangeli <andrea@...>, Matt Mackall <mpm@...>, Rene Herman <rene.herman@...>, Ray Lee <ray-lk@...>, Bodo Eggert <7eggert@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Tuesday, July 31, 2007 - 11:53 pm

No, what I had did only that, so it was still a matter of probabilities...

-Eric

-
To: Eric Sandeen <sandeen@...>
Cc: Satyam Sharma <satyam.sharma@...>, Alan Cox <alan@...>, Andrea Arcangeli <andrea@...>, Matt Mackall <mpm@...>, Rene Herman <rene.herman@...>, Ray Lee <ray-lk@...>, Bodo Eggert <7eggert@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Wednesday, August 1, 2007 - 4:11 am

How expensive would it be to allocate two pages, then use the MMU to mark
the second page unwritable? Hardware wise it should be possible (for
constant 4k pagesizes, I have not worked with variable pagesize MMUs)
and since it's a per-context-switch constant operation, it would be a
special case in the fault handler rather than adding another entry to
the VM for every process.

Using large hardware pages to cover the kernel mapping could be worked
around by leaving the area where the current process stack resides
mapped via 4k pages.  Of course, I haven't touched a modern PC MMU in
ages, so I could be missing something fundamentally difficult.

The other issue is with the layered IO design - no matter what we
configure the stack size to, it is still possible to create a set of
translation layers that will cause it to crash regularly:  XFS on
dm_crypt on loop on XFS on dm_crypt on loop on ad infinitum.
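
The arithmetic behind that point can be sketched as follows. The per-layer stack costs are invented purely for illustration (they are not measurements of XFS, dm_crypt or loop), and depth_at_overflow is a hypothetical helper; the only claim is that a repeating set of layers with non-zero cost eventually exceeds any fixed stack size.

```python
from itertools import cycle


def depth_at_overflow(stack_size, layer_costs, reserved=0):
    """Return (depth, layer_name) at which nested layers first overflow.

    layer_costs is a list of (name, bytes_of_stack) pairs that repeats
    indefinitely, modelling one translation layer stacked on another.
    """
    if not layer_costs or any(cost <= 0 for _, cost in layer_costs):
        raise ValueError("every layer needs a positive stack cost")
    used = reserved
    for depth, (name, cost) in enumerate(cycle(layer_costs), start=1):
        used += cost
        if used > stack_size:
            return depth, name


# Hypothetical per-layer costs in bytes, for illustration only.
LAYERS = [("xfs", 1200), ("dm_crypt", 400), ("loop", 300)]

for size in (4096, 8192):
    depth, name = depth_at_overflow(size, LAYERS)
    print(f"{size}-byte stack overflows at layer {depth} ({name})")
```

Doubling the stack only moves the depth at which the overflow happens; it does not remove it.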

That said, I'm missing something here - why is the stack growing?
Filesystems should be issuing bios with callbacks, so they should be
back off the stack, same with dm, loop, etc.   Am I missing a step where
they use a wrapper function that pretends to be synchronous?
-
To: Dan Merillat <dan.merillat@...>
Cc: Eric Sandeen <sandeen@...>, Satyam Sharma <satyam.sharma@...>, Alan Cox <alan@...>, Matt Mackall <mpm@...>, Rene Herman <rene.herman@...>, Ray Lee <ray-lk@...>, Bodo Eggert <7eggert@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Wednesday, August 1, 2007 - 9:33 am

Tweaking kernel ptes is prohibitive during clone() because that's
kernel memory and it would require a flush tlb all with IPIs that
won't scale (IPIs are really the blocker). Basically vmalloc already
does what you suggest with the gap page and yet we can't use it for
performance reasons. Kernel stack should be readable by any context to
allow sysrq+t kind of things, so I doubt it's feasible to do tricks to
avoid ipis.
-
To: Andrea Arcangeli <andrea@...>
Cc: Dan Merillat <dan.merillat@...>, Eric Sandeen <sandeen@...>, Satyam Sharma <satyam.sharma@...>, Matt Mackall <mpm@...>, Rene Herman <rene.herman@...>, Ray Lee <ray-lk@...>, Bodo Eggert <7eggert@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Wednesday, August 1, 2007 - 11:44 am

On Wed, 1 Aug 2007 15:33:58 +0200

Agreed - except when doing debug work then it's an acceptable cost. You
still have to sort the debug side out because you are going to fault the
kernel stack which will probably then cause a triple fault and reboot on
the spot.
-
To: Alan Cox <alan@...>
Cc: Andrea Arcangeli <andrea@...>, Eric Sandeen <sandeen@...>, Satyam Sharma <satyam.sharma@...>, Matt Mackall <mpm@...>, Rene Herman <rene.herman@...>, Ray Lee <ray-lk@...>, Bodo Eggert <7eggert@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>, Denis Vlasenko <vda.linux@...>
Date: Thursday, August 9, 2007 - 9:03 pm

I was assuming debugging work, yes.  I was also thinking it wouldn't
be done at clone() time, but mapped (on a single CPU) at the time of a
context switch.  It would eliminate IPI, but would probably make the
rest of the TLB handling much too ugly to contemplate.    As an
alternative, could the TLB flush and associated IPI be deferred until
the process migrates?   First migration would trigger flush/IPI,
further migration would be as now, no?   I'd happily run it with
various dm/md layers underneath


Because the kernel mapping covers all physical memory contiguously, so
if the page isn't allocated, it could be used by a kernel data
structure you need to access.  Same reason the kernel stack has to be
contiguous pages.   Well, for non-highmem at least.  Either way, you
don't want to mark an in-use page as inaccessible, you never know
what's under there.
-
To: Matt Mackall <mpm@...>
Cc: Andrea Arcangeli <andrea@...>, Alan Cox <alan@...>, Ray Lee <ray-lk@...>, Bodo Eggert <7eggert@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Wednesday, July 18, 2007 - 9:56 pm

WLI: are you submitting? Makes great sense regardless of anything and 


DM of course is fairly "layered-by-design" so I _hope_ they can be classified 
as simple bugs...

Rene.
-
To: Matt Mackall <mpm@...>
Cc: Ray Lee <ray-lk@...>, Bodo Eggert <7eggert@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Tuesday, July 17, 2007 - 10:38 pm

I was just now looking at how much space is in fact wasted in pagecache for 
various pagesizes by running the attached dumb little program from a few 
selected directories (heavy stack recursion, never mind).

Well, hmmm. This is on a (compiled) git tree:

rene@7ixe4:~/src/linux/local$ pageslack
total	: 447350347
  4k	: 67738037 (15%)
  8k	: 147814837 (33%)
16k	: 324614581 (72%)
32k	: 724629941 (161%)
64k	: 1592785333 (356%)

Nicely constant factor 2.2 instead of the 2 one would expect but oh well. On 
a collection of larger files the percentages obviously drop. This is on a 
directory of ogg vorbis files:

root@7ixe4:/mnt/ogg/.../... # pageslack
total	: 70817974
  4k	: 26442 (0%)
  8k	: 67402 (0%)
16k	: 124746 (0%)
32k	: 288586 (0%)
64k	: 419658 (0%)

The "typical desktop" is presented by neither I guess but does involve audio 
and (much larger still) video and bloody huge browser apps.

Not too sure then that 8K wouldn't be something I'd want, given fewer 
pagefaults and all that...
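
For reference, the kind of measurement shown above can be sketched like this. It is not the attached program, only a guess at the calculation: for each file, count the unused bytes in its last, partially filled page, and report the sum as a truncated percentage of the total file size. The function names are made up.

```python
import os

PAGE_SIZES = [4096, 8192, 16384, 32768, 65536]


def tail_slack(size, page_size):
    """Bytes wasted in the last, partially filled page of a file."""
    rem = size % page_size
    return page_size - rem if rem else 0


def page_slack(sizes, page_sizes=PAGE_SIZES):
    """Return (total bytes, {page_size: wasted bytes}) for the file sizes."""
    sizes = list(sizes)
    total = sum(sizes)
    slack = {ps: sum(tail_slack(s, ps) for s in sizes) for ps in page_sizes}
    return total, slack


def file_sizes(root):
    """Yield the size of every regular file below root (symlinks skipped)."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                yield os.path.getsize(path)


def report(root="."):
    total, slack = page_slack(file_sizes(root))
    print(f"total\t: {total}")
    for ps, wasted in slack.items():
        pct = wasted * 100 // total if total else 0
        print(f"{ps // 1024:3d}k\t: {wasted} ({pct}%)")
```

Calling report() from inside a directory prints one line per page size in the same layout as the listings above.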

Rene.
To: Rene Herman <rene.herman@...>
Cc: Ray Lee <ray-lk@...>, Bodo Eggert <7eggert@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Wednesday, July 18, 2007 - 12:54 pm

I'd be surprised if a user had substantially more than one OGG, video,
or browser in memory at one time. In fact, you're likely to find only
a fraction of each of those in memory at any given time.

Meanwhile, they're likely to have thousands of small browser cache,
thumbnail, config, icon, maildir, etc. files in cache. And hundreds of
medium-sized libraries, utilities, applications, and so on.

You can expect the distribution of file sizes to follow a gamma
distribution, with a large hump towards the small end of the spectrum

Fewer minor pagefaults, perhaps. Readahead already deals with most of
the major pagefaults that larger pages would.

Anyway, raising the systemwide memory overhead by up to 15% seems an
awfully silly way to address the problem of not being able to allocate
a stack when you're down to your last 1 or 2% of memory! In all
likelihood, we'll fail sooner because we're completely OOM.

-- 
Mathematics is the supreme nostalgia of our time.
-
To: Matt Mackall <mpm@...>
Cc: Ray Lee <ray-lk@...>, Bodo Eggert <7eggert@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Wednesday, July 18, 2007 - 1:17 pm

Well, I've seen larger pagesizes submerge in more situations, specifically 
in allocation overhead -- ie, making the struct page's fit in lowmem for 
hugemem x86 boxes was the first I heard of it. But yes, otherwise (also) 
mostly database loads which obviously have moved to 64-bit since.

Pagecache tail-packing seems like a promising idea to deal with the downside 
of larger pages but I'll admit I'm not particularly sure how many _up_ sides 
to them are left on x86 (not -64) now that's becoming a legacy architecture 
(and since you just shot down the pagefaults thing).

Rene.
-
To: Matt Mackall <mpm@...>
Cc: Ray Lee <ray-lk@...>, Bodo Eggert <7eggert@...>, Jeremy Fitzhardinge <jeremy@...>, Jesper Juhl <jesper.juhl@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Monday, July 16, 2007 - 7:32 pm

Okay. I would've expected that 4K was fairly tiny for today's loads but as 
usual I'm relatively data challenged so I guess I'll take your word for it. 
Bummer.

Rene.

-
To: Ray Lee <ray-lk@...>
Cc: Rene Herman <rene.herman@...>, Bodo Eggert <7eggert@...>, Matt Mackall <mpm@...>, Jeremy Fitzhardinge <jeremy@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Monday, July 16, 2007 - 6:54 pm

Please note that I was not trying to remove the 8K stack option right
now - heck, I didn't even add anything to feature-removal-schedule.txt
- all I wanted to accomplish with the patch that started this thread
was:  a) indicate that the 4K option is no longer a debug thing  and
b) make 4K stacks the default option in vanilla kernel.org kernels as
a gentle nudge towards getting people to start fixing the code paths
that are not 4K stack safe.
Distros that currently use 8K stacks can continue to do so just fine,
individuals compiling their own kernel.org kernels can as well and
people using oldconfig wouldn't get any change, only people
configuring a new kernel.org kernel from scratch would see a change.
It was mostly meant as a hint that we want to move in the 4K stack
direction over time...
In the future (perhaps far future) when all 4K unsafe codepaths are
believed to have been fixed an entry could be made in
feature-removal-schedule.txt stating that the 8K option would go away
in 6, 12 or whatever, months.   That was my intention with the patch I
posted, I never intended to rip out 8K stacks anytime *soon*.


-- 
Jesper Juhl <jesper.juhl@gmail.com>
Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please      http://www.expita.com/nomime.html
-
To: Jesper Juhl <jesper.juhl@...>
Cc: Ray Lee <ray-lk@...>, Rene Herman <rene.herman@...>, Bodo Eggert <7eggert@...>, Matt Mackall <mpm@...>, Jeremy Fitzhardinge <jeremy@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Monday, July 16, 2007 - 7:42 pm

That's the big NACK. It's OK for MM, where things are supposed to be in a 
not well-tested state, but for running possibly mission-critical systems,
you should take no risk.

If you'd run a 4K stack on the NFS+XFS+LVM+dmcrypt+MD+somethingmore 
setup driving your loved one's life support, you may go ahead.
-- 
I'm a member of DNA (National Assocciation of Dyslexics).
	-- Storm in <5Z4Z7.52353$4x4.6445347@news2-win.server.ntlworld.com>
-
To: Bodo Eggert <7eggert@...>
Cc: Jesper Juhl <jesper.juhl@...>, Ray Lee <ray-lk@...>, Rene Herman <rene.herman@...>, Matt Mackall <mpm@...>, Jeremy Fitzhardinge <jeremy@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Thursday, July 19, 2007 - 3:34 pm

Mission-critical machines are not supposed to have their kernel configured
by an incompetent/careless sysadmin who didn't think about the
config choices he made at kernel build time.
--
vda
-
To: Denis Vlasenko <vda.linux@...>
Cc: Bodo Eggert <7eggert@...>, Jesper Juhl <jesper.juhl@...>, Ray Lee <ray-lk@...>, Rene Herman <rene.herman@...>, Matt Mackall <mpm@...>, Jeremy Fitzhardinge <jeremy@...>, Linux Kernel Mailing List <linux-kernel@...>, William Lee Irwin III <wli@...>, David Chinner <dgc@...>, Arjan van de Ven <arjan@...>
Date: Thursday, July 19, 2007 - 4:04 pm

Is it careless to assume good code quality for default options?
Does the 4K stack come with a big red warning about crashing the kernel?
(I just checked, it does not, only benefits are listed.)
Are 4K stacks so obviously flawed nobody would use them for reliable systems?
Or is each sysadmin supposed to read LKML in order to find out about the
pitfalls you designed for them?
-- 
Top 100 things you don't want the sysadmin to say:
55. NO!  Not _that_ button!
-
To: Bodo Eggert <7eggert@...>
Cc: Ray Lee <ray-lk@...>, Rene Herman <rene.herman@...>, Matt Mackall <mpm@...>, Jeremy Fitzhardinge <jeremy@...>, Linux Kernel Mailing List <linux-kern