Hi,
Here's a couple of patches to improve the memory barrier situation on x86.
They probably aren't going upstream until after the x86 merge, however I'm
posting them here for RFC, and in case anybody wants to backport into stable
trees.
---
movnt* instructions are not strongly ordered with respect to other stores,
so if we are to assume stores are strongly ordered in the rest of the x86_64
kernel, we must fence these off (see similar examples in i386 kernel).

[ The AMD memory ordering document seems to say that nontemporal stores can
also pass earlier regular stores, so maybe we need sfences _before_ movnt*
everywhere too? ]

Signed-off-by: Nick Piggin <npiggin@suse.de>
Index: linux-2.6/arch/x86_64/lib/copy_user_nocache.S
===================================================================
--- linux-2.6.orig/arch/x86_64/lib/copy_user_nocache.S
+++ linux-2.6/arch/x86_64/lib/copy_user_nocache.S
@@ -117,6 +117,7 @@ ENTRY(__copy_user_nocache)
popq %rbx
CFI_ADJUST_CFA_OFFSET -8
CFI_RESTORE rbx
+ sfence
ret
CFI_RESTORE_STATE
-
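As a rough userspace illustration of the rule the patch above applies (a sketch, not the kernel code): nontemporal stores are followed by an sfence before returning, so that ordinary stores issued later cannot become visible ahead of the streamed data. copy_nocache_ints() is a made-up name; _mm_stream_si32() and _mm_sfence() are the compiler intrinsics for movnti and sfence, and the non-SSE2 fallback exists only so the sketch builds on other targets.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_stream_si32, _mm_sfence */
#endif

/*
 * Hypothetical helper (not a kernel API): copy n ints with nontemporal
 * stores where SSE2 is available, then fence so the streamed data is
 * ordered before any store the caller issues afterwards.
 */
static void copy_nocache_ints(int *dst, const int *src, size_t n)
{
    assert(dst != NULL && src != NULL);
#if defined(__SSE2__)
    for (size_t i = 0; i < n; i++)
        _mm_stream_si32(&dst[i], src[i]);  /* movnti: weakly ordered store */
    _mm_sfence();                          /* same role as the sfence before ret */
#else
    memcpy(dst, src, n * sizeof(*dst));    /* fallback: ordinary stores only */
#endif
}
```

A caller that sets a "buffer is ready" flag with a plain store after copy_nocache_ints() returns can then rely on the usual x86 store ordering, which is the assumption the rest of the kernel makes.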
According to latest memory ordering specification documents from Intel and
AMD, both manufacturers are committed to in-order loads from cacheable memory
for the x86 architecture. Hence, smp_rmb() may be a simple barrier.

Also according to those documents, and according to existing practice in Linux
(eg. spin_unlock doesn't enforce ordering), stores to cacheable memory are
visible in program order too. Special string stores are safe -- their
constituent stores may be out of order, but they must complete in order WRT
surrounding stores. Nontemporal stores to WB memory can go out of order, and so
they should be fenced explicitly to make them appear in-order WRT other stores.
Hence, smp_wmb() may be a simple barrier.

http://developer.intel.com/products/processor/manuals/318147.pdf
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/...

In userspace microbenchmarks on a core2 system, fence instructions range
anywhere from around 15 cycles to 50, which may not be totally insignificant
in performance critical paths (code size will go down too).

However the primary motivation for this is to have the canonical barrier
implementation for x86 architecture.

smp_rmb on buggy pentium pros remains a locked op, which is apparently
required.

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/include/asm-i386/system.h
===================================================================
--- linux-2.6.orig/include/asm-i386/system.h
+++ linux-2.6/include/asm-i386/system.h
@@ -274,7 +274,11 @@ static inline unsigned long get_limit(un

#ifdef CONFIG_SMP
#define smp_mb() mb()
-#define smp_rmb() rmb()
+#ifdef CONFIG_X86_PPRO_FENCE
+# define smp_rmb() rmb()
+#else
+# define smp_rmb() barrier()
+#endif
#ifdef CONFIG_X86_OOSTORE
# define smp_wmb() wmb()
#else
Index: linux-2.6/include/asm-x86_64/system.h
===================================================================
--- linux-2.6.orig/include/asm-x86_64/system.h
+++ linux-2.6/in...
...
Great news!
First it looks like a really great thing that it's revealed at last.
But then... there is probably some confusion: did we have to use
ineffective code for so long?

First again, we could try to blame Intel etc. But then, wait a minute:
is it such a mystery knowledge? If this reordering is done there are
some easy rules broken (just like in examples from these manuals). And
if somebody cared to do this for optimization, then this is probably
noticeable optimization, let's say 5 or 10%. Then any test shouldn't
need to take very long to tell the truth in less than 100 loops!

So, maybe linux needs something like this, instead of waiting few
years with each new model for vendors goodwill? IMHO, even for less
popular processors, this could be checked under some debugging option
at the system start (after disabling suspicious barrier for a while
plus some WARN_ONs).

Thanks,
Jarek P.
-
I think the chip manufacturers really wanted to keep their options open.
Having the option to re-order loads in architecturally visible ways was
something that they probably felt they really wanted to have. On the other
hand:

 - I bet they had noticed that things break, and some applications depend
   on fairly strong ordering (not necessarily in Linux-land, but..)

   I suspect hw manufacturers go through life hoping that "software
   improves". They probably thought that getting rid of the old 16-bit
   windows would mean that less people depended on undefined behaviour.

   And I suspect that they started noticing that no, with threads and
   JVM's and things, *more* people started depending on fairly strong
   memory ordering.

 - I suspect Intel in particular noticed that they can do a lot of very
   aggressive re-ordering at a microarchitectural level, but can still
   guarantee that *architecturally* they never show it (dynamic detection
   of reordered loads being replayed on cache dirty events etc).

IOW, I suspect that both Intel and AMD noticed that while they had wanted
to keep their options open, those options weren't really realistic, and
not something that the market wanted (aggressive use of threading wants
*stricter* memory ordering, not looser), and they could work well enough

Quite frankly, even *within* Intel and AMD, there are damn few people who
understand exactly what the memory ordering requirements and guarantees
are and historically were for the different CPU's.

I would bet that had you asked a random (but still competent) Intel/AMD
engineer that wasn't really intimately involved with the actual design of
the cache protocols and memory pipelines, they would absolutely not have
been able to tell you how the CPU actually worked.

So no, there's no way a software person could have afforded to say "it
seems to work on my setup even without the barrier". On a dual-socket
setup with a shared bus, that says absolutely *nothin...
Yes, I still can't believe this, but after some more reading I start
to admit such things can happen in computer "science" too... I've
mentioned a lost performance, but as a matter of fact I've been more
concerned with the problem of truth:

From: Intel(R) 64 and IA-32 Architectures Software Developer's Manual
Volume 3A:

"7.2.2 Memory Ordering in P6 and More Recent Processor Families
...
1. Reads can be carried out speculatively and in any order.
..."

So, it looks to me like almost the 1-st Commandment. Some people (like
me) did believe this, others tried to check, and it was respected for
years notwithstanding nobody had ever seen such an event.

And then, a few years later, we have this:
From: Intel(R) 64 Architecture Memory Ordering White Paper
"2 Memory ordering for write-back (WB) memory
...
Intel 64 memory ordering obeys the following principles:
1. Loads are not reordered with other loads.
..."

I know, technically this doesn't have to be a contradiction (for not
WB), but to me it's something like: "OK, Elvis lives and this guy is
not real Paul McCartney too" in an official CIA statement!

I'm still so "dazed and confused" that I can't tell this (or anything)
is right...

Thanks very much for so extensive and sound explanation,
Jarek P.
PS: Btw, I apologize Helge for not trusting her: "verification by
testing would not be trivial" words.
-
When Intel first added speculative loads to the x86 family, they pegged the
speculative load to the cache line. If the cache line is invalidated, so is
the speculative load. As a result, out-of-order reads to normal memory are
invisible to software. If a write to the same memory location on another CPU
would make the fetched value invalid, it will make the cache line invalid,
which invalidates the fetch.

I think it's extremely unlikely that any x86 CPU will do this any
differently. It's hard to imagine Intel and AMD would go to all this trouble
for so long just to stop so late in the line's lifetime.

DS
-
I'd say that's exactly what Intel wanted. It's pretty common (we do
it all the time in the kernel too) to create an API which places a
stronger requirement on the caller than is actually required. It can
make changes much less painful.

Has performance really been much problem for you? (even before the
lfence instruction, when you theoretically had to use a locked op)?
I mean, I'd struggle to find a place in the Linux kernel where there
is actually a measurable difference anywhere... and we're pretty
performance critical and I think we have a reasonable amount of lockless
code (I guess we may not have a lot of tight computational loops, though).
I'd be interested to know what, if any, application had found these

The thing is that those documents are not defining what a particular
implementation does, but how the architecture is defined (ie. what
must some arbitrary software/hardware provide and what may it expect).

It's pretty natural that Intel started out with a weaker guarantee
than their CPUs of the time actually supported, and tightened it up
after (presumably) deciding not to implement such relaxed semantics
for the foreseeable future.
-
On Mon, Oct 15, 2007 at 10:09:24AM +0200, Nick Piggin wrote:
I'm not performance-words at all, so I can't help you, sorry. But, I
understand people who care about this, and think there is a popular
conviction barriers and locked instructions are costly, so I'm

I'm not sure this is the right way to tell it. If there is no
distinction between what is and what could be, how can I believe in
similar Alpha or Itanium stuff? IMHO, these manuals sometimes look
like they describe some real hardware mechanisms, and sometimes they
mention about possible changes and reserved features too. So, when

As a matter of fact it's not natural for me at all. I expected the
other direction, and I still doubt programmers' intentions could be
"automatically" predicted good enough, so IMHO, it's not for long.
Of course, it doesn't seem to be any help for linux or bsd
programmers, which still have to think about different architectures.

Regards,
Jarek P.
-
It's more expensive than nothing, sure. However in real code, algorithmic
complexity, cache misses and cacheline bouncing tend to be much bigger
issues.

I can't think of a place in the kernel where smp_rmb matters _that_ much.
seqlocks maybe (timers, dcache lookup), vmscan... Obviously removing the
lfence is not going to hurt. Maybe we even gain 0.01% performance in
someone's workload.

Also, remember: if loads are already in-order, then lfence is a noop,
right? (in practice it seems to have to do a _little_ bit of work, but

No. Why are you reading that much into it? I know for a fact that some
non-x86 architectures actual implementations have stronger ordering than
their ISA allows. It's nothing to do with you "believing" how the hardware

Really? Consider the consequences if, instead of releasing this latest
document tightening consistency, Intel found that out of order loads
were worth 5% more performance and implemented them in their next chip.
The chip could be completely backwards compatible, but all your old code
would break, because it was broken to begin with (because it was outside
the spec).

IMO Intel did exactly the right thing from an engineering perspective,
and so did Linux to always follow the spec.
-
You are right: considering current CPUs there could be no performance
problem at all. Removing LOCKs for older ones should probably matter
more, but as a matter of fact, now I wouldn't bet even on this - it

I've different opinion on this: I expect any spec to describe current
implementation. Before issuing new models any changes of
implementation should be made public with proper margin of time. Then
system could be optimally adjusted to a real hardware, instead of
planned only, but possibly never realized (plus doing such not used
things with old means is usually more costly: lock vs. lfence). There
is still problem of specs' completeness: there are probably often some
things unspecified which could break on a new model, so never 100%

But, if you follow the spec - you don't follow the spec! Why do you
ignore so much this part of Intel's spec:

"This document contains information which Intel may change at any
time without notice. Do not finalize a design with this information."

Maybe it's a real Intel intention and not for lawyers only? (Btw, it
seems we have an example.)

Regards,
Jarek P.
-
what you don't realize is that Intel (and AMD) have built their business
on making sure that their new CPU's run existing software with no
modifications, (and almost always faster than the old versions). remember
that for most of the world, getting the software modified would mean
buying a new version, if the vendor bothered to make a different version
for the new chip.

if they required everyone to buy new software to use a new chip it
wouldn't work well. In fact Intel tried to do exactly that with the
itanium and it has been a spectacular failure (or at the very least, not a

in theory they could change anything at any time, in practice if they
break old software they won't sell the chips, so the modifications tend to
be along the lines of this one, adding detail to the specifications so
that programmers can get more performance.

David Lang
-
On Tue, Oct 16, 2007 at 02:14:17AM -0700, david@lang.hm wrote:
It's a good point to always consider when you analyze how something
new should work if it's used with older programs too. But with newer
things like SMP or multithreading they probably have more choice, and

The failure of an architecture doesn't mean all specific new
technologies used in itanium were failure too, so they could be back
when needed (and nothing better in reserve) yet.

I don't think 'not breaking' is much problem here, rather how to use
all new features (which you seem to ignore a bit) to get maximum of
performance without breaking older things. Or, like current problem:
go rational and remove useless (according to new specs) things, even
without performance gain, or stay 'safe'?

Jarek P.
-
On Mon, Oct 15, 2007 at 11:09:59AM +0200, Jarek Poplawski wrote:
...performance-wards?!
Looks like serious: I don't even now who I'm not now!
Jarek P.
-
I'm not sure exactly what the situation is with the manufacturers,
but maybe they (at least Intel) wanted to keep their options open
WRT their barrier semantics, even if current implementations were

I don't know quite what you're saying... the CPUs could probably get
performance by having weakly ordered loads, OTOH I think the Intel
ones might already do this speculatively so they appear in order but
essentially have the performance of weak order.

If you're just talking about this patch, then it probably isn't much
performance gain. I'm guessing you'd be lucky to measure it from

I don't know if that would be worthwhile. It actually isn't always
trivial to trigger reordering. For example, on my dual-core core2,
in order to see reads pass writes, I have to do work on a set that
exceeds the cache size and does a huge amount of work to ensure it
is going to trigger that. If you can actually come up with a test
case that triggers load/load or store/store reordering, I'm sure
Intel / AMD would like to see it ;)

All existing processors as far as we know are in-order WRT loads vs
loads and stores vs stores. It was just a matter of getting the docs
clarified, which gives us more confidence that we're correct and a
reasonable guarantee of forward compatibility.

So, I think the plan is just to merge these 3 patches during the
current window.
-
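The distinction drawn in the message above (loads may pass earlier stores, but loads stay ordered against loads and stores against stores) can be explored on a toy model rather than on hardware. The sketch below assumes a simplified machine in which each CPU has a FIFO store buffer and executes loads in program order, and it enumerates every interleaving of two small litmus tests. All names (struct op, explore, reachable, sb, mp) are hypothetical, and the model claims nothing about real silicon beyond the documented rules.

```c
#include <assert.h>
#include <string.h>

enum { ST, LD };
#define NCPU 2
#define NOPS 2   /* operations per CPU in these litmus tests */

/* ST: write val to addr.  LD: read addr into result slot val. */
struct op { int kind, addr, val; };

struct state {
    int mem[2];                 /* shared memory, two locations */
    int pc[NCPU];               /* next op index per CPU */
    struct op buf[NCPU][NOPS];  /* per-CPU FIFO store buffer */
    int buf_len[NCPU];
    int res[2];                 /* values returned by the two loads */
};

/* Depth-first walk over every interleaving; seen[a][b] is set when the
 * loads can return a (slot 0) and b (slot 1). */
static void explore(const struct op prog[NCPU][NOPS], struct state s,
                    int seen[2][2])
{
    int done = 1;
    for (int c = 0; c < NCPU; c++) {
        if (s.pc[c] < NOPS) {           /* CPU c runs its next instruction */
            struct state n = s;
            struct op o = prog[c][n.pc[c]++];
            done = 0;
            if (o.kind == ST) {
                n.buf[c][n.buf_len[c]++] = o;     /* store is only buffered */
            } else {
                int v = n.mem[o.addr];
                for (int i = 0; i < n.buf_len[c]; i++)
                    if (n.buf[c][i].addr == o.addr)
                        v = n.buf[c][i].val;      /* forward own newest store */
                n.res[o.val] = v;
            }
            explore(prog, n, seen);
        }
        if (s.buf_len[c] > 0) {         /* oldest buffered store reaches memory */
            struct state n = s;
            n.mem[n.buf[c][0].addr] = n.buf[c][0].val;
            memmove(&n.buf[c][0], &n.buf[c][1],
                    (size_t)(n.buf_len[c] - 1) * sizeof(n.buf[c][0]));
            n.buf_len[c]--;
            done = 0;
            explore(prog, n, seen);
        }
    }
    if (done) {
        assert(s.res[0] >= 0 && s.res[0] < 2 && s.res[1] >= 0 && s.res[1] < 2);
        seen[s.res[0]][s.res[1]] = 1;
    }
}

/* Store buffering: each CPU stores to one location, then loads the other. */
static const struct op sb[NCPU][NOPS] = {
    { { ST, 0, 1 }, { LD, 1, 0 } },
    { { ST, 1, 1 }, { LD, 0, 1 } },
};

/* Message passing: CPU 0 writes data then flag; CPU 1 reads flag then data. */
static const struct op mp[NCPU][NOPS] = {
    { { ST, 0, 1 }, { ST, 1, 1 } },
    { { LD, 1, 0 }, { LD, 0, 1 } },
};

/* Returns 1 if the two loads of prog can return (a, b) in this model. */
static int reachable(const struct op prog[NCPU][NOPS], int a, int b)
{
    struct state s;
    int seen[2][2] = { { 0, 0 }, { 0, 0 } };
    memset(&s, 0, sizeof(s));
    explore(prog, s, seen);
    return seen[a][b];
}
```

In this model reachable(sb, 0, 0) holds, which is the "reads pass writes" case that still needs smp_mb(), while reachable(mp, 1, 0) does not: a reader that sees the flag set can never see stale data, which is the case where smp_wmb() and smp_rmb() need no fence instruction.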
I meant: if there is any reordering possible this should be quite
distinctly visible, because why would any vendor enable such nasty
things if not for performance. But now I start to doubt: of course
there is such a possibility someone makes this reordering for some
other reasons which could be so rare it's hard to check. And this
someone knows its processors are seen less efficient because of eg.

No, it's only about the comment to this patch: "Hence, smp_rmb() may be

Anyway, it seems any heavy testing such as yours, should give us the
same information years earlier than any vendors manual and then any
gain is multiplied by millions of users. Then only still doubtful
cases could be treated with additional caution and some debugging

After reading this Intel's legal information I don't think you should
And they really should be!
Jarek P.
-
It's not. Not in the cases where it is explicitly allowed and actively
exploited (loads passing stores), but most definitely not distinctly

Yes: it isn't the explicitly allowed reorderings that we care
about here (because obviously we're retaining the barriers for those).
It would be cases of bugs in the CPUs meaning they don't follow the
standard. But how far do you take your mistrust of a CPU? You could
ask gcc to insert locked ops between every load and store operation?

Firstly, while it can be possible to write a code to show up reordering,
it is really hard (ie. impossible) to guarantee no reordering happens. For
example, it may have only showed up on SMT+SMP P4 CPUs with some obscure
interactions between threads and cores involving more than 2 threads.

Secondly, even if we were sure that no current implementations reordered
loads, we don't want to go outside the bounds of the specification
because we might break on some future CPUs. This isn't a big performance

Yes, but that's the same way I feel after reading *any* legal "information" ;)
-
I'm not sure of your point, but it seems we don't differ here, and
I'm not sure how much this all above is consistent wrt. this earlier
It seems, after testing only (plus no official spec against this idea),
you could be almost sure there is no such test possible. And, if it
were done a few years ago, you think it still should be not enough to
make a decision on changing this smp_rmb because of lack of official
specs? Besides, there is probably so much features guessing in arch
and drivers sections, this reorder testing should look as solid as a

I don't agree with this - IMO we should care only about currently used

Strange... I feel exactly opposite. Are you sure you've chosen the
right job (...and the right system)?

Jarek P.
-
(...plus of course proper smp_rmb & smp_wmb vs. smp_mb interpretation
probably available from Paul McKenney or Davide Libenzi before this
Intel spec, as well...)

Jarek P.
-
You could have tried the optimization before, and
gotten better performance. But without solid knowledge that
the optimization is _valid_, you risk having a kernel
that performs great but suffers the occasional glitch and
therefore is unstable and crashes the machine "now and then".
This sort of thing can't really be figured out by experimentation, because
the bad cases might happen only with some processors, some
combinations of memory/chipsets, or with some minimum
number of processors. Such problems can be very hard
to find, especially considering that other plain bugs also
cause crashes.

Therefore, the "ineffective code" was used because it was
the only safe alternative. Now we know, so now we may optimize.

Helge Hafting
-
Sorry, I don't understand this logic at all. Since bad cases
happen independently from any specifications and Intel doesn't
take any legal responsibility for such information, it seems we
should better still not optimize?

Jarek P.
-
The point is that we _trust_ intel when they say "this will work".
Therefore, we can use the optimizations. It was never about
legal matters. If we didn't trust intel, then we couldn't
use their processors at all.

We couldn't take the chance before. It was not documented
to work, verification by testing would not be trivial at all for
this case.
Linux is about "stability first, then performance".
Now we _know_ that we can have this optimization without
compromising stability. Nobody knew before!

Helge Hafting
-
On Fri, Oct 12, 2007 at 02:44:51PM +0200, Helge Hafting wrote:
But there was nothing about trust. Usually you don't trust somebody
but somebody's opinions. The problem is there was no valid opinion,

So, you think this would be the first or the least credibly
verified undocumented feature used in linux? Then, it seems
I can try to install this linux on my laptop at last! (...
And, I can trust you, it will not break anything...?)

Thanks,
Jarek P.
-
"Trusting people or their opinions" is only about use of the
english language, and not that interesting to bring up here.
Surely you know that lots of people here have english as
a secondary language only. Interesting for me to know, but
I never claimed that linux will work on your laptop, so no:
You can't take my word for that, because I never gave it!
It is well known that some laptops don't work with linux,
I have no idea if yours will work, I don't even know what kind it is.

I told you the reasoning behind using _this particular optimization_,
the same does _not_ apply to everything else. If you think every
kernel decision is made the same way, then you are mistaken.
Things don't work that way.
First, several people are involved - they think differently.
Second, "what kind of tricks to use" is not an all-or-nothing
approach. If linux were to use every undocumented trick
that might or might not work, then linux would fail on
lots of hardware. It would not be useful.
If linux took the other approach and never used any "tricks",
then it'd be slow and boring.

Some things are much easier to test - you construct a testcase
or just build a test kernel and benchmark it. If all is ok, then
the "trick" is useable. Some cases are a clear win for lots of
machines, and the possible failure cases involves
very rare hardware. So it might get used. Some tricks have
a failure mode that is rare but completely obvious when it happens.
So it gets used, and "troublesome hardware" is added to a blacklist
as needed.

Some "tricks" however, are hard to figure out without docs.
There may be no good way to test. The tricks
may cause instability that will be very hard to track down, and this could
happen on a wide range of hardware. So such tricks don't get used, until
adequate documentation appears. In this case, it seems like intel,
who make and design the processors in question and therefore
know them well enough, provided such documentation. That
makes a previously dubious optimization safe.

Helge H...
Of course, I know this problem: sometimes it's very hard to make people
believe it's my secondary language! But this time I didn't see any
language problem. I simply pointed out that sometimes trusting could be

OK, this was supposed to be a joke... (Btw, can you remember burning
linux laptops?) I thought this "stability first" a bit funny, but this
was a really bad joke, sorry.

Thanks for these additional explanations - you are completely right!
Regards,
Jarek P.
-
We already do in probably more critical and liable to be problematic
cases (notably, spin_unlock).

So unless there is reasonable information for us to believe this
will be a problem, IMO the best thing to do is stick with the
specs. Intel is pretty reasonable with documenting errata I think.

With memory barriers specifically, I'm sure we have many more bugs
in the kernel than AMD or Intel have in their chips ;)
-
On Fri, Oct 12, 2007 at 11:44:27AM +0200, Nick Piggin wrote:
100% right - if there are any specs. But it seems for a few years
this spec was missing or there is some change of mind, I presume?

Jarek P.
-
wmb() on x86 must always include a barrier, because stores can go out of
order in many cases when dealing with devices (eg. WC memory).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Index: linux-2.6/include/asm-i386/system.h
===================================================================
--- linux-2.6.orig/include/asm-i386/system.h
+++ linux-2.6/include/asm-i386/system.h
@@ -216,6 +216,7 @@ static inline unsigned long get_limit(un

#define mb() alternative("lock; addl $0,0(%%esp)", "mfence", X86_FEATURE_XMM2)
#define rmb() alternative("lock; addl $0,0(%%esp)", "lfence", X86_FEATURE_XMM2)
+#define wmb() alternative("lock; addl $0,0(%%esp)", "sfence", X86_FEATURE_XMM)

/**
* read_barrier_depends - Flush all pending reads that subsequents reads
@@ -271,18 +272,14 @@ static inline unsigned long get_limit(un

#define read_barrier_depends() do { } while(0)
-#ifdef CONFIG_X86_OOSTORE
-/* Actually there are no OOO store capable CPUs for now that do SSE,
- but make it already an possibility. */
-#define wmb() alternative("lock; addl $0,0(%%esp)", "sfence", X86_FEATURE_XMM)
-#else
-#define wmb() __asm__ __volatile__ ("": : :"memory")
-#endif
-
#ifdef CONFIG_SMP
#define smp_mb() mb()
#define smp_rmb() rmb()
-#define smp_wmb() wmb()
+#ifdef CONFIG_X86_OOSTORE
+# define smp_wmb() wmb()
+#else
+# define smp_wmb() barrier()
+#endif
#define smp_read_barrier_depends() read_barrier_depends()
#define set_mb(var, value) do { (void) xchg(&var, value); } while (0)
#else
Index: linux-2.6/include/asm-x86_64/system.h
===================================================================
--- linux-2.6.orig/include/asm-x86_64/system.h
+++ linux-2.6/include/asm-x86_64/system.h
@@ -159,12 +159,8 @@ static inline void write_cr8(unsigned lo
*/
#define mb() asm volatile("mfence":::"memory")
#define rmb() asm volatile("lfence":::"memory")
-
-#ifdef CONFIG_UNORDERED_IO
#define wmb() asm volatile("sfence" ::: "memory")
-#else
-#define wmb() asm volatile("...
On Thu, Oct 04, 2007 at 07:22:58AM +0200, Nick Piggin wrote:
> -#ifdef CONFIG_X86_OOSTORE
> -/* Actually there are no OOO store capable CPUs for now that do SSE,
> - but make it already an possibility. */
> -#define wmb() alternative("lock; addl $0,0(%%esp)", "sfence", X86_FEATURE_XMM)
> -#else
> -#define wmb() __asm__ __volatile__ ("": : :"memory")
> -#endif
> -
> #ifdef CONFIG_SMP
> #define smp_mb() mb()
> #define smp_rmb() rmb()
> -#define smp_wmb() wmb()
> +#ifdef CONFIG_X86_OOSTORE
> +# define smp_wmb() wmb()
> +#else
> +# define smp_wmb() barrier()
> +#endif

The only vendor that ever implemented OOSTOREs was Centaur, and they
only did in the Winchip generation of the CPUs. When they dropped it
from the C3, I asked whether they intended to bring it back, and the
answer was "extremely unlikely".

So we can probably just drop that "just in case" clause above, and just
do..

#define smp_wmb() barrier()

Dave
Do you know if it made a big performance difference?
But yes we should probably just remove this special case to make
maintenance easier.

-Andi
-
On Thu, Oct 04, 2007 at 07:53:16PM +0200, Andi Kleen wrote:
>
> > The only vendor that ever implemented OOSTOREs was Centaur, and they
> > only did in the Winchip generation of the CPUs. When they dropped it
> > from the C3, I asked whether they intended to bring it back, and the
> > answer was "extremely unlikely".
> >
>
> Do you know if it made a big performance difference?

On the winchip, it was a huge win. I can't remember exact numbers,
but pretty much every benchmark I threw at it at the time showed
significant improvement.

> But yes we should probably just remove this special case to make
> maintenance easier.

It's CONFIG_SMP anyway, which none of the winchips were.
SMP+OOSTORE just didn't happen, and I'd be surprised if
any vendor makes it happen any time soon.
(Even if so, it's likely we'd need to make additional changes
anyway, so adding it back shouldn't be a big deal.)

Dave
It's not. And we need memory barriers even without SMP
when talking to device drivers. Only the smp_*b()s get noped
on UP.

-Andi
-
On Thu, Oct 04, 2007 at 08:21:59PM +0200, Andi Kleen wrote:
> On Thursday 04 October 2007 20:10:44 Dave Jones wrote:
> > On Thu, Oct 04, 2007 at 07:53:16PM +0200, Andi Kleen wrote:
> > >
> > > > The only vendor that ever implemented OOSTOREs was Centaur, and they
> > > > only did in the Winchip generation of the CPUs. When they dropped it
> > > > from the C3, I asked whether they intended to bring it back, and the
> > > > answer was "extremely unlikely".
> > > >
> > >
> > > Do you know if it made a big performance difference?
> >
> > On the winchip, it was a huge win. I can't remember exact numbers,
> > but pretty much every benchmark I threw at it at the time showed
> > significant improvement.
>
> Significant as in >10%?

"Worth about 10-20% performance" according to the 2.4.18pre9-ac4
release notes: http://www.linuxtoday.com/news_story.php3?ltsn=2002-02-14-015-20-NW-KN

> > > But yes we should probably just remove this special case to make
> > > maintenance easier.
> > It's CONFIG_SMP anyway, which none of the winchips were.
>
> It's not.

You're right it isn't now, but Nicks patch seems to change it so that it is.
...
#ifdef CONFIG_SMP
#define smp_mb() mb()
#define smp_rmb() rmb()
-#define smp_wmb() wmb()
+#ifdef CONFIG_X86_OOSTORE
+# define smp_wmb() wmb()
+#else
+# define smp_wmb() barrier()
+#endif

> And we need memory barriers even without SMP
> when talking to device drivers. Only the smp_*b()s get noped
> on UP.

Good point.
Dave
That is only for smp_wmb() which are always SMP only
-Andi
-
On Thu, Oct 04, 2007 at 08:58:27PM +0200, Andi Kleen wrote:
> On Thursday 04 October 2007 20:41:07 Dave Jones wrote:
> > On Thu, Oct 04, 2007 at 08:21:59PM +0200, Andi Kleen wrote:
> > > On Thursday 04 October 2007 20:10:44 Dave Jones wrote:
> > > > On Thu, Oct 04, 2007 at 07:53:16PM +0200, Andi Kleen wrote:
> > > > >
> > > > > > The only vendor that ever implemented OOSTOREs was Centaur, and they
> > > > > > only did in the Winchip generation of the CPUs. When they dropped it
> > > > > > from the C3, I asked whether they intended to bring it back, and the
> > > > > > answer was "extremely unlikely".
> > > > > >
> > > > >
> > > > > Do you know if it made a big performance difference?
> > > >
> > > > On the winchip, it was a huge win. I can't remember exact numbers,
> > > > but pretty much every benchmark I threw at it at the time showed
> > > > significant improvement.
> > >
> > > Significant as in >10%?
> >
> > "Worth about 10-20% performance" according to the 2.4.18pre9-ac4
> > release notes: http://www.linuxtoday.com/news_story.php3?ltsn=2002-02-14-015-20-NW-KN
>
> Are there numbers for a newer kernel available too?

no idea, my winchips died about 5 years ago.
Dave
Got a couple here just need a mainboard 8)
-
