On Thu, 5 Jun 2008, Nick Piggin wrote:

That's _one_ possible implementation. Quite frankly, I think it's the less likely one.

It's much more likely that the cache read access and the store buffer probe happen in parallel (this is a really important hotpath for any CPU, but even more so on x86, where more of the loads and stores are spills). And then the store buffer logic would return the data and a byte mask (where the mask would be all zeroes for a miss), and the returned value is just the appropriate mix of the two.

You'd have to ask somebody very knowledgeable inside Intel and AMD, and it is quite likely that different microarchitectures have different approaches...

Oh, absolutely, the perfect algorithm would actually get the right answer: notice that the cacheline got evicted, and retry the whole sequence so that the result is coherent.

But we do know that Intel expressly documents that loads and stores can pass each other, and documents the fact that the store buffer is there. So I bet that this is visible in some micro-architecture, even if it's not necessarily visible in _all_ of them.

The recent Intel memory ordering whitepaper makes it very clear that loads can pass earlier stores, and in particular that the store buffer allows intra-processor forwarding to subsequent loads (2.4 in their whitepaper).

It _could_ be just a "for future CPUs" thing, but quite frankly, I'm 100% sure it isn't. The store->load forwarding is such a critical performance issue that I can pretty much guarantee that it doesn't always hit the cacheline.

Of course, the partial store forwarding case is not nearly as important, and stalling is quite a reasonable implementation approach. I just personally suspect that doing the unconditional byte-masking is actually _simpler_ to implement than the stall, so..

Linus
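The byte-masking scheme described above can be sketched in software. This is only a toy model under stated assumptions, not a description of any real Intel or AMD microarchitecture: the names `sb_entry`, `sb_probe`, `sb_lookup` and `merge_load` are hypothetical, loads and stores are at most 8 bytes, and values are little-endian. The probe returns forwarded data plus a byte mask (all zeroes on a miss), and the load result is the per-byte mix of that data with whatever the cache returned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one pending store sitting in the store buffer. */
struct sb_entry {
    uint64_t addr;  /* byte address of the store */
    uint8_t  size;  /* store width in bytes, 1..8 */
    uint64_t data;  /* little-endian value; low `size` bytes are valid */
};

/* Result of probing the store buffer for a load. */
struct sb_probe {
    uint64_t data;  /* forwarded bytes, already positioned for the load */
    uint8_t  mask;  /* bit i set => byte i of the load comes from the buffer */
};

/*
 * Probe the store buffer for a load of `size` bytes at `addr`.
 * Entries are ordered oldest to youngest, so a younger store
 * overwrites bytes forwarded from an older one.
 */
static struct sb_probe sb_lookup(const struct sb_entry *sb, size_t n,
                                 uint64_t addr, unsigned size)
{
    struct sb_probe r = { 0, 0 };

    assert(size >= 1 && size <= 8);
    for (size_t i = 0; i < n; i++) {
        for (unsigned b = 0; b < sb[i].size; b++) {
            uint64_t a = sb[i].addr + b;
            if (a < addr || a >= addr + size)
                continue;                       /* byte not covered by the load */
            unsigned pos = (unsigned)(a - addr);
            uint64_t byte = (sb[i].data >> (8 * b)) & 0xff;
            r.data &= ~(0xffULL << (8 * pos));
            r.data |= byte << (8 * pos);
            r.mask |= (uint8_t)(1u << pos);
        }
    }
    return r;
}

/* Unconditionally mix cache data and forwarded data, byte by byte. */
static uint64_t merge_load(uint64_t cache_data, struct sb_probe p)
{
    uint64_t out = cache_data;

    for (unsigned i = 0; i < 8; i++) {
        if (p.mask & (1u << i)) {
            uint64_t lane = 0xffULL << (8 * i);
            out = (out & ~lane) | (p.data & lane);
        }
    }
    return out;
}
```

In this model a miss needs no special case: the mask is zero and `merge_load` returns the cache data unchanged, which is the sense in which the unconditional mix could be simpler than a stall. Note that the model says nothing about coherence; if the cacheline changes between the cache read and the merge, the mixed value is exactly the kind of result the thread is arguing about.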
