On Thu, 14 Aug 2008, Mathieu Desnoyers wrote:

Absolutely.

Locked ops show up not just in microbenchmarks looping over the instruction, they show up in "real" benchmarks too. We added a single locked instruction (maybe it was two) to the page fault handling code some time ago, and the reason I noticed it was that it actually made the page fault cost visibly more expensive in lmbench. That was a _single_ instruction in the hot path (or maybe two).

And the page fault path is some of the most timing critical in the whole kernel - if you have everything cached, the cost of doing the page faults to populate new processes for some fork/exec-heavy workload (and compiling the kernel is just one of those - any traditional unix behaviour will show this) is critical.

This is one of the things AMD does a _lot_ better than Intel. Intel tends to have a 30-50 cycle cost (with later P4s being *much* worse), while AMD tends to have a cost of around 10-15 cycles.

It's one of the things Intel promises to have improved in the next-gen uarch (Nehalem), and while I am not supposed to give out any benchmarks, I can confirm that Intel is getting much better at it. But it's going to be visible still, and it's really a _big_ issue on P4.

(Of course, on P4, the page fault exception cost itself is so high that the cost of atomics may be _relatively_ less noticeable in that particular path)

		Linus
