On Fri, Sep 28, 2007 at 09:15:06AM -0700, Linus Torvalds wrote:

Actually, in a userspace test I have (which actually does enough work to trigger out of order operations on my core2 but is otherwise pretty trivial), lfence takes 13 cycles, sfence takes 40 (neither of which actually solves the problem of load vs store ordering, but at least they might be operating on a slightly utilised memory subsystem rather than the stupidest possible microbenchmark). The dummy lock op takes around 75 cycles (of course, the core2 would always use the fences, but older CPUs will not and will be worse at the lock op too, probably). I suppose these could take significantly longer if there are uncached memory operations and such (I wasn't doing any significant amount of IO) -- I can't be sure, though.

So it isn't much, but it could be helpful. If the code is important enough to go without locks and instead use complex barriers, it might easily be worth saving these cycles.

Will add; I'll ask Alan to specify what he'd like to see there.

Maybe you're thinking of uncached / WC? Non-temporal stores to cacheable RAM apparently can go out of order too, and they are being used in the kernel for some things. Likewise for rep stos, apparently.

But this means they are already at odds with spin_unlock, unless they are enclosed with mfences everywhere they are used (which I think most are not). So this is an existing bug in the kernel.

So again the question comes up -- do we promote these kinds of stores to be regular x86 citizens, keep the strong memory barriers as they are, and eat 40 cycles with an sfence before each spin_unlock store; or do we fix the few users of non-temporal stores and continue with the model we've always had where stores are in-order? Or I guess the implicit option is to do nothing until some poor bastard has the pleasure of having to debug some problem.

Anyway, just keep in mind that this patch is not breaking anything that is not already fundamentally broken. Sure, it might happen to cause more actual individual cases to break, but if they just happened to be using real locking instead of explicit barriers, they would be broken anyway, right? (IOW, any new breakage is already conceptually broken, even if OK in practice due to the overstrictness of our current barriers).
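
To make the two options above concrete, here is a minimal userspace sketch in C11. It is not the kernel's spinlock code: `toy_spinlock`, `toy_trylock`, `toy_lock`, `toy_unlock` and `toy_unlock_fenced` are hypothetical names invented for illustration, and the inline `sfence` assumes GCC or Clang on x86-64. The first unlock is a plain release store, which is only safe if all earlier stores are in-order; the second pays for a store fence before releasing the lock, which is what would be needed if non-temporal stores were allowed inside the critical section without their own fences.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical toy lock for illustration; not the kernel's spinlock_t. */
typedef struct {
    atomic_bool locked;
} toy_spinlock;

/* Try to take the lock once. The exchange is an atomic read-modify-write
 * (a locked instruction on x86). Returns true if we acquired the lock. */
static inline bool toy_trylock(toy_spinlock *l)
{
    return !atomic_exchange_explicit(&l->locked, true, memory_order_acquire);
}

/* Spin until the lock is acquired. */
static inline void toy_lock(toy_spinlock *l)
{
    while (!toy_trylock(l))
        ; /* busy-wait */
}

/* Option "stores are in-order": unlock is just a release store, which on
 * x86 is an ordinary store instruction. This assumes no earlier store in
 * the critical section can become visible after it. */
static inline void toy_unlock(toy_spinlock *l)
{
    atomic_store_explicit(&l->locked, false, memory_order_release);
}

/* Option "pay for a fence": order all earlier stores (including any
 * non-temporal ones) before the store that releases the lock. */
static inline void toy_unlock_fenced(toy_spinlock *l)
{
#if defined(__GNUC__) && defined(__x86_64__)
    __asm__ __volatile__("sfence" ::: "memory");
#else
    /* Assumption: a full fence is an acceptable stand-in elsewhere. */
    atomic_thread_fence(memory_order_seq_cst);
#endif
    atomic_store_explicit(&l->locked, false, memory_order_release);
}
```

A caller would wrap a critical section as `toy_lock(&l); ...; toy_unlock(&l);` and switch to `toy_unlock_fenced(&l)` only where the section may have issued weakly ordered stores. The cost difference between the two unlock paths is the fence cost discussed above.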
