btw in case you were thinking of a normal store to WB memory rather than
a non-temporal store... i ran a microbenchmark doing sequential 16-byte
stores across a 16MiB region (aligned to 4096 bytes) on a xeon 53xx
series CPU (4MiB L2) + 5000X northbridge. the avg latency of MOVNTPS is
12 cycles, whereas the avg latency of MOVAPS is 20 cycles.
the inner loop is unrolled 16 times, so there are literally 4 cache
lines' worth of stores (16 x 16 bytes = 256 bytes, with 64-byte lines)
being stuffed into the store queue as fast as possible... and there's no
coalescing for normal stores even on this modern CPU.
i'm certain i'll see the same thing on AMD... it's a very hard thing to do
in hardware without the non-temporal hint.
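
for anyone who wants to try this on their own box, here's a rough sketch
of the kind of loop described above. it is not the original benchmark
source: the function names (avg_cycles_per_store, run_store_bench), the
use of RDTSC for timing, the pre-touch memset and the single pass over
the buffer are all assumptions made for illustration. it assumes GCC or
clang on x86, and it reports TSC ticks per store, which only roughly
tracks core cycles.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <x86intrin.h> /* __rdtsc, _mm_store_ps, _mm_stream_ps, _mm_malloc */

enum { REGION_BYTES = 16 * 1024 * 1024, PAGE_ALIGN = 4096 };

/* 16 stores of 16 bytes each = 256 bytes = 4 cache lines (64-byte lines). */
#define STORE4(op, p, v) \
    op((p), (v)); op((p) + 4, (v)); op((p) + 8, (v)); op((p) + 12, (v))
#define STORE16(op, p, v)                                   \
    do {                                                    \
        STORE4(op, (p), (v));      STORE4(op, (p) + 16, (v)); \
        STORE4(op, (p) + 32, (v)); STORE4(op, (p) + 48, (v)); \
    } while (0)

/* One pass over buf, one 16-byte store per 16 bytes.
 * bytes must be a multiple of 256 and buf must be 16-byte aligned.
 * Returns TSC ticks divided by the number of stores issued. */
static double avg_cycles_per_store(float *buf, size_t bytes, int non_temporal)
{
    const __m128 v = _mm_set1_ps(1.0f);
    const size_t nfloats = bytes / sizeof(float);
    size_t i;
    uint64_t start, end;

    start = __rdtsc();
    if (non_temporal) {
        for (i = 0; i < nfloats; i += 64)
            STORE16(_mm_stream_ps, buf + i, v); /* MOVNTPS */
        _mm_sfence(); /* order the write-combined stores before the end timestamp */
    } else {
        for (i = 0; i < nfloats; i += 64)
            STORE16(_mm_store_ps, buf + i, v);  /* MOVAPS */
    }
    end = __rdtsc();

    return (double)(end - start) / (double)(bytes / 16);
}

/* Allocates a page-aligned 16MiB region, runs one timed pass, checks the
 * data landed, and frees the region. Returns -1.0 on failure. */
static double run_store_bench(int non_temporal)
{
    float *buf = (float *)_mm_malloc(REGION_BYTES, PAGE_ALIGN);
    double cycles;

    if (buf == NULL)
        return -1.0;

    /* Assumption: fault the pages in first so page faults are not timed. */
    memset(buf, 0, REGION_BYTES);

    cycles = avg_cycles_per_store(buf, REGION_BYTES, non_temporal);

    /* Read back both ends so the compiler cannot discard the stores. */
    if (buf[0] != 1.0f || buf[REGION_BYTES / sizeof(float) - 1] != 1.0f)
        cycles = -1.0;

    _mm_free(buf);
    return cycles;
}
```

call run_store_bench(0) for the MOVAPS version and run_store_bench(1) for
the MOVNTPS version and compare the two numbers. a single pass is noisy,
so repeating it a few times and keeping the minimum is probably a good
idea. the numbers will depend heavily on the CPU and chipset.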
-dean