Folx,
I am working on a piece of high-performance software that maintains instrumentation/performance counters. These need to be longs so that they don't wrap around too quickly.
Most are simple counters, and most (90%) will only ever be incremented from a single thread. Nearly all accesses (99.999%) are increments; the counters are read only once in a blue moon, from a different thread.
What is the best way to implement these without incurring too much synchronization overhead?
Assuming I don't synchronize at all, what would be the worst outcome?
I am OK with losing an occasional increment (if that's the worst that can happen).
If I were to use 64-bit atomics, what kind of penalty would there be? Again, most ops for any given counter come from one thread.
alignment
I hope your counters are aligned to cache lines? If multiple counters that are written by different threads share a cache line, that line will bounce between the cores' caches (false sharing).
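To illustrate the alignment point, here is a minimal C11 sketch that gives each counter its own cache line. The 64-byte line size, the names padded_counter and counters, and the table size of 8 are assumptions for illustration, not something from the question; check the line size of your actual target.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines (common on current x86-64 parts). */
#define CACHE_LINE_SIZE 64

/* alignas on the member raises the alignment of the whole struct, and the
 * compiler pads the struct size up to that alignment, so each array element
 * occupies exactly one cache line. */
struct padded_counter {
    alignas(CACHE_LINE_SIZE) uint64_t value;
};

static_assert(sizeof(struct padded_counter) == CACHE_LINE_SIZE,
              "each counter should fill exactly one cache line");

/* Hypothetical table: one slot per writer thread. */
#define NUM_COUNTERS 8
static struct padded_counter counters[NUM_COUNTERS];

/* Returns the counter slot for writer number i (no bounds check). */
static inline struct padded_counter *counter_slot(size_t i) {
    return &counters[i];
}
```

The price is memory: every 8-byte counter now takes 64 bytes, so this is only worth it for counters that are hot and written by different threads.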
The successors of the i386 architecture have very strong memory ordering (because they have to stay compatible with old code), so maybe there is no problem at all, as long as you don't port or recompile your code for another architecture. You just have to tell the compiler to update the global variable itself and not only a copy held in a register, by using appropriate memory barriers/clobbers or volatile. One caveat: in a 32-bit build, a 64-bit counter is written with two separate stores, so a reader can occasionally see a torn value.
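One portable way to express "really update the global, not a register copy" is C11 relaxed atomics, swapped in here for the volatile/clobber approach mentioned above. This is a sketch under assumptions: the perf_counter type and the function names are made up for illustration, and the static_assert expects a target where 64-bit atomics are lock-free (true on x86-64).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Assumption: 64-bit atomics are lock-free on the target. */
static_assert(ATOMIC_LLONG_LOCK_FREE == 2,
              "64-bit atomics are expected to be lock-free");

/* Hypothetical counter type. */
typedef struct {
    _Atomic uint64_t value;
} perf_counter;

/* Single-writer increment: only one thread ever calls this for a given
 * counter, so a relaxed load followed by a relaxed store is enough and no
 * locked read-modify-write instruction is needed. */
static inline void perf_counter_inc(perf_counter *c) {
    uint64_t v = atomic_load_explicit(&c->value, memory_order_relaxed);
    atomic_store_explicit(&c->value, v + 1, memory_order_relaxed);
}

/* Multi-writer increment for the minority of counters that are shared:
 * an atomic read-modify-write, so no increment is lost. */
static inline void perf_counter_inc_shared(perf_counter *c) {
    atomic_fetch_add_explicit(&c->value, 1, memory_order_relaxed);
}

/* Reader side: may run on any thread; never sees a torn value. */
static inline uint64_t perf_counter_read(perf_counter *c) {
    return atomic_load_explicit(&c->value, memory_order_relaxed);
}
```

On x86-64 the relaxed load and store compile to ordinary moves, so the single-writer path should cost about the same as a plain increment; only perf_counter_inc_shared pays for a locked instruction.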
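If the counter stays in cache, it is not unthinkable that it is never written back to RAM unless there is an explicit or implicit hardware memory barrier (like the kernel doing I/O in an interrupt). On a cache-coherent machine, though, a reader on another core still gets the current value from the cache.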
To play it safe, you could use local-only counters that spill into global counters periodically (every 256 updates or so). The global counters could be per-CPU as well, but with proper barriers so that they can be read from outside. That way you get strict error bounds.
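The spilling scheme could look roughly like the C11 sketch below. The names local_counter, global_events and FLUSH_INTERVAL are hypothetical, there is one global counter instead of per-CPU ones, and relaxed ordering is assumed to be good enough for statistics.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* "Every 256 updates or so", as suggested above. */
#define FLUSH_INTERVAL 256u
static_assert(FLUSH_INTERVAL > 0, "flush interval must be positive");

/* Hypothetical global counter, readable from any thread. */
static _Atomic uint64_t global_events;

/* Private per-thread state; never touched by other threads, so it needs
 * no synchronization at all. */
struct local_counter {
    uint32_t pending; /* increments not yet spilled to the global */
};

/* Hot path: a plain increment, plus one atomic add every FLUSH_INTERVAL
 * calls. */
static inline void local_counter_inc(struct local_counter *lc) {
    if (++lc->pending == FLUSH_INTERVAL) {
        atomic_fetch_add_explicit(&global_events, lc->pending,
                                  memory_order_relaxed);
        lc->pending = 0;
    }
}

/* Reader: the result is behind the true total by at most
 * FLUSH_INTERVAL - 1 increments per writer thread. */
static inline uint64_t global_events_read(void) {
    return atomic_load_explicit(&global_events, memory_order_relaxed);
}
```

For example, after 1000 increments from one thread the global holds 768 (three spills of 256) and 232 are still pending locally, which is within the bound of 255 per writer. A thread that exits should spill its remainder first.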