Yes, it should be able to process this in a register, instead of storing
to the stack and then merging later.
It's possible it's trying to do that to hide latency, but that's clearly
a lose in this case.
I'd hate to obfuscate the code, and I'd *really* hate to obfuscate the
code to work around gcc brokenness and then not even bother to tell
the gcc folks.
-hpa