From what I have seen, a lot of this comes down to the pmap. The amd64 pmap
is missing many of the optimizations that have been made to the i386 one,
particularly deferred address space switching. Threaded programs especially
are much slower on amd64 for that reason. Once the two pmaps are finally
merged, things should be a lot better.
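
For anyone unfamiliar with the idea, here is a rough toy model of what
deferred address space switching buys you. It is only a sketch of the general
technique, not code from either pmap: the Cpu class, the method names and the
schedule are all made up for illustration, and it assumes that kernel mappings
are present in every address space, so a kernel thread can run on whatever
page tables happen to be loaded.

```python
# Toy model of deferred (lazy) address space switching.
# Every name here is hypothetical; this is not taken from any real pmap.

class Cpu:
    def __init__(self):
        self.loaded_pmap = "kernel"  # address space currently loaded in the MMU
        self.reloads = 0             # number of expensive page-table reloads

    def _load(self, pmap):
        # Stands in for reloading the page-table base register,
        # which typically also flushes TLB entries.
        self.loaded_pmap = pmap
        self.reloads += 1

    def switch_eager(self, next_pmap):
        # Reload on every context switch, even when nothing would change.
        self._load(next_pmap)

    def switch_deferred(self, next_pmap):
        # Assumption: kernel threads only touch kernel mappings, which exist
        # in every address space, so they borrow whatever is already loaded.
        if next_pmap == "kernel":
            return
        # Threads of the same process share one pmap: nothing to reload.
        if next_pmap == self.loaded_pmap:
            return
        self._load(next_pmap)


def count_reloads(schedule, deferred):
    """Run a sequence of context switches and return the reload count."""
    cpu = Cpu()
    for pmap in schedule:
        if deferred:
            cpu.switch_deferred(pmap)
        else:
            cpu.switch_eager(pmap)
    return cpu.reloads


# Threads of process "A" interleaved with a kernel thread, then process "B".
schedule = ["A", "kernel", "A", "A", "kernel", "B", "kernel", "B"]
eager = count_reloads(schedule, deferred=False)
lazy = count_reloads(schedule, deferred=True)
```

With this made-up schedule the eager path reloads on all eight switches,
while the deferred path reloads only twice, once for A and once for B. A
threaded program switches between threads of the same process a lot, which
is why it would feel the difference most.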