>> >> I'm using Spansion MirrorBit S29GL128N, which reads at about 0.6 MByte/s.
Well, the first read takes 100ns (plus 300ns of other chipset
overhead), but other reads in the same page are only an extra 25ns
each. So your benefit is not from having the entire executable in
cache; it's from having the next 7 instructions in the cacheline for
only an extra 25ns each instead of 400ns.
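
To put rough numbers on that, here is a small sketch of the arithmetic.
It assumes a cacheline fill is 8 reads (the first one plus "the next 7")
and that without page mode every read pays the full 100ns + 300ns. The
constant and function names are made up for illustration; check the
actual figures against your chipset and flash datasheets.

```python
# Timing figures quoted in this message; all values in nanoseconds.
INITIAL_ACCESS_NS = 100    # first read of a page
CHIPSET_OVERHEAD_NS = 300  # other chipset overhead on that first read
PAGE_READ_NS = 25          # each further read within the same page
READS_PER_LINE = 8         # assumption: first read plus "the next 7"


def line_fill_ns(page_mode: bool) -> int:
    """Time to fetch one cacheline's worth of reads from the flash."""
    first = INITIAL_ACCESS_NS + CHIPSET_OVERHEAD_NS
    # Assumption: with page mode off, every read costs as much as the first.
    following = PAGE_READ_NS if page_mode else first
    return first + (READS_PER_LINE - 1) * following


if __name__ == "__main__":
    with_page = line_fill_ns(page_mode=True)
    without_page = line_fill_ns(page_mode=False)
    print(f"page mode on:  {with_page} ns per line")
    print(f"page mode off: {without_page} ns per line")
    print(f"ratio: {without_page / with_page:.2f}x")
```

Under those assumptions a line fill drops from 3200ns to 575ns, so
whether page-mode reads are actually enabled matters far more than how
much of the executable fits in cache.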
Usually these things can be fixed in the bootloader or by hacking the
kernel to tweak the relevant chipset registers.
--