I agree that wbinvd() seems to be faster on large arrays on the
processors I've tested. But isn't there a severe latency problem with
that instruction, one that makes people want to avoid it wherever
possible?
Also, I think we need to clarify the semantics of the c_p_a
functionality. Right now both AGP and DRM rely on c_p_a doing an
explicit cache flush. Otherwise the data won't appear on the device side
of the aperture.
If we use self-snoop, the AGP and DRM drivers can't rely on this flush
being performed and have to do the flush themselves. For
non-self-snooping processors, would the flush then need to be done twice?
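
To make the question concrete, here is a rough model of the flush
accounting. It is only a sketch and not kernel code: the struct, its
fields and the function names are all made up for illustration, and it
assumes a driver that cannot tell whether c_p_a flushed must always
flush itself.

```c
#include <stdbool.h>

/* Hypothetical model of who flushes when pages go into the aperture. */
struct flush_model {
    bool cpu_self_snoop;           /* CPU advertises self-snoop */
    bool cpa_skips_on_self_snoop;  /* proposed: c_p_a omits the flush
                                      on self-snooping CPUs */
};

/* Flushes performed inside c_p_a itself. */
int cpa_flushes(const struct flush_model *m)
{
    if (m->cpa_skips_on_self_snoop && m->cpu_self_snoop)
        return 0;
    return 1;
}

/* Flushes the AGP/DRM driver has to issue on its own. Once c_p_a is
 * allowed to skip the flush, the driver can no longer rely on it and
 * (by assumption) always flushes. */
int driver_flushes(const struct flush_model *m)
{
    return m->cpa_skips_on_self_snoop ? 1 : 0;
}

/* Total flushes seen by the affected pages. */
int total_flushes(const struct flush_model *m)
{
    return cpa_flushes(m) + driver_flushes(m);
}
```

With today's semantics the total is one flush regardless of CPU. With
the self-snoop change it stays at one on self-snooping CPUs (done by the
driver) but becomes two on non-self-snooping ones, which is the case I'm
asking about.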
/Thomas