If you had actually read what I've been saying in this thread, you
would see that I've described a method to do this copy avoidance in
a completely stateless manner.
You don't need to implement a TCP stack in the card in order to do
data placement optimizations. They can be done completely statelessly.
Also, large portions of the CPU overhead are per-packet transactional
costs, which are significantly reduced by existing technologies such
as LRO (Large Receive Offload).
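
To illustrate the transactional-cost point, here is a minimal sketch of
LRO-style aggregation. It is not the kernel's or any driver's
implementation. The Segment type, the flow tuple, and the coalesce()
helper are hypothetical names made up for this example, and it only
merges segments that arrive back to back, whereas a real implementation
tracks several flows at once. The idea it shows is that in-sequence
segments of one flow are merged before the stack sees them, so the
per-packet work is paid once per merged unit instead of once per wire
packet.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    flow: tuple     # hypothetical flow key: (src, sport, dst, dport)
    seq: int        # sequence number of the first payload byte
    payload: bytes


def coalesce(segments, max_bytes=65535):
    """Merge runs of adjacent, in-sequence segments of the same flow.

    Assumption for this sketch: only segments that arrive back to back
    are merged; anything out of order or from another flow starts a
    new unit and is passed up unmerged.
    """
    merged = []
    for seg in segments:
        last = merged[-1] if merged else None
        if (last is not None
                and last.flow == seg.flow
                and last.seq + len(last.payload) == seg.seq
                and len(last.payload) + len(seg.payload) <= max_bytes):
            # Next in-order segment of the same flow: extend the unit.
            merged[-1] = Segment(last.flow, last.seq,
                                 last.payload + seg.payload)
        else:
            # New flow, gap in sequence space, or size limit reached.
            merged.append(Segment(seg.flow, seg.seq, seg.payload))
    return merged
```

With ten in-order 1448-byte segments of one flow, coalesce() hands the
stack a single 14480-byte unit, so whatever the per-packet cost is, it
is paid once rather than ten times.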
--