On Wed, Mar 31, 2010 at 11:54:30AM -0400, Vivek Goyal wrote:
Yes, the problem (as I understand it) is that the triggering of DMA
operations to/from a device isn't synchronized with the iommu itself.
I.e. to conduct a dma you have to:
1) map the in-memory buffer to a dma address using something like
pci_map_single. This results (in systems with an iommu) in page table
space being allocated in the iommu for the translation.
2) trigger the dma to/from the device by tickling whatever hardware the
device has mapped.
3) complete the dma by calling pci_unmap_single (or other function) which
frees the page table space in the iommu.
The problem, exactly as you indicate, is that on a kdump panic, we might boot the
new kernel and re-enable the iommu with these dmas still in flight. If we start
messing about with the iommu page tables at that point, we start getting all
sorts of errors and various other failures.
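
To make the sequence above concrete, here is a rough userspace sketch of it.
This is a toy model, not kernel code: every toy_* name is hypothetical, the
"iommu" is just a small array of translation slots, and a "fault" is simply a
device access that finds no valid translation. It only illustrates why a dma
that is still in flight breaks once the tables under it are freed or rebuilt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_SLOTS      8           /* hypothetical table size */
#define TOY_PAGE_SHIFT 12
#define TOY_BAD_ADDR   UINT64_MAX

/* Toy stand-in for the iommu translation table. */
struct toy_iommu {
    uint64_t phys[TOY_SLOTS];   /* physical address behind each slot */
    bool     valid[TOY_SLOTS];  /* slot currently holds a translation */
    unsigned faults;            /* device accesses that found no translation */
};

/* Step 1: allocate a translation slot and hand back a bus (dma) address. */
static uint64_t toy_map_single(struct toy_iommu *mmu, uint64_t phys)
{
    for (size_t i = 0; i < TOY_SLOTS; i++) {
        if (!mmu->valid[i]) {
            mmu->valid[i] = true;
            mmu->phys[i] = phys;
            return (uint64_t)i << TOY_PAGE_SHIFT;
        }
    }
    return TOY_BAD_ADDR;
}

/* Step 2: the device touches memory; the access goes through the table. */
static bool toy_device_dma(struct toy_iommu *mmu, uint64_t dma_addr,
                           uint64_t *phys_out)
{
    uint64_t slot = dma_addr >> TOY_PAGE_SHIFT;

    if (slot >= TOY_SLOTS || !mmu->valid[slot]) {
        mmu->faults++;          /* no translation: the access fails */
        return false;
    }
    *phys_out = mmu->phys[slot];
    return true;
}

/* Step 3: free the translation slot again. */
static void toy_unmap_single(struct toy_iommu *mmu, uint64_t dma_addr)
{
    uint64_t slot = dma_addr >> TOY_PAGE_SHIFT;

    if (slot < TOY_SLOTS)
        mmu->valid[slot] = false;
}

/* Roughly what a freshly booted kernel does: start over with empty tables. */
static void toy_reinit(struct toy_iommu *mmu)
{
    for (size_t i = 0; i < TOY_SLOTS; i++)
        mmu->valid[i] = false;
}

/* Walks the normal lifecycle, then the kdump case; returns the fault count. */
static unsigned toy_lifecycle_demo(void)
{
    struct toy_iommu mmu = {0};
    uint64_t phys = 0;
    bool ok;

    uint64_t a = toy_map_single(&mmu, 0x1000);
    assert(a != TOY_BAD_ADDR);
    ok = toy_device_dma(&mmu, a, &phys);      /* mapped: translates fine */
    assert(ok && phys == 0x1000);
    toy_unmap_single(&mmu, a);
    ok = toy_device_dma(&mmu, a, &phys);      /* after unmap: fault */
    assert(!ok);

    uint64_t b = toy_map_single(&mmu, 0x2000);
    toy_reinit(&mmu);                         /* new kernel resets tables */
    ok = toy_device_dma(&mmu, b, &phys);      /* dma still in flight: fault */
    assert(!ok);

    return mmu.faults;
}
```

Calling toy_lifecycle_demo() from a main() exercises both cases: the access
after the unmap and the access after the re-init are the two that fault.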
Yeah, that's a solution, but it seems awfully complex to me. To do that, we need
to teach every iommu we support about kdump, by telling it how much space to
reserve, and when to use it and when not to (i.e. we'd have to tell it to use
the kdump space, vs the normal space dependent on the status of the
reset_devices flag, or something equally unpleasant).
Actually, thinking about it, I'm not sure that will even work, as IIRC the iommu
only has one page table base pointer. So we would either need to re-write that
pointer to point into the kdump kernel's memory space (invalidating the old table
entries, which perpetuates this bug), or we would need to further enhance the
iommu code to be able to access the old page tables via
read_from_oldmem/write_to_oldmem when booting a kdump kernel, wouldn't we?
Using this method, all we really do is ensure that, prior to disabling
the iommu, any pending dmas are complete. That way, when we
re-enable the iommu in the kdump kernel, we can safely manipulate the new page
tables, knowing that no pending dma is using them.
In fairness to this debate, my proposal does have a small race condition. In
the above sequence, because the cpu triggers a dma independently of the setup of
the mapping in the iommu, it is possible that a dma might be triggered
immediately after we flush the iotlb, which may leave an in-flight dma pending
while we boot the kdump kernel. In practice though, this will never happen. By
the time we arrive at this code, we've already executed
native_machine_crash_shutdown which:
1) halts all the other cpus in the system
2) disables local interrupts
Because of those two events, we're effectively on a path that we can't be
preempted from. So as long as we don't trigger any dma operations between our
return from iommu_shutdown and machine_kexec (which is the next call), we're
safe.
It blocks the cpu until any pending DMA operations are complete. Hmm, as I
think about it, there is still a small possibility that a device like a NIC
which has several buffers pre-dma-mapped could start a new dma before we
completely disable the iommu, although that window is small. I never saw that
in my testing, and hitting it would be fairly difficult I think, since it's
literally just a few hundred cycles between the flush and the actual hardware
disable operation.
According to this though:
http://support.amd.com/us/Processor_TechDocs/34434-IOMMU-Rev_1.26_2-11-09.pdf
That window could be closed fairly easily, by simply disabling read and write
permissions for each device table entry prior to calling flush. If we do that,
then flush the device table, any subsequently started dma operation would just
get noted in the error log, which we could ignore, since we're about to boot to
the kdump kernel anyway.
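
As a rough sketch of that ordering (again a userspace toy model, not the
actual driver change): the toy_* names, the flag values and the "cached" copy
of the device table are all assumptions made for illustration, and the real
entry layout is whatever the spec linked above defines. The point is only the
order of operations: clear read/write in every entry, flush so the hardware
sees the change, and from then on any new dma is logged rather than performed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_DEVICES    4            /* hypothetical number of devices */
#define TOY_PERM_READ  (1u << 0)    /* hypothetical flag values, not the */
#define TOY_PERM_WRITE (1u << 1)    /* real device table entry layout    */

struct toy_dev_table {
    uint32_t perms[TOY_DEVICES];    /* in-memory device table entries */
    uint32_t cached[TOY_DEVICES];   /* copy the toy hardware has cached */
    unsigned error_log;             /* dma attempts that were blocked */
};

/* Flush: the toy hardware drops its cache and re-reads every entry. */
static void toy_flush(struct toy_dev_table *t)
{
    for (size_t i = 0; i < TOY_DEVICES; i++)
        t->cached[i] = t->perms[i];
}

/* A device attempts a dma; the check uses the cached entry only. */
static bool toy_dma(struct toy_dev_table *t, size_t dev, uint32_t need)
{
    if (dev >= TOY_DEVICES)
        return false;
    if ((t->cached[dev] & need) != need) {
        t->error_log++;             /* blocked and noted in the log */
        return false;
    }
    return true;
}

/* Proposed ordering: revoke read/write everywhere, then flush. */
static void toy_block_all(struct toy_dev_table *t)
{
    for (size_t i = 0; i < TOY_DEVICES; i++)
        t->perms[i] &= ~(TOY_PERM_READ | TOY_PERM_WRITE);
    toy_flush(t);
}

/* Returns how many dma attempts landed in the error log. */
static unsigned toy_window_demo(void)
{
    struct toy_dev_table t = {0};
    bool ok;

    for (size_t i = 0; i < TOY_DEVICES; i++)
        t.perms[i] = TOY_PERM_READ | TOY_PERM_WRITE;
    toy_flush(&t);

    ok = toy_dma(&t, 1, TOY_PERM_WRITE);    /* normal operation: allowed */
    assert(ok);

    toy_block_all(&t);                      /* crash path: revoke + flush */

    ok = toy_dma(&t, 1, TOY_PERM_WRITE);    /* late dma: blocked, logged */
    assert(!ok);
    ok = toy_dma(&t, 2, TOY_PERM_READ);     /* same for a read */
    assert(!ok);

    return t.error_log;
}
```

In this model the two dma attempts made after toy_block_all() are the only
ones that reach the error log, which is the behaviour we'd be relying on
during the window between the flush and the hardware disable.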
Would you like me to respin w/ that modification?
Neil
--