Hi Avi, I've purposed the following API for mask bit support. The main point is, QEmu can know which entries are enabled(by pci_enable_msix()). And for enabled entries, kernel own it, including MSI data/address and mask bit(routing table and mask bitmap). QEmu should use KVM_GET_MSIX_ENTRY ioctl to get them(and it can sync with them if it want to do so). Before entries are enabled, QEmu can still use it's own MSI table(because we didn't contain these kind of information in kernel, and it's unnecessary for kernel). The KVM_MSIX_FLAG_ENTRY flag would be clear if QEmu want to query one entry didn't exist in kernel - or we can simply return -EINVAL for it. I suppose it would be rare for QEmu to use this interface to get the context of entry(the only case I think is when MSI-X disable and QEmu need to sync the context), so performance should not be an issue. -- regards Yang, Sheng --
Why is there a need for the flag? If we simply get/set entire entries, that includes the mask bits? Isn't the mask bit in the last word? Or maybe I'm confused about the Also need a new exit reason to tell userspace that an msix entry has changed, so userspace can update mappings. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. --
We still want QEmu to cover a part of entries which hasn't been enabled yet(which won't existed in routing table), but kernel would cover all mask bit regardless of if it's enabled. So QEmu can query any entry to check the maskbit, but not We didn't cover it here - and it's in another MMIO space(PBA). Of course we can I think we don't need it. Whenever userspace want to get one mapping which is an enabled MSI-X entry, it can check it with the API above(which is quite rare, because kernel would handle all of them when guest is accessing them). If it's a disabled entry, the context inside userspace MMIO record is the correct one(and only one). The only place I think QEmu need to sync is when MSI-X is about to disabled, QEmu need to update it's own MMIO record. -- regards Yang, Sheng --
Don't understand. If we support reading/writing entire entries, that When an entry is masked, we need to set the pending bit for it somewhere. I guess this is broken in the existing code (without your So in-kernel handling of mmio would be decided per entry? I'm trying to simplify this, and simplest thing is - all or nothing. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. --
Even with my patch, we didn't support the pending bit. It would always return 0 now. What we supposed to do(after my patch checked in) is to check IRQ_PENDING flag of irq_desc->status(if the entry is masked), and return the result to userspace. That would involve some core change, like to export irq_to_desc(). I don't think So you would like to handle all MSI-X MMIO in kernel? -- regards Yang, Sheng --
The API needs to be compatible with the pending bit, even if we don't Yes. Writes to address or data would be handled by: - recording it into the shadow msix table - notifying userspace that msix entry x changed Reads would be handled in kernel from the shadow msix table. So instead of - guest reads/writes msix - kvm filters mmio, implements some, passes others to userspace we have - guest reads/writes msix - kvm implements all - some writes generate an additional notification to userspace -- error compiling committee.c: too many arguments to function --
One small proposal in addition: since all accesses are done from guest anyway, the shadow table can/should be stored using userspace memory, reducing the kernel memory overhead of the feature from up to 4K per MSIX table to just 8 bytes. --
I am not sure about it. But I suppose the structure should be the same? In fact it's pretty hard for me to image what's needed for virtio in the future, especially there is no such code now. I really prefer to deal with assigned device and virtio separately, which would make the work much easier. But seems you won't This can be implemented by this API, just adding a flag for it. And I would still I suppose we don't need to generate notification to userspace? Because every read/write is handled by kernel, and userspace just need interface to kernel to get/set the entry - and well, does userspace need to do it when kernel can handle all of them? Maybe not... -- regards Yang, Sheng --
First, I don't really see why the two cases are different (but I don't do a lot in this space). Surely between you and Michael, you have all the information? Second, my worry is a huge number of ABI variants that come from incrementally adding features. I want to implement bigger chunks of functionality. So I'd like to see all potential users addressed, at We could have the kernel handle addr/data writes by setting up an internal interrupt routing. A disadvantage is that more work is needed if we emulator interrupt remapping in qemu. -- error compiling committee.c: too many arguments to function --
As an alternative, interrupt remapping will need some API rework, right? Existing APIs only pass address/data for msi. -- MST --
IIRC interrupt remapping works with address/data to. It just interpret it differently from apic. -- Gleb. --
Yes. So since our APIs use address/data, this is an argument for doing --
Of course KVM should service reading from pending bitmask. For assigned device, it's kernel who would set the pending bit; but I am not sure for virtio. This In fact modifying irq routing in the kernel is also the thing I want to avoid. So, the flow would be: kernel get MMIO write, record it in it's own MSI table KVM exit to QEmu, by one specific exit reason QEmu know it have to sync the MSI table, then reading the entries from kernel QEmu found it's an write, so it need to reprogram irq routing table using the entries above done But wait, why should qemu read entries from kernel? By default exit we already have the information about what's the entry to modify and what to write, so we can use them directly. By this way, we also don't need an specific exit reason - just exit to qemu in normal way is fine. Then it would be: kernel get MMIO write, record it in it's own MSI table KVM exit to QEmu, indicate MMIO exit QEmu found it's an write, it would update it's own MSI table(may need to query mask bit from kernel), and reprogram irq routing table using the entries above done Then why should kernel kept it's own MSI table? I think the only reason is we can speed up reading in that way - but the reading we want to speed up is mostly on enabled entry(the first entry), which is already in the IRQ routing table... And for enabled/disabled entry, you can see it like this: for the entries inside routing table, we think it's enabled; otherwise it's disabled. Then you don't need to bothered by pci_enable_msix(). So our strategy for reading accelerating can be: If the entry contained in irq routing table, then use it; otherwise let qemu deal with it. Because it's the QEmu who owned irq routing table, the synchronization is guaranteed. We don't need the MSI table in the kernel then. And for writing, we just want to cover all of mask bit, but none of others. I think the concept here is more acceptable? The issue here is MSI table and irq routing table ...
Avi? And BTW, we can take routing table as a kind of *cache*, if the content is in the cache, then we can fetch it from the cache, otherwise we need to go back to fetch it from memory(userspace). -- regards --
The kernel should manage it in the same way. Virtio raises irq (via KVM_IRQ_LINE or vhost-net's irqfd), kernel sets pending bit. Note we need to be able to read and write the pending bitmask for live Because we have an interface where you get an exit if (addr % 4) < 3 and don't get an exit if (addr % 4) == 3. There is a gpa range which is partially maintained by the kernel and partially in userspace. It's a confusing interface. Things like 64-bit reads or writes need to be broken up and serviced in two different places. We already need to support this (for unaligned writes which hit two The reason is to keep a sane interface. Like we emulate instructions and msrs in the kernel and don't do half a job. I don't think there's a I agree about letting qemu manage the irq routing table. It changes very rarely. I just prefer to let it know about the change via If it's guaranteed by the spec that addr/data pairs are always interpreted in the same way, sure. But there no reason to do it, really, it isn't a fast path. -- error compiling committee.c: too many arguments to function --
Then seems we still need to an writing interface for it. And I think we can work
Oh, I didn't mean to handle this kind of unaligned writing in userspace. They're
illegal according to the PCI spec(otherwise the result is undefined according to
the spec). I would cover all contiguous writing(32-bit and 64-bit) in the next
version, and discard all illegal writing. And 64-bit accessing would be broken up
in qemu as you said, as they do currently.
In fact I think we can handle all data for 64-bit to qemu, because it should still
sync the mask bit with kernel, which make the maskbit writing in userspace useless
Here is the case we've observed with Xen. It can only be reproduced by large scale
testing. When the interrupt intensity is very high, even new kernels would try to
make it lower, then it would access mask bit very frequently. And in the kernel,
msi_set_mask_bit() is like this:
static void msi_set_mask_bit(struct irq_data *data, u32 flag)
{
struct msi_desc *desc = irq_data_get_msi(data);
if (desc->msi_attrib.is_msix) {
msix_mask_irq(desc, flag);
readl(desc->mask_base); /* Flush write to device */
} else {
unsigned offset = data->irq - desc->dev->irq;
msi_mask_irq(desc, 1 << offset, flag << offset);
}
}
That flush by readl() would complied with each MSI-X mask writing. That is the only
place we want to accelerate, otherwise it would still fall back to qemu each time
I am ...I'm not sure. Suppose a guest starts using pending bits. Does it mean it will not work with kernel msix acceleration? If that is the case, then we need pending bit support now. I don't want to knowingly merge something that doesn't conform to the spec, forcing So it seems we do want to accelerate reads to the entire entry. This Which case? the readl() doesn't need access to the routing table, just the entry. Oh, I think there is a terminology problem, I was talking about kvm's irq routing table, you were talking about the msix entries. I think treating it as a cache causes more problems, since there are now two paths for reads (in cache and not in cache) and more things for writes to manage. Here's my proposed API: KVM_DEFINE_MSIX_TABLE(table_id, nr_entries, msix_base_gpa, pending_bitmap_base_gpa) - called when the guest enables msix KVM_REMOVE_MSIX_TABLE(table_id) - called when the guest disables msix KVM_SET_MSIX_ENTRY(table_id, entry_id, contents) - called when the guest enables msix (to initialize it), or after live migration KVM_GET_MSIX_ENTRY(table_id, entry_id, contents) - called when the guest updates an msix table (triggered by KVM_EXIT_MSIX_ENTRY_UPDATED) KVM_RUN -> KVM_EXIT_MSIX_ENTRY_UPDATED - returned from KVM_RUN after the guest touches the first three words of an msix entry KVM_SET_MSIX_ENTRY_IRQ(table_id, entry_id, msix_irq_desc) msix_irq_desc could describe an assigned device irq, or gsi that is mapped to an msix interrupt in the kvm routing table. Michael? I think that should work for virtio and vfio assigned devices? Not sure about pending bits. -- error compiling committee.c: too many arguments to function --
One thing that read should do is flush in the outstanding I would add virtual addresses so that we can use swappable memory to store the state. If we do, maybe we can just keep the table there and then Pending bits must be tracked in kernel, but I don't see how we can support polling mode if we don't exit to userspace on pending bit reads. This does mean that some reads will be fast and some will be slow, and it's a bit sad that we seem to be optimizing for specific guests, but I just can't come up with anything better. So maybe just add an ioctl to get and to clear pending bits. --
The mask bit writes are synchronous. wrt interrupts, we can deal with assigned devices, and can poll irqfds. But we can't force vhost-net to issue an interrupt (and I don't think Still need to to let userspace know it needs to reprogram the irqfd or If the pending bits live in userspace memory, the device model can For live migration too. But if they live in memory, no need for get/set, just specify the address. -- error compiling committee.c: too many arguments to function --
To clarify: mask write read it is safe for guest to assume no more interrupts where as with a simple mask write Why do we need to reprogram irqfd? I thought irqfd would map to an entry within the table instead of address/data as now. So I think KVM_DEFINE_MSIX_TABLE should be called when msix is enabled (note: it can not be called at boot anyway since pa Note that these are updated on an interrupt, so updating them in userspace would need get_user_page etc trickery, and add the overhead of atomics. Further I think it's important to avoid the overhead of updating them all the time, and only do this when an interrupt is masked or on pending bits read. Since userspace does not know --
Don't agree. MMIO can be write regardless of if MSIX is enabled. If you handle MMIO to kernel, them handle them all. I suppose qemu still In fact qemu's accessing to MMIO should be quite rare after moving all the things to the kernel. Using IOCTL is also fine with me. And how to "do update on each read"? -- regards, --
Hmm, does this mean we would need ioctls to enable/disable MSIX? I think passing control from userspace to kernel when MSIX is enabled is Yes. So e.g. if memory is disabled in the device then we must disable this as well, and not claim these transactions current place? At the moment qemu does not notify devices when memory is disabled, only when it is enabled. When read of pending bits is detected, we could forward it up to qemu. Qemu could check internal device state and clear bits that are no longer --
Anyway we need ioctls to do so because kernel knows nothing about PCI configuration You mean bit 2 of Command register? I think suppose device would only disable memory when shut down should be acceptable. Also we can hook at disable point as Don't agree. Then you got duplicate between kernel and userspace. Also the Don't understand why we need turn to qemu when everything is ready in kernel. And pending bit is still in the scope of PCI spec. Also it's can only be written by the kernel who emulate the table. Of course, device model can read from it. -- regards --
Unfortunately, it can't I think, unless all your guests are linux. "enabled entries" is a linux kernel concept. The MSIX spec only tells you which entries are masked and which are unmasked. -- MST --
Can't understand what you are talking about, and how it related to the guest OS. I was talking about pci_enable_msix() in the host Linux. -- regards Yang, Sheng --
