Re: Mask bit support's API

Previous thread: Performance test result between virtio_pci MSI-X disable and enable by lidong chen on Monday, November 22, 2010 - 7:53 pm. (21 messages)

Next thread: [PATCH trace-cmd 0/3] kvm plugin updates by Avi Kivity on Tuesday, November 23, 2010 - 3:58 am. (5 messages)
From: Yang, Sheng
Date: Monday, November 22, 2010 - 11:09 pm

Hi Avi,

I've purposed the following API for mask bit support.

The main point is, QEmu can know which entries are enabled(by pci_enable_msix()). 
And for enabled entries, kernel own it, including MSI data/address and mask 
bit(routing table and mask bitmap). QEmu should use KVM_GET_MSIX_ENTRY ioctl to 
get them(and it can sync with them if it want to do so).

Before entries are enabled, QEmu can still use it's own MSI table(because we 
didn't contain these kind of information in kernel, and it's unnecessary for 
kernel). 

The KVM_MSIX_FLAG_ENTRY flag would be clear if QEmu want to query one entry didn't 
exist in kernel - or we can simply return -EINVAL for it.

I suppose it would be rare for QEmu to use this interface to get the context of 
entry(the only case I think is when MSI-X disable and QEmu need to sync the 
context), so performance should not be an issue.


--
regards
Yang, Sheng                                         
--

From: Avi Kivity
Date: Monday, November 22, 2010 - 11:17 pm

Why is there a need for the flag?  If we simply get/set entire entries, 
that includes the mask bits?


Isn't the mask bit in the last word?  Or maybe I'm confused about the 

Also need a new exit reason to tell userspace that an msix entry has 
changed, so userspace can update mappings.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--

From: Yang, Sheng
Date: Monday, November 22, 2010 - 11:35 pm

We still want QEmu to cover a part of entries which hasn't been enabled yet(which 
won't existed in routing table), but kernel would cover all mask bit regardless of 
if it's enabled. So QEmu can query any entry to check the maskbit, but not 

We didn't cover it here - and it's in another MMIO space(PBA). Of course we can 


I think we don't need it. Whenever userspace want to get one mapping which is an 
enabled MSI-X entry, it can check it with the API above(which is quite rare, 
because kernel would handle all of them when guest is accessing them). If it's a 
disabled entry, the context inside userspace MMIO record is the correct one(and 
only one). The only place I think QEmu need to sync is when MSI-X is about to 
disabled, QEmu need to update it's own MMIO record.

--
regards
Yang, Sheng
--

From: Avi Kivity
Date: Tuesday, November 23, 2010 - 12:54 am

Don't understand.  If we support reading/writing entire entries, that 

When an entry is masked, we need to set the pending bit for it 
somewhere.  I guess this is broken in the existing code (without your 

So in-kernel handling of mmio would be decided per entry?  I'm trying to 
simplify this, and simplest thing is - all or nothing.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--

From: Yang, Sheng
Date: Tuesday, November 23, 2010 - 1:30 am

Even with my patch, we didn't support the pending bit. It would always return 0 
now. What we supposed to do(after my patch checked in) is to check IRQ_PENDING flag 
of irq_desc->status(if the entry is masked), and return the result to userspace.

That would involve some core change, like to export irq_to_desc(). I don't think 

So you would like to handle all MSI-X MMIO in kernel?

--
regards
Yang, Sheng
--

From: Avi Kivity
Date: Tuesday, November 23, 2010 - 5:47 am

The API needs to be compatible with the pending bit, even if we don't 

Yes.  Writes to address or data would be handled by:
- recording it into the shadow msix table
- notifying userspace that msix entry x changed
Reads would be handled in kernel from the shadow msix table.

So instead of

- guest reads/writes msix
- kvm filters mmio, implements some, passes others to userspace

we have

- guest reads/writes msix
- kvm implements all
- some writes generate an additional notification to userspace


-- 
error compiling committee.c: too many arguments to function

--

From: Michael S. Tsirkin
Date: Tuesday, November 23, 2010 - 5:56 am

One small proposal in addition: since all accesses are done from guest
anyway, the shadow table can/should be stored using userspace memory,
reducing the kernel memory overhead of the feature from up to 4K per
MSIX table to just 8 bytes.

--

From: Yang, Sheng
Date: Tuesday, November 23, 2010 - 6:57 am

I am not sure about it. But I suppose the structure should be the same? In fact 
it's pretty hard for me to image what's needed for virtio in the future, 
especially there is no such code now. I really prefer to deal with assigned device 
and virtio separately, which would make the work much easier. But seems you won't 

This can be implemented by this API, just adding a flag for it. And I would still 

I suppose we don't need to generate notification to userspace? Because every 
read/write is handled by kernel, and userspace just need interface to kernel to 
get/set the entry - and well, does userspace need to do it when kernel can handle 
all of them? Maybe not...

--
regards
Yang, Sheng
--

From: Avi Kivity
Date: Tuesday, November 23, 2010 - 7:06 am

First, I don't really see why the two cases are different (but I don't 
do a lot in this space).  Surely between you and Michael, you have all 
the information?

Second, my worry is a huge number of ABI variants that come from 
incrementally adding features.  I want to implement bigger chunks of 
functionality.  So I'd like to see all potential users addressed, at 


We could have the kernel handle addr/data writes by setting up an 
internal interrupt routing.  A disadvantage is that more work is needed 
if we emulator interrupt remapping in qemu.

-- 
error compiling committee.c: too many arguments to function

--

From: Michael S. Tsirkin
Date: Tuesday, November 23, 2010 - 8:11 am

As an alternative, interrupt remapping will need some API rework, right?
Existing APIs only pass address/data for msi.

-- 
MST
--

From: Gleb Natapov
Date: Tuesday, November 23, 2010 - 8:24 am

IIRC interrupt remapping works with address/data to. It just interpret
it differently from apic.

--
			Gleb.
--

From: Michael S. Tsirkin
Date: Tuesday, November 23, 2010 - 9:10 am

Yes. So since our APIs use address/data, this is an argument for doing
--

From: Yang, Sheng
Date: Tuesday, November 23, 2010 - 6:59 pm

Of course KVM should service reading from pending bitmask. For assigned device, 
it's kernel who would set the pending bit; but I am not sure for virtio. This 

In fact modifying irq routing in the kernel is also the thing I want to avoid.

So, the flow would be:

kernel get MMIO write, record it in it's own MSI table
KVM exit to QEmu, by one specific exit reason
QEmu know it have to sync the MSI table, then reading the entries from kernel
QEmu found it's an write, so it need to reprogram irq routing table using the 
entries above
done

But wait, why should qemu read entries from kernel? By default exit we already 
have the information about what's the entry to modify and what to write, so we can 
use them directly. By this way, we also don't need an specific exit reason - just 
exit to qemu in normal way is fine.

Then it would be:

kernel get MMIO write, record it in it's own MSI table
KVM exit to QEmu, indicate MMIO exit
QEmu found it's an write, it would update it's own MSI table(may need to query 
mask bit from kernel), and reprogram irq routing table using the entries above
done

Then why should kernel kept it's own MSI table? I think the only reason is we can 
speed up reading in that way - but the reading we want to speed up is mostly on 
enabled entry(the first entry), which is already in the IRQ routing table... 

And for enabled/disabled entry, you can see it like this: for the entries inside 
routing table, we think it's enabled; otherwise it's disabled. Then you don't need 
to bothered by pci_enable_msix().

So our strategy for reading accelerating can be:

If the entry contained in irq routing table, then use it; otherwise let qemu deal 
with it. Because it's the QEmu who owned irq routing table, the synchronization is 
guaranteed. We don't need the MSI table in the kernel then.

And for writing, we just want to cover all of mask bit, but none of others.

I think the concept here is more acceptable?

The issue here is MSI table and irq routing table ...
From: Yang, Sheng
Date: Thursday, November 25, 2010 - 7:35 pm

Avi?

And BTW, we can take routing table as a kind of *cache*, if the content is in the 
cache, then we can fetch it from the cache, otherwise we need to go back to fetch 
it from memory(userspace). 

--
regards
--

From: Avi Kivity
Date: Tuesday, November 30, 2010 - 7:15 am

The kernel should manage it in the same way.  Virtio raises irq (via 
KVM_IRQ_LINE or vhost-net's irqfd), kernel sets pending bit.

Note we need to be able to read and write the pending bitmask for live 

Because we have an interface where you get an exit if (addr % 4) < 3 and 
don't get an exit if (addr % 4) == 3.  There is a gpa range which is 
partially maintained by the kernel and partially in userspace.  It's a 
confusing interface.  Things like 64-bit reads or writes need to be 
broken up and serviced in two different places.

We already need to support this (for unaligned writes which hit two 

The reason is to keep a sane interface.  Like we emulate instructions 
and msrs in the kernel and don't do half a job.  I don't think there's a 

I agree about letting qemu manage the irq routing table.  It changes 
very rarely.  I just prefer to let it know about the change via 


If it's guaranteed by the spec that addr/data pairs are always 
interpreted in the same way, sure.  But there no reason to do it, 
really, it isn't a fast path.

-- 
error compiling committee.c: too many arguments to function

--

From: Yang, Sheng
Date: Tuesday, November 30, 2010 - 7:36 pm

Then seems we still need to an writing interface for it. And I think we can work 

Oh, I didn't mean to handle this kind of unaligned writing in userspace. They're 
illegal according to the PCI spec(otherwise the result is undefined according to 
the spec). I would cover all contiguous writing(32-bit and 64-bit) in the next 
version, and discard all illegal writing. And 64-bit accessing would be broken up 
in qemu as you said, as they do currently. 

In fact I think we can handle all data for 64-bit to qemu, because it should still 
sync the mask bit with kernel, which make the maskbit writing in userspace useless 

Here is the case we've observed with Xen. It can only be reproduced by large scale 
testing. When the interrupt intensity is very high, even new kernels would try to 
make it lower, then it would access mask bit very frequently. And in the kernel, 
msi_set_mask_bit() is like this:

static void msi_set_mask_bit(struct irq_data *data, u32 flag)               
{                                                                           
        struct msi_desc *desc = irq_data_get_msi(data);                     
                                                                            
        if (desc->msi_attrib.is_msix) {                                     
                msix_mask_irq(desc, flag);                                  
                readl(desc->mask_base);         /* Flush write to device */ 
        } else {                                                            
                unsigned offset = data->irq - desc->dev->irq;               
                msi_mask_irq(desc, 1 << offset, flag << offset);            
        }                                                                   
}                                                                           

That flush by readl() would complied with each MSI-X mask writing. That is the only 
place we want to accelerate, otherwise it would still fall back to qemu each time 


I am ...
From: Avi Kivity
Date: Thursday, December 2, 2010 - 6:09 am

I'm not sure.  Suppose a guest starts using pending bits.  Does it mean 
it will not work with kernel msix acceleration?

If that is the case, then we need pending bit support now.  I don't want 
to knowingly merge something that doesn't conform to the spec, forcing 


So it seems we do want to accelerate reads to the entire entry.  This 

Which case?  the readl() doesn't need access to the routing table, just 
the entry.

Oh, I think there is a terminology problem, I was talking about kvm's 
irq routing table, you were talking about the msix entries.

I think treating it as a cache causes more problems, since there are now 
two paths for reads (in cache and not in cache) and more things for 
writes to manage.

Here's my proposed API:

KVM_DEFINE_MSIX_TABLE(table_id, nr_entries, msix_base_gpa, 
pending_bitmap_base_gpa)

  - called when the guest enables msix

KVM_REMOVE_MSIX_TABLE(table_id)

   - called when the guest disables msix

KVM_SET_MSIX_ENTRY(table_id, entry_id, contents)

   - called when the guest enables msix (to initialize it), or after 
live migration

KVM_GET_MSIX_ENTRY(table_id, entry_id, contents)

   - called when the guest updates an msix table (triggered by 
KVM_EXIT_MSIX_ENTRY_UPDATED)

KVM_RUN -> KVM_EXIT_MSIX_ENTRY_UPDATED

   - returned from KVM_RUN after the guest touches the first three words 
of an msix entry

KVM_SET_MSIX_ENTRY_IRQ(table_id, entry_id, msix_irq_desc)

msix_irq_desc could describe an assigned device irq, or gsi that is 
mapped to an msix interrupt in the kvm routing table.

Michael?  I think that should work for virtio and vfio assigned 
devices?  Not sure about pending bits.

-- 
error compiling committee.c: too many arguments to function

--

From: Michael S. Tsirkin
Date: Thursday, December 2, 2010 - 6:47 am

One thing that read should do is flush in the outstanding

I would add virtual addresses so that we can use swappable memory to
store the state.
If we do, maybe we can just keep the table there and then


Pending bits must be tracked in kernel, but I don't see
how we can support polling mode if we don't exit to userspace
on pending bit reads.

This does mean that some reads will be fast and some will be
slow, and it's a bit sad that we seem to be optimizing
for specific guests, but I just can't come up with
anything better.

So maybe just add an ioctl to get and to clear pending bits.
--

From: Avi Kivity
Date: Thursday, December 2, 2010 - 6:56 am

The mask bit writes are synchronous.

wrt interrupts, we can deal with assigned devices, and can poll irqfds.  
But we can't force vhost-net to issue an interrupt (and I don't think 


Still need to to let userspace know it needs to reprogram the irqfd or 


If the pending bits live in userspace memory, the device model can 

For live migration too.  But if they live in memory, no need for 
get/set, just specify the address.

-- 
error compiling committee.c: too many arguments to function

--

From: Michael S. Tsirkin
Date: Thursday, December 2, 2010 - 7:26 am

To clarify:

	mask write
	read

it is safe for guest to assume no more interrupts

where as with a simple
	mask write


Why do we need to reprogram irqfd?  I thought irqfd would map to an
entry within the table instead of address/data as now.

So I think KVM_DEFINE_MSIX_TABLE should be called when msix is
enabled (note: it can not be called at boot anyway since pa

Note that these are updated on an interrupt, so updating them
in userspace would need get_user_page etc trickery,
and add the overhead of atomics.

Further I think it's important to avoid the overhead of updating them
all the time, and only do this when an interrupt is
masked or on pending bits read. Since userspace does not know
--

From: Sheng Yang
Date: Thursday, December 2, 2010 - 7:54 am

Don't agree. MMIO can be write regardless of if MSIX is enabled. If
you handle MMIO to kernel, them handle them all. I suppose qemu still

In fact qemu's accessing to MMIO should be quite rare after moving all
the things to the kernel. Using IOCTL is also fine with me.

And how to "do update on each read"?

-- 
regards,
--

From: Michael S. Tsirkin
Date: Thursday, December 2, 2010 - 9:55 am

Hmm, does this mean we would need ioctls to enable/disable MSIX?  I
think passing control from userspace to kernel when MSIX is enabled is

Yes. So e.g. if memory is disabled in the device then we must
disable this as well, and not claim these transactions

current place?

At the moment qemu does not notify devices when
memory is disabled, only when it is enabled.



When read of pending bits is detected, we could forward it up to qemu.
Qemu could check internal device state and clear bits that are no longer
--

From: Yang, Sheng
Date: Thursday, December 2, 2010 - 8:03 pm

Anyway we need ioctls to do so because kernel knows nothing about PCI configuration 


You mean bit 2 of Command register? I think suppose device would only disable 
memory when shut down should be acceptable. Also we can hook at disable point as 

Don't agree. Then you got duplicate between kernel and userspace. Also the 

Don't understand why we need turn to qemu when everything is ready in kernel. And 
pending bit is still in the scope of PCI spec. Also it's can only be written by 
the kernel who emulate the table. Of course, device model can read from it.

--
regards
--

From: Michael S. Tsirkin
Date: Tuesday, November 23, 2010 - 5:04 am

Unfortunately, it can't I think, unless all your guests are linux.
"enabled entries" is a linux kernel concept.
The MSIX spec only tells you which entries are masked and which are unmasked.

-- 
MST
--

From: Yang, Sheng
Date: Tuesday, November 23, 2010 - 7:02 am

Can't understand what you are talking about, and how it related to the guest OS. I 
was talking about pci_enable_msix() in the host Linux.

--
regards
Yang, Sheng
--

Previous thread: Performance test result between virtio_pci MSI-X disable and enable by lidong chen on Monday, November 22, 2010 - 7:53 pm. (21 messages)

Next thread: [PATCH trace-cmd 0/3] kvm plugin updates by Avi Kivity on Tuesday, November 23, 2010 - 3:58 am. (5 messages)