Re: [RFC] Unify KVM kernel-space and user-space code into a single project

From: Zhang, Yanmin
Date: Monday, March 15, 2010 - 10:27 pm

From: Zhang, Yanmin <yanmin_zhang@linux.intel.com>

Based on the discussion in the KVM community, I worked out a patch that lets
perf collect guest OS statistics from the host side. This patch was implemented
with kind help from Ingo, Peter and some others. Yang Sheng pointed out a
critical bug and, along with others, provided good suggestions. I really
appreciate their kind help.

The patch adds a new subcommand, kvm, to perf.

  perf kvm top
  perf kvm record
  perf kvm report
  perf kvm diff

The new perf can profile the guest OS kernel but not guest OS user space; it
can, however, summarize guest OS user-space utilization per guest OS.

Below are some examples.
1) perf kvm top
[root@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms \
    --guestmodules=/home/ymzhang/guest/modules top

--------------------------------------------------------------------------------------------------------------------------
   PerfTop:   16010 irqs/sec  kernel:59.1% us: 1.5% guest kernel:31.9% guest us: 7.5% exact:  0.0% [1000Hz cycles],  (all, 16 CPUs)
--------------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                  DSO
             _______ _____ _________________________ _______________________

            38770.00 20.4% __ticket_spin_lock        [guest.kernel.kallsyms]
            22560.00 11.9% ftrace_likely_update      [kernel.kallsyms]
             9208.00  4.8% __lock_acquire            [kernel.kallsyms]
             5473.00  2.9% trace_hardirqs_off_caller [kernel.kallsyms]
             5222.00  2.7% copy_user_generic_string  [guest.kernel.kallsyms]
             4450.00  2.3% validate_chain            [kernel.kallsyms]
             4262.00  2.2% trace_hardirqs_on_caller  [kernel.kallsyms]
             4239.00  2.2% do_raw_spin_lock          [kernel.kallsyms]
             3548.00  1.9% do_raw_spin_unlock        [kernel.kallsyms]
             2487.00  1.3% ...
--

From: Avi Kivity
Date: Monday, March 15, 2010 - 10:41 pm

Excellent, support for guest kernel != host kernel is critical (I can't 
remember the last time I ran same kernels).

How would we support multiple guests with different kernels?  Perhaps a 
symbol server that perf can connect to (and that would connect to guests 


Should be in common code, not vmx specific.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Ingo Molnar
Date: Tuesday, March 16, 2010 - 12:24 am

The highest quality solution would be if KVM offered a 'guest extension' to 
the guest kernel's /proc/kallsyms that made it easy for user-space to get this 
information from an authoritative source.

That's the main reason why the host side /proc/kallsyms is so popular and so 
useful: while in theory it's mostly redundant information which can be gleaned 
from the System.map and other sources of symbol information, it's easily 
available and is _always_ trustable to come from the host kernel.

Separate System.map's have a tendency to go out of sync (or go missing when a 
devel kernel gets rebuilt, or if a devel package is not installed), and server 
ports (be that a TCP port space server or a UDP port space mount-point) are 
both a configuration hassle and are not guest-transparent.

So for instrumentation infrastructure (such as perf) we have a large and well 
founded preference for intrinsic, built-in, kernel-provided information: i.e. 
a largely 'built-in' and transparent mechanism to get to guest symbols.

Thanks,

	Ingo
--
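
Illustration (not part of the original thread): /proc/kallsyms, as discussed
above, lists one "address type name [module]" entry per line. The following is
a minimal sketch of how a profiler might map a sampled address back to a
symbol. The helper names parse_kallsyms and resolve and the sample addresses
are hypothetical; this is not the code perf itself uses.

```python
import bisect

def parse_kallsyms(text):
    """Parse kallsyms-style text into a sorted list of (address, name, module)."""
    symbols = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        addr = int(fields[0], 16)
        name = fields[2]
        # Module symbols carry a trailing "[module]" field.
        module = fields[3].strip("[]") if len(fields) > 3 else None
        symbols.append((addr, name, module))
    symbols.sort(key=lambda s: s[0])
    return symbols

def resolve(symbols, addr):
    """Return (name, offset) of the closest symbol at or below addr, or None."""
    addrs = [s[0] for s in symbols]
    i = bisect.bisect_right(addrs, addr) - 1
    if i < 0:
        return None
    base, name, _module = symbols[i]
    return name, addr - base

# Made-up sample data in the kallsyms line format (addresses are illustrative).
SAMPLE = """\
ffffffff81000000 T _text
ffffffff8102b4f0 T __ticket_spin_lock
ffffffff8102b520 T copy_user_generic_string
ffffffffa0001000 t e1000_intr [e1000]
"""

symbols = parse_kallsyms(SAMPLE)
print(resolve(symbols, 0xffffffff8102b4f8))  # ('__ticket_spin_lock', 8)
```

The same lookup works for a guest if the guest's kallsyms text is obtained by
some other route, which is exactly the part the thread is debating.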

From: Avi Kivity
Date: Tuesday, March 16, 2010 - 2:20 am

The symbol server's client can certainly access the bits through vmchannel.

-- 
error compiling committee.c: too many arguments to function

--

From: Ingo Molnar
Date: Tuesday, March 16, 2010 - 2:53 am

Ok, that would work i suspect.

Would be nice to have the symbol server in tools/perf/ and also make it easy 
to add it to the initrd via a .config switch or so.

That would have basically all of the advantages of being built into the kernel 
(availability, configurability, transparency, hackability), while having all 
the advantages of a user-space approach as well (flexibility, extensibility, 
robustness, ease of maintenance, etc.).

If only we had tools/xorg/ integrated via the initrd that way ;-)

Thanks,

	Ingo
--

From: Avi Kivity
Date: Tuesday, March 16, 2010 - 3:13 am

Note, I am not advocating building the vmchannel client into the host 
kernel.  While that makes everything simpler for the user, it increases 
the kernel footprint with all the disadvantages that come with that (any 
bug is converted into a host DoS or worse).

So, perf would connect to qemu via (say) a well-known unix domain 
socket, which would then talk to the guest kernel.

I know you won't like it, we'll continue to disagree on this unfortunately.

-- 
error compiling committee.c: too many arguments to function

--

From: Ingo Molnar
Date: Tuesday, March 16, 2010 - 3:20 am

Neither am i. What i suggested was a user-space binary/executable built in 
tools/perf and put into the initrd.

That approach has the advantages i listed above, without having the 
disadvantages of in-kernel code you listed.

Thanks,

	Ingo
--

From: Avi Kivity
Date: Tuesday, March 16, 2010 - 3:40 am

I'm confused - initrd seems to be guest-side.  I was talking about the 
host side.

For the guest, placing the symbol server in tools/ is reasonable.

-- 
error compiling committee.c: too many arguments to function

--

From: Ingo Molnar
Date: Tuesday, March 16, 2010 - 3:50 am

host side doesnt need much support - just some client capability in perf 
itself. I suspect vmchannels are sufficiently flexible and configuration-free 
for such purposes? (i.e. like a filesystem in essence)

	Ingo
--

From: Avi Kivity
Date: Tuesday, March 16, 2010 - 4:10 am

I haven't followed vmchannel closely, but I think it is.  vmchannel is 
terminated in qemu on the host side, not in the host kernel.  So perf 
would need to connect to qemu.

-- 
error compiling committee.c: too many arguments to function

--

From: Ingo Molnar
Date: Tuesday, March 16, 2010 - 4:25 am

Hm, that sounds rather messy if we want to use it to basically expose kernel 
functionality in a guest/host unified way. Is the qemu process discoverable in 
some secure way? Can we trust it? Is there some proper tooling available to do 
it, or do we have to push it through 2-3 packages to get such a useful feature 
done?

( That is the general thought process how many cross-discipline useful
  desktop/server features hit the bit bucket before having had any chance of
  being vetted by users, and why Linux sucks so much when it comes to feature
  integration and application usability. )

	Ingo
--

From: Avi Kivity
Date: Tuesday, March 16, 2010 - 5:21 am

libvirt manages qemu processes, but I don't think this should go through 
libvirt.  qemu can do this directly by opening a unix domain socket in a 

You can't solve everything in the kernel, even with a well populated tools/.

-- 
error compiling committee.c: too many arguments to function

--

From: Ingo Molnar
Date: Tuesday, March 16, 2010 - 5:29 am

How do i get a list of all 'guest instance PIDs', and what is the way to talk 

I mean, i can trust a kernel service and i can trust /proc/kallsyms.

Can perf trust a random process claiming to be Qemu? What's the trust 

So Qemu has never run into such problems before?

( Sounds weird - i think Qemu configuration itself should be done via a 

Certainly not, but this is a technical problem in the kernel's domain, so it's 
a fair (and natural) expectation to be able to solve this within the kernel 
project.

	Ingo
--

From: Avi Kivity
Date: Tuesday, March 16, 2010 - 5:41 am

Libvirt manages all qemus, but this should be implemented independently 

In general qemu exposes communication channels (such as the monitor) as 

Obviously you can't trust anything you get from a guest, no matter how 
you get it.

How do you trust a userspace program's symbols?  you don't.  How do you 

That's exactly what happens.  You invoke qemu with -monitor 
unix:blah,server (or -qmp for a machine-readable format) and have your 
management application connect to that.  You can redirect guest serial 
ports, console, parallel port, etc. to unix-domain or tcp sockets.  

Someone writing perf-gui outside the kernel would have the same 
problems, no?

-- 
error compiling committee.c: too many arguments to function

--
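
Illustration (not part of the original thread): a rough sketch of what "have
your management application connect to that" could look like for a QMP socket.
It assumes the usual QMP flow (the server sends a greeting, the client sends
qmp_capabilities, then issues commands) and a QEMU whose QMP offers
query-name. The function names are hypothetical, and fake_qemu is only a
stand-in so the sketch runs without a real guest.

```python
import json
import socket
import threading

def qmp_query_name(sock):
    """Do the QMP handshake on a connected socket and return the guest name."""
    reader = sock.makefile("r", encoding="utf-8")
    writer = sock.makefile("w", encoding="utf-8")

    def call(command):
        writer.write(json.dumps({"execute": command}) + "\n")
        writer.flush()
        while True:
            reply = json.loads(reader.readline())
            if "event" in reply:
                continue  # asynchronous events may be interleaved; skip them
            if "error" in reply:
                raise RuntimeError(reply["error"])
            return reply["return"]

    greeting = json.loads(reader.readline())
    if "QMP" not in greeting:
        raise RuntimeError("peer did not send a QMP greeting")
    call("qmp_capabilities")  # leave capabilities-negotiation mode
    return call("query-name").get("name")

def fake_qemu(sock, name):
    """Stand-in for QEMU's end of the socket (demonstration only)."""
    reader = sock.makefile("r", encoding="utf-8")
    writer = sock.makefile("w", encoding="utf-8")
    writer.write(json.dumps({"QMP": {"version": {}, "capabilities": []}}) + "\n")
    writer.flush()
    for line in reader:
        request = json.loads(line)
        result = {"name": name} if request["execute"] == "query-name" else {}
        writer.write(json.dumps({"return": result}) + "\n")
        writer.flush()

client, server = socket.socketpair()
threading.Thread(target=fake_qemu, args=(server, "Fedora"), daemon=True).start()
print(qmp_query_name(client))  # Fedora
```

Against a real guest one would instead create a socket.AF_UNIX socket and
connect it to whatever path was given to qemu on the command line. Note that
nothing here answers the trust question raised in the thread: the client
believes whatever the peer on the socket says.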

From: Ingo Molnar
Date: Tuesday, March 16, 2010 - 6:08 am

I'm not talking about the symbol strings and addresses, and the object 
contents for allocation (or debuginfo). I'm talking about the basic protocol 
of establishing which guest is which.

I.e. we really want users to be able to:

 1) have it all working with a single guest, without having to specify 'which' 
    guest (qemu PID) to work with. That is the dominant usecase both for 
    developers and for a fair portion of testers.

 2) Have some reasonable symbolic identification for guests. For example a 
    usable approach would be to have 'perf kvm list', which would list all 
    currently active guests:

     $ perf kvm list
       [1] Fedora
       [2] OpenSuse
       [3] Windows-XP
       [4] Windows-7

    And from that point on 'perf kvm -g OpenSuse record' would do the obvious 
    thing. Users will be able to just use the 'OpenSuse' symbolic name for 
    that guest, even if the guest got restarted and switched its main PID.

Any such facility needs trusted enumeration and a protocol where i can trust 
that the information i got is authoritative. (I.e. 'OpenSuse' truly matches to 
the OpenSuse session - not to some local user starting up a Qemu instance that 
claims to be 'OpenSuse'.)

Is such a scheme possible/available? I suspect all the KVM configuration tools 
(i havent used them in some time - gui and command-line tools alike) use 
similar methods to ease guest management?

	Ingo
--
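
Illustration (not part of the original thread): a toy sketch of the
enumeration problem described above. It assumes guests are identified by
qemu's -name option and that the caller has already collected (pid, uid, argv)
tuples for qemu processes, e.g. from /proc. The trusted_uids policy and both
function names are hypothetical; no such 'perf kvm list' logic is claimed to
exist.

```python
from collections import defaultdict

def guest_name(argv):
    """Return the value following '-name' in a qemu command line, or None."""
    for i, arg in enumerate(argv[:-1]):
        if arg == "-name":
            return argv[i + 1]
    return None

def list_guests(processes, trusted_uids):
    """Group qemu processes by guest name.

    processes: iterable of (pid, uid, argv) tuples.
    Returns (guests, rejected): guests maps name -> pids owned by trusted
    uids; rejected lists pids whose name claim is ignored.
    """
    guests = defaultdict(list)
    rejected = []
    for pid, uid, argv in processes:
        name = guest_name(argv)
        if name is None:
            continue  # started without -name; cannot be listed symbolically
        if uid in trusted_uids:
            guests[name].append(pid)
        else:
            rejected.append(pid)
    return dict(guests), rejected

# Made-up process table for demonstration.
PROCESSES = [
    (4101, 0,    ["qemu", "-name", "Fedora", "fedora.img"]),
    (4102, 0,    ["qemu", "-name", "OpenSuse", "suse.img"]),
    (4200, 1000, ["qemu", "-name", "OpenSuse", "other.img"]),  # same name, other user
    (4300, 0,    ["qemu", "foo.img"]),                         # unnamed guest
]

guests, rejected = list_guests(PROCESSES, trusted_uids={0})
print(guests)    # {'Fedora': [4101], 'OpenSuse': [4102]}
print(rejected)  # [4200]
```

The sketch makes the design question visible: a symbolic name is only as
trustworthy as the rule that decides whose claim to that name is accepted.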

From: Avi Kivity
Date: Tuesday, March 16, 2010 - 6:16 am

There is none.  So far, qemu only dealt with managing just its own 
guest, and left all multiple guest management to higher levels up the 


You can do that through libvirt, but that only works for guests started 
through libvirt.  libvirt provides command-line tools to list and manage 
guests (for example autostarting them on startup), and tools built on 
top of libvirt can manage guests graphically.

Looks like we have a layer inversion here.  Maybe we need a plugin 
system - libvirt drops a .so into perf that teaches it how to list 
guests and get their symbols.

-- 
error compiling committee.c: too many arguments to function

--

From: Ingo Molnar
Date: Tuesday, March 16, 2010 - 6:31 am

IMO such ease of use is reasonable and required, full stop.

If it cannot be gotten simply then that's a bug: either in the code, or in the 
design, or in the development process that led to the design. Bugs need 

Is libvirt used to start up all KVM guests? If not, if it's only used on some 
distros while other distros have other solutions then there's apparently no 
good way to get to such information, and the kernel bits of KVM do not provide 
it.

To the user (and to me) this looks like a KVM bug / missing feature. (and the 
user doesnt care where the blame is) If that is true then apparently the 
current KVM design has no technically actionable solution for certain 
categories of features!

	Ingo
--

From: Avi Kivity
Date: Tuesday, March 16, 2010 - 6:37 am

Developers tend to start qemu from the command line, but the majority of 
users and all distros I know of use libvirt.  Some users cobble up their 

A plugin system allows anyone who is interested to provide the 
information; they just need to write a plugin for their management tool.

Since we can't prevent people from writing management tools, I don't see 
what else we can do.

-- 
error compiling committee.c: too many arguments to function

--

From: Frank Ch. Eigler
Date: Tuesday, March 16, 2010 - 8:06 am

Perhaps the fact that kvm happens to deal with an interesting
application area (virtualization) is misleading here.  As far as the
host kernel or other host userspace is concerned, qemu is just some
random unprivileged userspace program (with some *optional* /dev/kvm
services that might happen to require temporary root).

As such, perf trying to instrument qemu is no different than perf
trying to instrument any other userspace widget.  Therefore, expecting
'trusted enumeration' of instances is just as sensible as using
'trusted ps' and 'trusted /var/run/FOO.pid files'.


- FChE
--

From: Ingo Molnar
Date: Tuesday, March 16, 2010 - 8:52 am

You are quite mistaken: KVM isnt really a 'random unprivileged application' in 
this context, it is clearly an extension of system/kernel services.

( Which can be seen from the simple fact that what started the discussion was 
  'how do we get /proc/kallsyms from the guest'. I.e. an extension of the 
  existing host-space /proc/kallsyms was desired. )

In that sense the most natural 'extension' would be the solution i mentioned a 
week or two ago: to have a (read only) mount of all guest filesystems, plus a 
channel for profiling/tracing data. That would make symbol parsing easier and 
it's what extends the existing 'host space' abstraction in the most natural 
way.

( It doesnt even have to be done via the kernel - Qemu could implement that
  via FUSE for example. )

As a second best option a 'symbol server' might be used too.

Thanks,

	Ingo
--

From: Frank Ch. Eigler
Date: Tuesday, March 16, 2010 - 9:08 am

Hi -


I don't know what "extension of system/kernel services" means in this
context, beyond something running on the system/kernel, like every
other process.  To clarify, to what extent do you consider your
classification similarly clear for a host that is running

* multiple kvm instances run as unprivileged users
* non-kvm OS simulators such as vmware or xen or gdb

(Sorry, that smacks of circular reasoning.)

It may be a charming convenience function for perf users to give them
shortcuts for certain favoured configurations (kvm running freshest
linux), but that says more about perf than kvm.


- FChE
--

From: Ingo Molnar
Date: Tuesday, March 16, 2010 - 9:35 am

It means something like my example of 'extended to guest space' 

To me it sounds like an example supporting my point. /proc/kallsyms is a 
service by the kernel, and 'perf kvm' desires this to be extended to guest 
space as well.

Thanks,

	Ingo
--

From: Anthony Liguori
Date: Tuesday, March 16, 2010 - 10:34 am

Random tools (like perf) should not be able to do what you describe.  
It's a security nightmare.

If it's desirable to have /proc/kallsyms available, we can expose an 
interface in QEMU to provide that.  That can then be plumbed through 
libvirt and QMP.

Then a management tool can use libvirt or QMP to obtain that information 

No way.  The guest has sensitive data and exposing it widely on the host 
is a bad thing to do.  It's a bad interface.  We can expose specific 
information about guests but only through our existing channels which 
are validated through a security infrastructure.

Ultimately, your goal is to keep perf a simple tool with little 
dependencies.  But practically speaking, if you want to add features to 
it, it's going to have to interact with other subsystems in the 
appropriate way.  That means, it's going to need to interact with 
libvirt or QMP.

If you want all applications to expose their data via synthetic file 
systems, then there's always plan9 :-)

Regards,

Anthony Liguori
--

From: Ingo Molnar
Date: Tuesday, March 16, 2010 - 10:52 am

A security nightmare exactly how? Mind to go into details as i dont understand 

Firstly, you are putting words into my mouth, as i said nothing about 
'exposing it widely'. I suggest exposing it under the privileges of whoever 
has access to the guest image.

Secondly, regarding confidentiality, and this is guest security 101: whoever 
can access the image on the host _already_ has access to all the guest data!

A Linux image can generally be loopback mounted straight away:

  losetup -o 32256 /dev/loop0 ./guest-image.img
  mount -o ro /dev/loop0 /mnt-guest

(Or, if you are an unprivileged user who cannot mount, it can be read via ext2 
tools.)

There's nothing the guest can do about that. The host is in total control of 
guest image data for heaven's sake!

All i'm suggesting is to make what is already possible more convenient.

	Ingo
--
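
Illustration (not part of the original thread): the 32256 offset in the
losetup example above is 63 sectors of 512 bytes, the traditional start of
the first partition on an MBR-partitioned disk. A minimal sketch of deriving
such offsets from the partition table follows; it assumes a classic MBR
layout with 512-byte sectors, and partition_offsets and make_mbr are
hypothetical helper names.

```python
import struct

SECTOR_SIZE = 512             # bytes per sector (assumed)
PARTITION_TABLE_OFFSET = 446  # first partition entry within the MBR
ENTRY_SIZE = 16               # bytes per partition entry

def partition_offsets(mbr):
    """Return byte offsets of the non-empty primary partitions in an MBR."""
    if len(mbr) < 512 or mbr[510:512] != b"\x55\xaa":
        raise ValueError("not a valid MBR sector")
    offsets = []
    for i in range(4):
        start = PARTITION_TABLE_OFFSET + i * ENTRY_SIZE
        entry = mbr[start:start + ENTRY_SIZE]
        # Bytes 8..11 of an entry hold the starting LBA, little-endian.
        (start_lba,) = struct.unpack_from("<I", entry, 8)
        if entry[4] != 0:  # partition type 0 means "unused"
            offsets.append(start_lba * SECTOR_SIZE)
    return offsets

def make_mbr(start_lba, ptype=0x83):
    """Build a synthetic MBR with one partition (demonstration only)."""
    sector = bytearray(512)
    entry = bytearray(ENTRY_SIZE)
    entry[4] = ptype                         # 0x83 = Linux filesystem
    struct.pack_into("<I", entry, 8, start_lba)
    struct.pack_into("<I", entry, 12, 1000)  # sector count (arbitrary)
    sector[446:462] = entry
    sector[510:512] = b"\x55\xaa"            # boot signature
    return bytes(sector)

print(partition_offsets(make_mbr(63)))  # [32256]
```

For a real image, the first 512 bytes of the image file would be passed in
instead of the synthetic sector, and the resulting offset handed to
losetup -o.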

From: Anthony Liguori
Date: Tuesday, March 16, 2010 - 11:06 am

Assume you're using SELinux to implement mandatory access control.  How 
do you label this file system?

Generally speaking, we don't know the difference between /proc/kallsyms 
vs. /dev/mem if we do generic passthrough.  While it might be safe to 
have a relaxed label of kallsyms (since it's read only), it's clearly 
not safe to do that for /dev/mem, /etc/shadow, or any file containing 
sensitive information.

Rather, we ought to expose a higher level interface that we have more 
confidence in with respect to understanding the ramifications of 

That doesn't work as nicely with SELinux.

It's completely reasonable to have a user that can interact in a read 
only mode with a VM via libvirt but cannot read the guest's disk images 

It's not that simple in a MAC environment.

Regards,


--

From: Ingo Molnar
Date: Tuesday, March 16, 2010 - 11:28 am

What's your _point_? Please outline a threat model, a vector of attack, 

Exactly, we want something that has a flexible namespace and works well with 
Linux tools in general. Preferably that namespace should be human readable, 
and it should be hierarchic, and it should have a well-known permission model.


If a user cannot read the image file then the user has no access to its 
contents via other namespaces either. That is, of course, a basic security 
aspect.

( That is perfectly true with a non-SELinux Unix permission model as well, and

Erm. Please explain to me, what exactly is 'not that simple' in a MAC 
environment?

Also, i'd like to note that the 'restrictive SELinux setups' usecases are 
pretty secondary.

To demonstrate that, i'd like every KVM developer on this list who reads this 
mail and who has their home development system where they produce their 
patches set up in a restrictive MAC environment, in that you cannot even read 
the images you are using, to chime in with a "I'm doing that" reply.

If there's just a _single_ KVM developer amongst dozens and dozens of 
developers on this list who develops in an environment like that i'd be 
surprised. That result should pretty much tell you where the weight of 
instrumentation focus should lie - and it isnt on restrictive MAC environments 
...

	Ingo
--

From: Anthony Liguori
Date: Tuesday, March 16, 2010 - 4:04 pm

You suggested "to have a (read only) mount of all guest filesystems".

As I described earlier, not all of the information within the guest 
filesystem has the same level of sensitivity.  If you exposed a generic 
interface like this, it makes it very difficult to delegate privileges.

Delegating privileges is important because in a higher security 
environment, you may want to prevent a management tool from accessing 
the VM's disk directly, but still allow it to do basic operations (in 

If you want to use a synthetic filesystem as the management interface 
for qemu, that's one thing.  But you suggested exposing the guest 

I don't think that's reasonable at all.  The guest may encrypt its disk 

My home system doesn't run SELinux but I work daily with systems that 
are using SELinux.

I want to be able to run tools like perf on these systems because 
ultimately, I need to debug these systems on a daily basis.

But that's missing the point.  We want to have an interface that works 
for both cases so that we're not maintaining two separate interfaces.

We've rat holed a bit though.  You want:

1) to run perf kvm list and be able to enumerate KVM guests

2) for this to Just Work with qemu guests launched from the command line

You could achieve (1) by tying perf to libvirt but that won't work for 
(2).  There are a few practical problems with (2).

qemu does not require the user to associate any uniquely identifying 
information with a VM.  We've also optimized the command line use case 
so that if all you want to do is run a disk image, you just execute 
"qemu foo.img".  To satisfy your use case, we would either have to force 
a user to always specify unique information, which would be less 
convenient for our users or we would have to let the name be an optional 
parameter.

As it turns out, we already support "qemu -name Fedora foo.img".  What 
we don't do today, but I've been suggesting we should, is automatically 
create a QMP management socket in a well ...
--

From: Frank Ch. Eigler
Date: Tuesday, March 16, 2010 - 5:41 pm

Hi -


To what extent could this be solved with less crossing of
isolation/abstraction layers, if the perfctr facilities were properly
virtualized?  That way guests could run perf goo internally.
Optionally virt tools on the host side could aggregate data from
cooperating self-monitoring guests.

- FChE
--

From: Avi Kivity
Date: Tuesday, March 16, 2010 - 8:54 pm

That's the more interesting (by far) usage model.  In general guest 
owners don't have access to the host, and host owners can't (and 
shouldn't) change guests.

Monitoring guests from the host is useful for kvm developers, but less 
so for users.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Ingo Molnar
Date: Wednesday, March 17, 2010 - 1:16 am

Guest space profiling is easy, and 'perf kvm' is not about that. (plain 'perf' 
will work if a proper paravirt channel is opened to the host)

I think you might have misunderstood the purpose and role of the 'perf kvm' 
patch here? 'perf kvm' is aimed at KVM developers: it is them who improve KVM 
code, not guest kernel users.

	Ingo
--

From: Avi Kivity
Date: Wednesday, March 17, 2010 - 1:20 am

Of course I understood it.  My point was that 'perf kvm' serves a tiny 
minority of users.  That doesn't mean it isn't useful, just that it 
doesn't satisfy all needs by itself.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Ingo Molnar
Date: Wednesday, March 17, 2010 - 1:59 am

I hope you wont be disappointed to learn that 100% of Linux, all 13+ million 
lines of it, was and is being developed by a tiny, tiny, tiny minority of 

Of course - and it doesnt bring world peace either. One step at a time.

Thanks,

	Ingo
--

From: Huang, Zhiteng
Date: Wednesday, March 17, 2010 - 10:27 pm

Hi Avi, Ingo,

I've been following through this long thread since the very first email.  

I'm a performance engineer whose job is to tune workloads run on top of KVM (and Xen previously).  As a performance engineer, I desperately want to have a tool that can monitor the host and guests at the same time.  Think about >100 guests, a mix of Linux and Windows, running together on a single system: being able to know what's happening is critical for performance analysis.  Actually I am the person who asked Yanmin to add the feature for CPU utilization breakdown (into host_usr, host_krn, guest_usr, guest_krn) so that I can monitor dozens of running guests.  I haven't made this patch work on my system yet but I _do_ think this patch is a very good start.

And finally, monitoring guests from the host is useful for users too (administrators and performance guys like me).  I really appreciate your work and would love to provide feedback from my point of view if needed.


Regards,

HUANG, Zhiteng

Intel SSG/SSD/SPA/PRC Scalability Lab


-----Original Message-----
From: kvm-owner@vger.kernel.org [mailto:kvm-owner@vger.kernel.org] On Behalf Of Avi Kivity
Sent: Wednesday, March 17, 2010 11:55 AM
To: Frank Ch. Eigler
Cc: Anthony Liguori; Ingo Molnar; Zhang, Yanmin; Peter Zijlstra; Sheng Yang; linux-kernel@vger.kernel.org; kvm@vger.kernel.org; Marcelo Tosatti; Joerg Roedel; Jes Sorensen; Gleb Natapov; Zachary Amsden; ziteng.huang@intel.com
Subject: Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side


That's the more interesting (by far) usage model.  In general guest 
owners don't have access to the host, and host owners can't (and 
shouldn't) change guests.

Monitoring guests from the host is useful for kvm developers, but less 
so for users.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Ingo Molnar
Date: Wednesday, March 17, 2010 - 1:14 am

Note, 'perfctr' is a different out-of-tree Linux kernel project run by someone 
else: it offers the /dev/perfctr special-purpose device that allows raw, 
unabstracted, low-level access to the PMU.

I suspect the one you wanted to mention here is called 'perf' or 'perf 
events'. (and used to be called 'performance counters' or 'perfcounters' until 
it got renamed about a year ago)

Thanks,

	Ingo
--

From: Ingo Molnar
Date: Wednesday, March 17, 2010 - 1:53 am

What did you think, that it would be world-readable? Why would we do such a 
stupid thing? Any mounted content should at minimum match whatever policy 
covers the image file. The mounting of contents is not a privilege escalation 
and it is already possible today - just not integrated properly and not 
practical. (and apparently not implemented for all the wrong 'security' 

_In_ the guest you can of course run it just fine. (once paravirt bits are in 
place)

That has no connection to 'perf kvm' though, which this patch submission is 
about ...

If you want unified profiling of both host and guest then you need access to 
both the guest and the host. This is what the 'perf kvm' patch is about. 
Please read the patch, i think you might be misunderstanding what it does ...

Regarding encrypted contents - that's really a distraction but the host has 
absolute, 100% control over the guest and there's nothing the guest can do 
about that - unless you are thinking about the sub-sub-case of Orwellian 
DRM-locked-down systems - in which case there's nothing for the host to mount 
and the guest can reject any requests for information on itself and impose 
additional policy that way. So it's a security non-issue.

Note that DRM is pretty much the worst place to look at when it comes to 
usability: DRM lock-down is the antithesis of usability. Do you really want 
KVM to match the mind-set of the RIAA and MPAA? Why do you pretend that a 
developer cannot mount his own disk image? Pretty please, help Linux instead, 
where development is driven by usability and accessibility ...

Thanks,

	Ingo
--

From: Anthony Liguori
Date: Tuesday, March 16, 2010 - 10:06 am

You're making too many assumptions.

There is no list of guests any more than there is a list of web browsers.

You can have a multi-tenant scenario where you have distinct groups of 

Does "perf kvm list" always run as root?  What if two unprivileged users 
both have a VM named "Fedora"?

If we look at the use-case, it's going to be something like, a user is 
creating virtual machines and wants to get performance information about 
them.

Having to run a separate tool like perf is not what they would expect to 
have to do.  Instead, they would either use their existing GUI tool (like 
virt-manager) or they would use their management interface (either QMP or 
libvirt).

The complexity of interaction is due to the fact that perf shouldn't be 
a stand alone tool.  It should be a library or something with a 
programmatic interface that another tool can make use of.

Regards,


--

From: Ingo Molnar
Date: Tuesday, March 16, 2010 - 10:39 am

"multi-tenant" and groups is not a valid excuse at all for giving crappy 
technology in the simplest case: when there's a single VM. Yes, eventually it 
can be supported and any sane scheme will naturally support it too, but it's 
by no means what we care about primarily when it comes to these tools.

I thought everyone learned the lesson behind SystemTap's failure (and to a 
certain degree this was behind Oprofile's failure as well): when it comes to 
tooling/instrumentation we dont want to concentrate on the fancy complex 
setups and abstract requirements drawn up by CIOs, as development isnt being 
done there. Concentrate on our developers today, and provide no-compromises 
usability to those who contribute stuff.

If we dont help make the simplest (and most common) use-case convenient then 

Again, the single-VM case is the most important case, by far. If you have 
multiple VMs running and want to develop the kernel on multiple VMs (sounds 
rather messy if you think it through ...), what would happen is similar to 
what happens when we have two probes for example:

 # perf probe schedule
 Added new event:
   probe:schedule                           (on schedule+0)

 You can now use it on all perf tools, such as:

 	perf record -e probe:schedule -a sleep 1

 # perf probe -f schedule   
 Added new event:
   probe:schedule_1                         (on schedule+0)

 You can now use it on all perf tools, such as:

 	perf record -e probe:schedule_1 -a sleep 1

 # perf probe -f schedule
 Added new event:
   probe:schedule_2                         (on schedule+0)

 You can now use it on all perf tools, such as:

 	perf record -e probe:schedule_2 -a sleep 1

Something similar could be used for KVM/Qemu: whichever got created first is 

But ... a GUI interface/integration is of course possible too, and it's being 
worked on.

perf is mainly a kernel developer tool, and kernel developers generally dont 
use GUIs to do their stuff: which is the (sole) reason why ...
--

From: Anthony Liguori
Date: Tuesday, March 16, 2010 - 4:07 pm

It's about who owns the user interface.

If qemu owns the user interface, then we can satisfy this in a very 
simple way by adding a perf monitor command.  If we have to support 
third party tools, then it significantly complicates things.

Regards,


--

From: Ingo Molnar
Date: Wednesday, March 17, 2010 - 1:10 am

Of course illogical modularization complicates things 'significantly'.

I wish both you and Avi looked back 3-4 years and realized what made KVM so 
successful back then and why the hearts and minds of virtualization developers 
were captured by KVM almost overnight.

KVM's main strength back then was that it was a surprisingly functional piece 
of code offered by a 10 KLOC patch - right on the very latest upstream kernel. 
Code was shared with upstream, there was version parity, and it all was in the 
same single repo which was (and is) a pleasure to develop on.

Unlike Xen, which was a 200+ KLOC patch on top of a forked 10 MLOC kernel a 
few upstream versions back. Xen had constant version friction due to that fork 
and due to that forced/false separation/modularization: Xen _itself_ was a 
fork of Linux to begin with. (for example Xen still had my copyrights last i 
checked, which it got from old Linux code i worked on)

That forced separation and version friction in Xen was a development and 
productization nightmare, and developing on KVM was a truly refreshing 
experience. (I'll go out on a limb to declare that you wont find a _single_ 
developer on this list who will tell us otherwise.)

Fast forward to 2010. The kernel side of KVM is maximum goodness - by far the 
worst-quality remaining aspects of KVM are precisely in areas that you 
mention: 'if we have to support third party tools, then it significantly 
complicates things'. You kept Qemu as an external 'third party' entity to KVM, 
and KVM is clearly hurting from that - just see the recent KVM usability 
thread for examples about suckage.

So a similar 'complication' is the crux of the matter behind KVM quality 
problems: you've not followed through with the original KVM vision and you 
have not applied that concept to Qemu!

And please realize that the user does not care that KVM's kernel bits are top 
notch, if the rest of the package has sucky aspects: it's always the weakest 
link of the chain that ...
From: Avi Kivity
Date: Thursday, March 18, 2010 - 1:20 am

Any qemu usability problems are because developers (or their employers) 
are not interested in fixing them, not because of the repository 
location.  Most kvm developer interest is in server-side deployment 
(even for desktop guests), so there is limited effort in implementing a 

I'll ignore the repository location which should be immaterial to a 
serious developer and concentrate on the 'clean and minimal' aspect, 
since it has some merit.  Qemu development does have a tension between 
the needs of kvm and tcg.  For kvm we need fine-grained threading to 
improve performance and tons of RAS work.  For tcg these are mostly 
meaningless, and the tcg code has sufficient inertia to reduce the rate 
at which we can develop.

Nevertheless, the majority of developers feel that we'll lose more by a 

The majority of patches to qemu don't require changes to kvm, and vice 
versa.  The interface between qemu and kvm is fairly narrow, and most of 
the changes are related to save/restore and guest debugging, hardly 

When a feature is developed that requires both kernel and qemu changes, 
the same developer makes the changes in both projects.  Having them in 

Let's make a list of projects that don't need to be in the kernel 
repository, it will probably be shorter.

Seriously, libvirt is a cross-platform cross-hypervisor library, it has 

In fact I try hard not to rely too much on that.  While both kvm guest 
and host code are in the same repo, there is an ABI barrier between them 
because we need to support any guest version on any host version.  When 
designing, writing, or reading guest or host code that interacts across 
that barrier we need to keep forward and backward compatibility in 
mind.  It's very different from normal kernel APIs that we can adapt 

I really don't understand why you believe that.  You seem to want a 
virtualbox-style GUI, and lkml is probably the last place in the world 
to develop something like that.  The developers here are mostly 
uninterested in ...
From: Ingo Molnar
Date: Thursday, March 18, 2010 - 1:56 am

If qemu was in tools/kvm/ then we wouldnt have such issues. A single patch (or 
series of patches) could modify tools/kvm/, arch/x86/kvm/, virt/ and 
tools/perf/.

Numerous times did we have patches to kernel/perf_event.c that fixed some 
detail, also accompanied by a tools/perf/ patch fixing another detail. Having 
a single 'culture of contribution' is a powerful way to develop.

It turns out kernel developers can be pretty good user-space developers as 
well and user-space developers can be pretty good kernel developers as well. 
Some like to do both - as long as it's all within a single project.

The moment any change (be it as trivial as fixing a GUI detail or as complex 
as a new feature) involves two or more packages, development speed slows down 
to a crawl - while the complexity of the change might be very low!

Also, there's the harmful process that people start categorizing themselves 
into 'I am a kernel developer' and 'I am a user space programmer' stereotypes, 

The same has been said of oprofile as well: 'it somewhat sucks because we are 
too server centric', 'nobody is interested in good usability and oprofile is 
fine for the enterprises'. Ironically, the same has been said of Xen usability 
as well, up to the point KVM came around.

The core of the problem was a bad design and a split kernel-side 
user-side tool landscape.

In fact i think saying that 'our developers only care about the server' is 
borderline dishonest, when at the same time you are making it doubly sure (by 
inaction) that it stays so: by leaving an artificial package wall between 
kernel-side KVM and user-side KVM and not integrating the two technologies.

You'll never know what heights you could achieve if you leave that wall there 
...

Furthermore, what should be realized is that bad usability hurts "server 
features" just as much. Most of the day-to-day testing is done on the desktop 
by desktop oriented testers/developers. _Not_ by enterprise shops - they tend 
to see ...
From: Alexander Graf
Date: Thursday, March 18, 2010 - 2:24 am

It's not a 1:1 connection. There are more users of the KVM interface. To name a few I'm aware of:

- Mac-on-Linux (PPC)
- Dolphin (PPC)
- Xenner (x86)
- Kuli (s390)

Having a clear userspace interface is the only viable solution there. And if you're interested, look at my MOL enabling patch. It's less than 500 lines of code.

The kernel/userspace interface really isn't the difficult part. Getting device emulation working properly, easily and fast is.


Alex
--

From: Ingo Molnar
Date: Thursday, March 18, 2010 - 3:10 am

There must be a misunderstanding here: tools/perf/ still has a clear userspace 
interface and ABI. There's external projects making use of it: sysprof and 
libpfm (and probably more i dont know about). Those projects are also 
contributing back.

Still it's _very_ useful to have a single reference implementation under 
tools/perf/ where we concentrate the best of the code. That is where we make 
sure that each new kernel feature is appropriately implemented in user-space 
as well, that the combination works well together and is releasable to users. 
That is what keeps us all honest: the latency of features is much lower, and 
there's no ping-pong of blame going on between the two components in case of 
bugs or in case of misfeatures.

Same goes for KVM+Qemu: it would be so much nicer to have a single, 
well-focused reference implementation under tools/kvm/ and have improvements 
flow into that code base.

That way KVM developers cannot just shrug "well, GUI suckage is a user-space 
problem" - like the answers i got in the KVM usability thread ...

The buck will stop here.

And if someone thinks he can do better an external project can be started 

Why do you suppose that what i propose is an "either or" scenario?

It isnt. I just suggested that instead of letting core KVM fragment its limbs 
into an external entity, put your name behind one good all-around solution and 
focus the development model into a single project.

I.e. do what KVM has done originally in the kernel space to begin with - and 
where it was so much better than Xen: single focus.

Learn from what KVM has done so well in the initial years and use the concept 
on the user-space components as well. The very same arguments that caused KVM 
to integrate into the upstream kernel (instead of being a separate project) 
are a valid basis to integrate the user-space components into tools/kvm/. Dont 

The kernel/userspace ABI is not difficult at all. Getting device emulation 
working properly, easily and ...
From: Avi Kivity
Date: Thursday, March 18, 2010 - 3:21 am

That would make sense for a truly minimal userspace for kvm: we once had 
a tool called kvmctl which was used to run tests (since folded into 
qemu).  It didn't contain a GUI and was unable to run a general purpose 
guest.  It was a few hundred lines of code, and indeed patches to kvmctl 
had a much closer correspondence to patches with kvm (though still low, 

Suppose we copy qemu tomorrow into tools/.  All the problems will be 
copied with it.  Someone still has to write patches to fix them.  Who 

Moving emulation into the kernel is indeed a problem.  Not because it's 
difficult, but because it indicates that the interfaces exposed to 
userspace are insufficient to obtain good performance.  We had that with 

That's reasonable in the first iterations of a project.

-- 
error compiling committee.c: too many arguments to function

--

From: Ingo Molnar
Date: Thursday, March 18, 2010 - 4:35 am

If it's functional to the extent of at least allowing say a serial console via 
the console (like the UML binary allows) i'd expect the minimal user-space to 
quickly grow out of this minimal state. The rest will be history.

Maybe this is a better, simpler (and much cleaner and less controversial) 
approach than moving a 'full' copy of qemu there.

There's certainly no risk: if qemu stays dominant then nothing is lost 
[tools/kvm/ can be removed after some time], and if this clean base works out 
fine then the useful qemu technologies will move over to it gradually and 
without much fuss, and the developers will move with it as well.

If it's just a token effort with near zero utility to begin with it certainly 
wont take off.

Once it's there in tools/kvm/ and bootable i'd certainly hack up some quick 
xlib based VGA output capability myself - it's not that hard ;-) It would also 
allow me to test whether latest-KVM still boots fine in a much simpler way. 
(most of my testboxes dont have qemu installed)


What we saw with tools/perf/ was that pure proximity to actual kernel testers 
and kernel developers produces a steady influx of new developers. It didnt 
happen overnight, but it happened. A simple:

  cd tools/perf/
  make -j install

Gets them something to play with. That kind of proximity is very powerful.

The other benefit was that distros can package perf with the kernel package, 
so it's updated together with the kernel. This means a very efficient 
distribution of new technologies, together with new kernel releases.

Distributions are very eager to update kernels even in stable periods of the 
distro lifetime - they are much less willing to update user-space packages.

You can literally get full KVM+userspace features done _and deployed to users_ 
within the 3 months development cycle of upstream KVM.

All these create synergies that are very clear once you see the process in 
motion. It's a powerful positive feedback loop. Give it some thought ...
From: Alexander Graf
Date: Thursday, March 18, 2010 - 5:00 am

Alright, you just volunteered. Just give it a go and try to implement
the "oh so simple" KVM frontend while maintaining compatibility with at
least a few older Linux guests. My guess is that you'll realize it's a
dead end before committing anything to the kernel source tree. But
really, just try it out.


Good Luck

Alex
--

From: Frank Ch. Eigler
Date: Thursday, March 18, 2010 - 5:33 am

Sorry, er, what?  What distributions eagerly upgrade kernels in stable
periods, were it not primarily motivated by security fixes?  What users
eagerly replace their kernels?

- FChE
--

From: John Kacur
Date: Thursday, March 18, 2010 - 6:01 am

Us guys reading and participating on the list. ;)
--

From: Ingo Molnar
Date: Thursday, March 18, 2010 - 7:25 am

I'd like to second that - i'm actually quite happy to update the distro 
kernel. Also, i have rarely any problems even with bleeding edge kernels in 
rawhide - they are working pretty smoothly.

A large xorg update showing up in yum update gives me the cringe though ;-)

	Ingo
--

From: Frank Ch. Eigler
Date: Thursday, March 18, 2010 - 7:39 am

Hi -


From a parochial point of view, that makes perfect sense: someone
else's large software changes are a source of concern.  The same thing
applies to non-LKML people -- ordinary users -- when *your* large
software changes are proposed.

Perhaps this change in perspective would help you see the absurdity of
proposing kernel-2.6.git as a hosting repository for all kinds of
stuff, on the theory that kernel updates get pushed to "eager" users
more frequently than other kinds of updates.  (Never mind that data
shows otherwise.)


- FChE
--

From: Ingo Molnar
Date: Thursday, March 18, 2010 - 6:02 am

Please check the popular distro called 'Fedora' for example, and its kernel 

Those 99% who click on the 'install 193 updates' popup.

	Ingo
--

From: Avi Kivity
Date: Thursday, March 18, 2010 - 6:10 am

Of which 1 is the kernel, and 192 are userspace updates (of which one 
may be qemu).

-- 
error compiling committee.c: too many arguments to function

--

From: Ingo Molnar
Date: Thursday, March 18, 2010 - 6:31 am

I think you didnt understand my (tersely explained) point - which is probably 
my fault. What i said is:

 - distros update the kernel first. Often in stable releases as well if 
   there's a new kernel released. (They must because it provides new hardware
   enablement and other critical changes they generally cannot skip.)

 - Qemu on the other hand is not upgraded with (nearly) that level of urgency.
   Completely new versions will generally have to wait for the next distro
   release.

With in-kernel tools the kernel and the tooling that accompanies the kernel 
are upgraded in the same low-latency pathway. That is a big plus if you are 
offering things like instrumentation (which perf does), which relates closely 
to the kernel.

Furthermore, many distros package up the latest -git kernel as well. They 
almost never do that with user-space packages.

Let me give you a specific example:

I'm running Fedora Rawhide with 2.6.34-rc1 right now on my main desktop, and 
that comes with perf-2.6.34-0.10.rc1.git0.fc14.noarch.

My rawhide box has qemu-kvm-0.12.3-3.fc14.x86_64 installed. That's more than 
1000 Qemu commits older than the latest Qemu development branch.

So by being part of the kernel repo there's lower latency upgrades and earlier 
and better testing available on most distros.

You made it very clear that you dont want that, but please dont try to claim 
that those advantages do not exist - they are very much real and we are making 
good use of it.

Thanks,

	Ingo
--

From: Daniel P. Berrange
Date: Thursday, March 18, 2010 - 6:44 am

This has nothing to do with them being in separate source repos. We could
update QEMU to new major feature releases with the same frequency in a Fedora
release, but we deliberately choose not to rebase the QEMU userspace because
experience has shown the downside from new bugs / regressions outweighs the
benefit of any new features.

The QEMU updates in stable Fedora trees now just follow the minor bugfix
release stream provided by QEMU & those arrive in Fedora with little
noticeable delay.

Daniel
-- 
|: Red Hat, Engineering, London    -o-   http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :|
|: http://autobuild.org        -o-         http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505  -o-   F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|
--

From: Ingo Molnar
Date: Thursday, March 18, 2010 - 6:59 am

That is exactly what i said: Qemu and most user-space packages are on a 
'slower' update track than the kernel: generally updated for minor releases.

My further point was that the kernel on the other hand gets updated more 
frequently and as such, any user-space tool bits hosted in the kernel repo get 
updated more frequently as well.

Thanks,

	Ingo
--

From: John Kacur
Date: Thursday, March 18, 2010 - 7:06 am

Just to play devil's advocate, let's not mix up the development model with the
distribution model. There is nothing to stop packagers and distributors from
providing separate kernel "proper" packages and perf tools packages.

It might even make good sense assuming backwards compatibility for distros
that have conservative policies about new kernel versions to provide newer
perf tools packages with older kernels.

John
--

From: Ingo Molnar
Date: Thursday, March 18, 2010 - 7:11 am

Of course. Some distros are also very conservative about updating the kernel 
at all.

I'm mostly talking about the distros that are at the frontier of kernel 
development: those with fresh packages, those which provide eager 
bleeding-edge testers and developers.

	Ingo
--

From: Avi Kivity
Date: Thursday, March 18, 2010 - 6:46 am

No, they don't.  RHEL 5 is still on 2.6.18, for example.  Users don't 
like their kernels updated unless absolutely necessary, with good reason.


F12 recently updated to 2.6.32.  This is probably due to 2.6.31.stable 
dropping away, and no capacity at Fedora to maintain it on their own.  
So they are caught in a bind - stay on 2.6.31 and expose users to 
security vulnerabilities or move to 2.6.32 and cause regressions.  Not a 

I'm sure if we ask the Fedora qemu maintainer to package qemu-kvm.git 

I don't mind at all if rawhide users run on the latest and greatest, but 
release users deserve a little more stability.

-- 
error compiling committee.c: too many arguments to function

--

From: Ingo Molnar
Date: Thursday, March 18, 2010 - 6:57 am

I just replied to Frank Ch. Eigler with a specific example that shows how this 


If you check the update frequency of RHEL 5 kernels you'll see that it's 

Happy choice or not, this is what i said is the distro practice these days. (i 

Rawhide is generally for latest released versions, to ready them for the next 
distro release - with special exception for the kernel, which has a special 
position due to being a hardware-enabler and because it has an extremely 
predictable release schedule of every 90 days (+- 10 days).

Very rarely do distro people jump versions for things like GCC or Xorg or 
Gnome/KDE, but they've been burned enough times by unexpected delays in those 
projects to be really loath to do it.

Qemu might get an exception - dunno, you could ask. My point still holds: by 
hosting KVM user-space bits in the kernel together with the rest of KVM you 
get version parity - which has clear advantages.


What are you suggesting, that released versions of KVM are not reliable? Of 
course any tools/ bits are release engineered just as much as the rest of KVM 
...

	Ingo
--

From: Avi Kivity
Date: Thursday, March 18, 2010 - 7:25 am

I'm sorry to say that's pretty bad.  Users don't want to update their 

So in addition to all the normal kernel regressions, you want to force 

No, I am suggesting qemu-kvm.git is not as stable as released versions 
(and won't get fixes backported).  Keep in mind that unlike many 
userspace applications, qemu exposes an ABI to guests which we must keep 
compatible.

-- 
error compiling committee.c: too many arguments to function

--

From: Ingo Molnar
Date: Thursday, March 18, 2010 - 7:36 am

So instead you force an NxN compatibility matrix [all versions of qemu combined 
with all versions of the kernel] rather than a linear N versions matrix with a 
clear focus on the last version. Brilliant engineering i have to say ;-)
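
To put rough numbers on the two shapes being argued over, here is a minimal sketch. It assumes N supported releases on each side; the count of 6 and the version labels are made up for illustration and do not describe any real support policy. With independent release streams every qemu/kernel pairing is a candidate for testing, while with lock-step releases only the matching pairs are.

```python
from itertools import product

# Hypothetical version labels; the count of 6 is an assumption for illustration.
kernels = [f"2.6.{n}" for n in range(29, 35)]
qemus = [f"qemu-{n}" for n in range(29, 35)]

# Independent release streams: any qemu may meet any kernel.
independent = list(product(qemus, kernels))

# Lock-step releases: each qemu ships with exactly one kernel.
lockstep = list(zip(qemus, kernels))

print(len(independent))  # 36 pairings
print(len(lockstep))     # 6 pairings
```

The sketch only counts pairings; it says nothing about how many of them actually need testing when the interface between the two sides is kept stable.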

Also, by your argument the kernel should be split up into a micro-kernel, with 
different packages for KVM, scheduler, drivers, upgradeable separately.

That would be a nightmare. (i can detail many facets of that nightmare if you 
insist but i'll spare the electrons for now) Fortunately few kernel developers 

I think you still dont understand it: if a tool moves to the kernel repo, then 
it is _released stable_ together with the next stable kernel.

I.e. you'd get a stable qemu-2.6.34 in essence, when v2.6.34 is released. You 
get minor updates with 2.6.34.1, 2.6.34.2, 2.6.34.3, etc - while development 
continues.

I.e. you get _more_ stability, because a matching kernel is released with a 
matching Qemu.

Qemu might have a different release schedule. Which, i argue, is not a good 
thing for exactly that reason :-) If it moved to tools/kvm/ it would get the 
same 90 days release frequency, merge window and stabilization window 
treatment as the upstream kernel.

Furthermore, users can also run experimental versions of qemu together with 
experimental versions of the kernel, by running something like 2.6.34-rc1 on 
Rawhide. Even if they dont download the latest qemu git and build it.

I.e. clearly _more_ is possible in such a scheme.

	Ingo
--

From: Avi Kivity
Date: Thursday, March 18, 2010 - 7:51 am

Thanks.  In fact we have a QxKxGxT compatibility matrix since we need 
to keep compatibility with guests and with tools.  Since the easiest 
interface to keep compatible is the qemu/kernel interface, allowing the 
kernel and qemu to change independently allows reducing the 
compatibility matrix while still providing some improvements.

Regardless of that I'd keep binary compatibility anyway.  Not everyone 
is on the update treadmill with everything updating every three months 

Some kernels do provide some of that facility (without being 
microkernels), for example the Windows and RHEL kernels.  So it seems 


I was confused by the talk about 2.6.34-rc1, which isn't stable.

-- 
error compiling committee.c: too many arguments to function

--

From: Frank Ch. Eigler
Date: Thursday, March 18, 2010 - 6:24 am

I do believe I've heard of it.  According to fedora bodhi, there have
been 18 kernel updates issued for fedora 11 since its release, of
which 12 were for purely security updates, and most of the other six
also contain security fixes.  None are described as 'enhancement'
updates.  Oh, what about fedora 12?  8 updates total, of which 5 are
security only, one for drm showstoppers, others including security
fixes, again 0 tagged as 'enhancement'.

So where is that "eagerness" again??  My sense is that most users are
happy to leave a stable kernel running as long as possible, and
distributions know this.  You surely must understand that the lkml


That's not "eager".  That's "I'm exasperated from guessing what's
really important; let's not have so many updates; meh".


- FChE
--

From: Ingo Molnar
Date: Thursday, March 18, 2010 - 6:48 am

You are quite wrong, despite the sarcastic tone you are attempting to use, and 
this is distro kernel policy 101.

For distros such as Fedora it's simpler to support the same kernel version 
across many older versions of the distro than having to support different 
kernel versions.

Check Fedora 12 for example. Four months ago it was released with kernel 
v2.6.31:

 http://download.fedora.redhat.com/pub/fedora/linux/releases/12/Fedora/x86_64/os/Packag...

But if you update a Fedora 12 installation today you'll get kernel v2.6.32:

 http://download.fedora.redhat.com/pub/fedora/linux/updates/12/SRPMS/kernel-2.6.32.9-70...

As a result you'll get a new 2.6.32 kernel on Fedora 12.

The end result is what i said in the previous mail: that you'll get a newer 
kernel even on a stable distro - while user-space packages will only be 
updated if there's a security issue (and even then there's no version jump 

Erm, fact is, 99% [WAG] of the users click on the update button and accept 
whatever kernel version the distro update offers them.

	Ingo
--

From: Avi Kivity
Date: Thursday, March 18, 2010 - 3:12 am

We would have exactly the same issues, only they would be in a single 
repository.  The only difference is that we could ignore potential 
alternatives to qemu, libvirt, and RHEV-M.  But that's not how kernel 
ABIs are developed, we try to make them general, not suited to just one 

In fact kvm started out in a single repo, and it certainly made it easy 
to bring it up in baby steps.  But we've long outgrown that.  Maybe the 
difference is that perf is still new and thus needs tight cooperation.  
If/when perf gains a real GUI, I doubt more than 1% of the patches will 

Very childish of them.  If someone wants to contribute to a userspace 
project, they can swallow their pride and send patches to a non-kernel 

Why is that?

If the maintainers of all packages are cooperative and responsive, then 
the patches will get accepted quickly.  If they aren't, development will 
be slow.  It isn't any different from contributing to two unrelated 
kernel subsystems (which are in fact in different repositories until the 

You're encouraging this with your proposal.  You're basically using the 

I can accept the bad design (not knowing any of the details), but how 

The wall is maybe four nanometers high.  Please be serious.  If someone 
wants to work on qemu usability all they have to do is to clone the 
repository and start sending patches to qemu-devel@.  What's gained by 
putting it in the kernel repository?  You're saving a minute's worth of 


I'm not saying that improved usability isn't a good thing, but time 
spent on improving the GUI is time not spent on the features that we 
really want.

Desktop oriented users also rarely test 16 vcpu guests with tons of RAM 
exercising 10Gb NICs and a SAN.  Instead they care about graphics 

It's hard to contribute a patch that goes against the architecture of 
the system, where kvm deals with cpu virtualization, qemu (or 
theoretically another tool) manages a guest, and libvirt (or another 
tool) manages the host.  You want a list of ...
From: Ingo Molnar
Date: Thursday, March 18, 2010 - 3:28 am

Not at all - as i replied to in a previous mail, tools/perf/ still has a clear 
userspace interface and ABI, and external projects are making use of it.

So there's no problem with the ABI at all.

In fact our experience has been the opposite: the perf ABI is markedly better 
_because_ there's an immediate consumer of it in the form of tools/perf/. It 
gets tested better and external projects can get their ABI tweaks in as well 
and can provide a reference implementation for tools/perf. This has happened a 
couple of times. It's a win-win scenario.

So the exact opposite of what you suggest is happening in practice.

Thanks,

	Ingo
--

From: Ingo Molnar
Date: Thursday, March 18, 2010 - 3:50 am

It's very simple: because the contribution latencies and overhead compound, 
almost inevitably.

If you ever tried to implement a combo GCC+glibc+kernel feature you'll know 
...

Even with the best-run projects in existence it takes forever and is very 

I'm afraid practice is different from the rosy ideal you paint there. Even 
with assumed 'perfect projects' there's always random differences between 
projects, causing doubled (tripled) overhead and compounded up overhead:

 - random differences in release schedules

 - random differences in contribution guidelines


You mention a perfect example: contributing to multiple kernel subsystems. Even 
_that_ is very noticeably harder than contributing to a single subsystem - due 
to the inevitable bureaucratic overhead, due to different development trees, 
due to different merge criteria.

So you are underlining my point (perhaps without intending to): treating 
closely related bits of technology as a single project is much better.

Obviously arch/x86/kvm/, virt/ and tools/kvm/ should live in a single 
development repository (perhaps micro-differentiated by a few topical 
branches), for exactly those reasons you mention.

Just like tools/perf/ and kernel/perf_event.c and arch/*/kernel/perf*.c are 
treated as a single project.

[ Note: we actually started from a 'split' design [almost everyone picks that, 
  because of this false 'kernel space bits must be separate from user space 
  bits' myth] where the user-space component was a separate code base and 
  unified it later on as the project progressed.

  Trust me, the practical benefits of the unified approach are enormous to 
  developers and to users alike, and there was no looking back once we made 
  the switch. ]

Also, i dont really try to 'convince' you here - you made your position very 
clear early on and despite many unopposed technical arguments i made, the 
positions seem to have hardened and i expect it wont change, no matter what 
arguments i bring. ...
From: Avi Kivity
Date: Thursday, March 18, 2010 - 4:30 am

It's not inevitable, if the projects are badly run, you'll have high 



How is a patch for the qemu GUI eject button and the kvm shadow mmu 
related?  Should a single maintainer deal with both?


-- 
error compiling committee.c: too many arguments to function

--

From: Ingo Molnar
Date: Thursday, March 18, 2010 - 4:48 am

We have co-maintainers for perf that have a different focus. It works pretty 
well.

Look at git log tools/perf/ and how user-space and kernel-space components 
interact in practice. You'll see patches that only impact one side, but you'll see 
very big overlap both in contributor identity and in patches as well.

Also, let me put similar questions in a bit different way:

 - ' how is an in-kernel PIT emulation connected to Qemu's PIT emulation? '

 - ' how is the in-kernel dynticks implementation related to Qemu's 
     implementation of hardware timers? '

 - ' how is an in-kernel event for a CD-ROM eject connected to an in-Qemu 
     eject event? '

 - ' how is a new hardware virtualization feature related to being able to 
     configure and use it via Qemu? '

 - ' how is the in-kernel x86 decoder/emulator related to the Qemu x86 
     emulator? '

 - ' how is the performance of the qemu GUI related to the way VGA buffers are 
     mapped and accelerated by KVM? '

They are obviously deeply related. The quality of a development process is not 
defined by the easy cases where no project unification is needed. The quality 
of a development process is defined by the _difficult_ cases.

	Ingo
--

From: Avi Kivity
Date: Thursday, March 18, 2010 - 5:22 am

Where people sent patches, it doesn't suck (or sucks less).  Where they 
don't, it still sucks.  And it cost way more than $64K.


And it works well when I have patches that change x86 core and kvm.  But 

Both implement the same spec.  One is a code derivative of the other 

The quality of host kernel timers directly determines the quality of 

Both implement the same spec.  The kernel of course needs to handle all 

Most features (example: npt) are transparent to userspace, some are 
not.  When they are not, we introduce an ioctl() to kvm for controlling 

Both implement the same spec.  Note qemu is not an emulator but a binary 

kvm needs to support direct mapping when possible and efficient data 
transfer when not.  The latter will obviously be much slower.  When 
direct mapping is possible, kvm needs to track pages touched by the 
guest to avoid full screen redraws.  The rest (interfacing to X or vnc, 
implementing emulated hardware acceleration, full-screen mode, etc.) are 

Not at all.  kvm in fact knows nothing about vga, to take your last 
example.  To suggest that qemu needs to be close to the kernel to 
benefit from the kernel's timer implementation means we don't care about 
providing quality timing except to ourselves, which luckily isn't the case.

Some time ago the various desktops needed directory change notification, 
and people implemented inotify (or whatever it's called today).  No one 

That's true, but we don't have issues at the qemu/kvm boundary.  Note we 
do have issues at the qemu/aio interfaces and qemu/net interfaces (out 
of which vhost-net was born) but these wouldn't be solved by tools/qemu/.

-- 
error compiling committee.c: too many arguments to function

--

From: Ingo Molnar
Date: Thursday, March 18, 2010 - 6:00 am

So is your point that the development process and basic code structure does 
not matter at all, it's just a matter of people sending patches? I beg to 

Those bits of Fedora which deeply relate to the kernel - yes.

Actually, it works much better if, contrary to your proposal it ends up in a 
single repo. Last i checked both of us really worked on such a project, run by 

You are obviously arguing for something like UML. Fortunately KVM is not that. 

Look at the VGA dirty bitmap optimization a.k.a. the KVM_GET_DIRTY_LOG ioctl.

See qemu/kvm-all.c's kvm_physical_sync_dirty_bitmap().

It started out as a VGA optimization (also used by live migration) and even 
today it's mostly used by the VGA drivers - albeit a weak one.
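
As a rough illustration of the dirty-log idea (one bit per guest page, set when the page has been written), here is a sketch of how a consumer might turn such a bitmap into a list of pages to redraw. This is plain Python operating on a byte string, not the real ioctl call or qemu's code; the helper name `dirty_pages`, the 4 KiB page size, and the bit order within each byte (bit 0 of byte 0 is page 0) are assumptions.

```python
PAGE_SIZE = 4096  # assumed page size in bytes

def dirty_pages(bitmap: bytes) -> list:
    """Return indices of pages whose bit is set (bit 0 of byte 0 = page 0)."""
    pages = []
    for byte_index, byte in enumerate(bitmap):
        for bit in range(8):
            if byte & (1 << bit):
                pages.append(byte_index * 8 + bit)
    return pages

# Example: a 4-byte bitmap covering 32 pages, with pages 0, 9 and 31 dirty.
bitmap = bytes([0b00000001, 0b00000010, 0x00, 0b10000000])
pages = dirty_pages(bitmap)
print(pages)                           # [0, 9, 31]
print([p * PAGE_SIZE for p in pages])  # byte offsets of the pages to redraw
```

Only the pages returned need to be pushed to the display, which is the point of the optimization compared with redrawing the whole framebuffer.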

I wish there were stronger VGA optimizations implemented, copying the dirty 
bitmap is not a particularly performant solution. (although it's certainly 
better than full emulation) Graphics performance is one of the more painful 

That is not what i said. I said they are closely related, and where 
technologies are closely related, project proximity turns into project 

You are misconstruing and misrepresenting my argument - i'd expect better. 
Gnome and KDE run on other kernels as well and are generally not considered 
close to the kernel.


That was not what i suggested. They would be solved by what i proposed: 
tools/kvm/, right?

Thanks,

	Ingo
--

From: Avi Kivity
Date: Thursday, March 18, 2010 - 6:36 am

The development process of course matters, and we have worked hard to 
fix qemu's.  Basic code structure also matters, but you don't fix that 


Well, when last I sent x86 patches, they went to you and hpa, applied to 
tip, from which I had to merge them back.  Two repositories.  After 
several weeks they did end up in a third repository, Linus'.  The 


The VGA dirty bitmap is 256 bytes in length.  Copying it doesn't take 
any time at all.
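
For scale, here is a rough sketch of the arithmetic behind that remark. It is illustrative only and not code from qemu or kvm. It assumes one bit per 4 KiB guest page and that bit 0 is the least significant bit of byte 0; both are assumptions made for the example. Under them, a 256-byte bitmap tracks 2048 pages, i.e. 8 MiB of video memory.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB guest pages


def covered_bytes(bitmap_len, page_size=PAGE_SIZE):
    """Amount of guest memory tracked by a bitmap with one bit per page."""
    return bitmap_len * 8 * page_size


def dirty_pages(bitmap):
    """Return indexes of pages whose bit is set (bit 0 = LSB of byte 0)."""
    pages = []
    for byte_index, byte in enumerate(bitmap):
        for bit in range(8):
            if byte & (1 << bit):
                pages.append(byte_index * 8 + bit)
    return pages


bitmap = bytearray(256)      # the 256-byte bitmap mentioned above
bitmap[0] = 0b00000101       # mark pages 0 and 2 dirty
bitmap[255] = 0b10000000     # mark the last page (2047) dirty
print(covered_bytes(len(bitmap)) // (1024 * 1024), "MiB tracked")
print(dirty_pages(bytes(bitmap)))
```

Only the pages reported dirty would need to be redrawn; everything else in the framebuffer can be skipped.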

People are in fact working on a copy-less dirty bitmap solution, for 
live migration of very large memory guests.  Expect set_bit_user() 

If you have suggestions for further optimizations (or even patches) I'd 
love to hear them.

One solution we are working on is QXL, a framebuffer-less graphics card 
designed for spice.  The use case is again server based (hosted 

I really don't see how.  So what if both qemu and kvm implement an 
i8254?  They can't share any code since the internal APIs are so 
different.  Even worse for the x86 emulator as qemu and kvm are 
fundamentally different.  Even more with the qemu timers and kernel 


The vast majority of qemu has nothing to do with kvm, all the kvm 
interface bits are in two files.  Things like the GUI, the VNC server, 
IDE emulation, the management interface (the monitor), live migration, 
qcow2 and ~15 other file format drivers, chipset emulation, USB 
controller emulation, snapshot support, slirp, serial port emulation, 

If they were, it would be worth it.

-- 
error compiling committee.c: too many arguments to function

--

From: Ingo Molnar
Date: Thursday, March 18, 2010 - 7:09 am

I wouldn't jump to assumptions there. perf shares some facilities with the 
kernel on the source code level - they can be built both in the kernel and in 
user-space.

But my main thought wasn't even to actually share the implementation - but to 
actually synchronize when a piece of device emulation moves into the kernel. 
It is arguably bad for performance in most cases when Qemu handles a given 
device - so all the common devices should be kernel accelerated.

The version and testing matrix would be simplified significantly as well: as 

So is it your argument that the difference and the duplication in x86 
instruction emulation is a good thing? You said it some time ago that
the kvm x86 emulator was very messy and you wish it was cleaner.

While qemu's is indeed rather different (it's partly a translator/JIT), i'm 
sure the decoder logic could be shared - and qemu has a slow-path 
full-emulation fallback in any case, which is similar to what the in-kernel 
emulator does (IIRC ...).

That might have changed meanwhile.

	Ingo
--

From: Avi Kivity
Date: Thursday, March 18, 2010 - 7:38 am

So, you propose to allow running tools/kvm/ only on the kernel it was 
shipped with?


Of course it isn't a good thing, but it is unavoidable.  Qemu compiles 
code just-in-time to avoid interpretation overhead, while kvm emulates 
one instruction at a time.  No caching is possible, especially with 
ept/npt, since the guest is free to manipulate memory with no 
notification to the host.  Qemu also supports the full instruction set 
while kvm only implements what is necessary.  Qemu is a 


IIUC it only ever translates.

-- 
error compiling committee.c: too many arguments to function

--

From: Ingo Molnar
Date: Thursday, March 18, 2010 - 10:16 am

It is, because testing is more focused and more people are testing the 
combination that developers tested as well. (and not some random version 
combination picked by the distributor or the user)

	Ingo
--

From: Anthony Liguori
Date: Thursday, March 18, 2010 - 7:59 am

We have to maintain a dirty bitmap because we don't have a paravirtual 
graphics driver.  IOW, someone needs to write an Xorg driver.

Ideally, we could just implement a Linux framebuffer device, right?  
Well, we took that approach in Xen and that sucks even worse because the 
Xorg framebuffer driver doesn't implement any of the optimizations that 
the Linux framebuffer supports and the Xorg driver does not use 
the kernel's interfaces for providing update regions.

Of course, we need to pull in X into the kernel to fix this, right?

Any sufficiently complicated piece of software is going to interact with 
a lot of other projects.  The solution is not to pull it all into one 
massive repository.  It's to build relationships and to find ways to 
efficiently work with the various communities.

And we're working on this with X.  We'll have a paravirtual graphics 
driver very soon.  There are no magic solutions.  We need more 
developers working on the hard problems.

Regards,

Anthony Liguori
--

From: Ingo Molnar
Date: Thursday, March 18, 2010 - 8:17 am

No, you'd want to interact with DRM.

( Especially as you want to write guest accelerators passing guest-space 
  OpenGL requests straight to the kernel DRM level. )

Especially if you want to do things like graphics card virtualization, with 
aspects of the graphics driver passed through to the guest OS.

These are all kernel space projects, going through Xorg would be a horrible 
waste of performance for full-screen virtualization. It's fine for the 
windowed or networked case (and good as a compatibility fallback), but very 

FYI, this part of X has already been pulled into the kernel, it's called DRM. 

That's my whole point with this thread: the kernel side of KVM and qemu, for 
all practical purposes, should not be two 'separate communities'. They should 
be one and the same thing.

Separation makes sense where the relationship is light or strictly 
hierarchical - here it's neither. KVM and Qemu are interconnected, quite 

The thing is, writing up a DRM connector to a guest Linux OS could be done in 
no time. It could be deployed to users in no time as well, with the proper 
development model.

That after years and years of waiting proper GX support is _still_ not 
implemented in KVM is really telling of the efficiency of development based on 
such disjoint 'communities'. Maybe put up a committee as well to increase 
efficiency? ;-)

	Ingo
--

From: Anthony Liguori
Date: Thursday, March 18, 2010 - 9:38 am

I don't think I've ever used full-screen mode with my VMs and I use 
virtualization on a daily basis.




I don't see any actual KVM developer complaining about this so I'm not 

We lose a huge amount of users and contributors if we put QEMU in the 
Linux kernel.  As I said earlier, a huge number of our contributions 

We've tried to create a "clean" version of QEMU specifically for KVM.  
Moving it into tools/kvm would be the second step.  We've all failed on 

If the problem is combining the two, I've sent you a patch that you can 
put into tip.git if you're so inclined.

Regards,


--

From: Pekka Enberg
Date: Thursday, March 18, 2010 - 9:51 am

Sorry for getting slightly off-topic but I find the above statement interesting.

I don't use virtualization on a daily basis but a working, fully
integrated full-screen model with VirtualBox was the only reason I
bothered to give VMs a second chance. From my point of view, the user
experience of earlier versions (e.g. Parallels) was just too painful
to live with.

/me crawls back to his hole now...

                        Pekka
--

From: Ingo Molnar
Date: Thursday, March 18, 2010 - 10:02 am

That's the same i do, and that's what i'm hearing from other desktop users as 
well.

The moment you work seriously in a guest OS you often want to switch to it 
full-screen, to maximize screen real-estate and to reduce host GUI element 
distractions. If it's just casual use of a single app then windowed mode 
suffices (but in that case performance doesn't matter much to begin with).

I find the 'KVM mostly cares about the server, not about the desktop' attitude 

/me should do that too - this discussion is not resulting in any positive 
result so it has become rather pointless.

	Ingo
--

From: Avi Kivity
Date: Thursday, March 18, 2010 - 10:09 am

It's not kvm, just its developers (and their employers, where 
applicable).  If you post desktop oriented patches I'm sure they'll be 
welcome.

-- 
error compiling committee.c: too many arguments to function

--

From: Ingo Molnar
Date: Thursday, March 18, 2010 - 10:28 am

Just such a patch-set was posted in this very thread: 'perf kvm'.

There were two negative reactions immediately, both showed a fundamental 
server versus desktop bias:

 - you did not accept that the most important usecase is when there is a
   single guest running.

 - the reaction to the 'how do we get symbols out of the guest' sub-question 
   was, paraphrased: 'we don't want that due to <unspecified> security threat 
   to XYZ selinux usecase with lots of guests'.

Anyone aware of how Linux and KVM are being used on the desktop will know 
how detached that attitude is from the typical desktop usecase ...

Usability _never_ sucks because of lack of patches or lack of suggestions. I 
bet if you made the next server feature contingent on essential usability 
fixes they'd happen overnight - for God's sake there's been 1000 commits in 
the last 3 months in the Qemu repository so there's plenty of manpower...

Usability suckage - and i'm not going to be popular for saying this out loud - 
almost always shows a basic maintainer disconnect with the real world. See 
your very first reactions to my 'KVM usability' observations. Read back your 
and Anthony's replies: total 'sure, patches welcome' kind of indifference. It 
is _your project_, not some other project down the road ...

So that is my first-hand experience about how you are welcoming these desktop 
issues, in this very thread. I suspect people try a few times with 
suggestions, then get shot down like our suggestions were shot down and then 
give up.

	Ingo
--

From: Avi Kivity
Date: Friday, March 19, 2010 - 12:56 am

When I review a patch, I try to think of the difficult cases, not just 

First of all I am not a qemu maintainer.  Second, from my point of view 
all contributors are volunteers (perhaps their employer volunteered 
them, but there's no difference from my perspective).  Asking them to 
repaint my apartment as a condition to get a patch applied is abuse.  If 

I could drop everything and write a gtk GUI for qemu.  Is that what you 
want?

If someone is truly interested in qemu usability, it's up to them to 
write the patches.  Personally I've never missed the eject button.

As to disconnect from the real world, most products based on kvm and 
qemu (and Linux) are server based.  Perhaps that's the reason people 
emphasise that?  Maybe if Linux had 10-20% desktop market penetration, 

I don't recall anyone trying this much less being shot down.  Perhaps 
people are concentrating on virt-manager and the like and leaving qemu 
alone.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Ingo Molnar
Date: Friday, March 19, 2010 - 1:53 am

Erm, my usability points are _doubly_ true when there are multiple guests ...

The inconvenience of having to type:

  perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms \
  --guestmodules=/home/ymzhang/guest/modules top

is very obvious even with a single guest. Now multiply that by more guests ...
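
To make the "multiply that by more guests" point concrete, here is a small sketch that spells out the command line per guest. Only the perf options shown above are used; the per-guest directory layout (a directory holding copies of the guest's `kallsyms` and `modules` files) and the second guest's path are assumptions made up for the example.

```python
import shlex


def perf_kvm_top_argv(guest_dir):
    """Build the 'perf kvm ... top' command line for one guest.

    guest_dir is assumed to hold copies of the guest's kallsyms and
    modules files, as in the example command above.
    """
    return [
        "perf", "kvm", "--host", "--guest",
        "--guestkallsyms={}/kallsyms".format(guest_dir),
        "--guestmodules={}/modules".format(guest_dir),
        "top",
    ]


# Hypothetical set of guests; only the first path appears in this thread.
guest_dirs = ["/home/ymzhang/guest", "/home/ymzhang/guest2"]
for guest_dir in guest_dirs:
    print(" ".join(shlex.quote(arg) for arg in perf_kvm_top_argv(guest_dir)))
```

Every guest needs its own pair of manually copied files and its own invocation, which is the inconvenience being described.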

The crux is: we are working on improving KVM instrumentation. There are 
working patches posted to this thread and we would like to have/implement an 
automatism to allow the discovery of all this information. The information 
should be available to the developer who wants it, and easily/transparently so 

You haven't articulated an actionable reason and you have suggested no solution 
either, you just passive-aggressively backed the claim that giving developers 
access to the symbol space is some sort of vague 'security threat'.


That is the crux of the matter. My experience in these threads was that no-one 
really seems to feel in charge of the whole thing. Should we really wonder why 

This is one of the weirdest arguments i've seen in this thread. We make 
contributions conditional on the general shape of the project almost all the 
time. Developers don't get to do just the fun stuff.

This is a basic quid pro quo: new features introduce risks and create 
additional workload not just to the originating developer but on the rest of 
the community as well. You should check how Linus has pulled new features in 
the past 15 years: he very much requires the existing code to first be 
top-notch before he accepts new features for a given area of functionality.

Doing that and insisting on developers to see those imbalances as well is 
absolutely essential to code quality: otherwise everyone would be running 
around implementing just the features they are interested in, without regard 
for the general health of the project.

Of course, if you keep the project in two halves (KVM and Qemu), and pretend 
that they are separate and have little relation, ...
--

From: Anthony Liguori
Date: Friday, March 19, 2010 - 5:56 am

If you want to improve this, you need to do the following:

1) Add a userspace daemon that uses vmchannel that runs in the guest and 
can fetch kallsyms and arbitrary modules.  If that daemon lives in 
tools/perf, that's fine.
2) Add a QMP interface in qemu to interact with such daemon
3) Add a default QMP port in a well known location[1]
4) Modify the perf tool to look for a default QMP port.  In the case of 
a single guest, there's one port.  If there are multiple guests, then 
you will have to connect to each port, find the name or any other 
identifying information, and let the user choose.
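
Step 4 could look roughly like the sketch below. It is not code from perf or qemu. It relies on the QMP greeting, `qmp_capabilities` and `query-name`; how perf would locate the per-guest sockets (the "well known location" of step 3) is left open, and the helper names are invented for the example.

```python
import json
import socket


def qmp_command(stream, name):
    """Send one QMP command and return the payload of its 'return' reply."""
    stream.write(json.dumps({"execute": name}) + "\n")
    stream.flush()
    while True:
        reply = json.loads(stream.readline())
        if "return" in reply:
            return reply["return"]
        if "error" in reply:
            raise RuntimeError(reply["error"])
        # Anything else is an asynchronous event; keep reading.


def guest_name(sock):
    """Do the QMP handshake on a connected socket and ask for the guest name."""
    stream = sock.makefile("rw", encoding="utf-8", newline="\n")
    json.loads(stream.readline())            # greeting: {"QMP": {...}}
    qmp_command(stream, "qmp_capabilities")  # leave negotiation mode
    return qmp_command(stream, "query-name").get("name")


def choose_guest(names, ask=input):
    """One guest: take it. Several: list them and let the user choose."""
    if len(names) == 1:
        return names[0]
    for index, name in enumerate(names):
        print(index, name)
    return names[int(ask("guest number: "))]
```

A caller would connect an `AF_UNIX` socket to each QMP port it finds, collect the names with `guest_name()`, and pass them to `choose_guest()`.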

Patches are certainly welcome.

[1] I've written up this patch and will send it out some time today.

Regards,

Anthony Liguori

--

From: Ingo Molnar
Date: Sunday, March 21, 2010 - 12:17 pm

Adding any new daemon to an existing guest is a deployment and usability 
nightmare.

The basic rule of good instrumentation is to be transparent. The moment we 
have to modify the user-space of a guest just to monitor it, the purpose of 
transparent instrumentation is defeated.

That was one of the fundamental usability mistakes of Oprofile.

There is no 'perf' daemon - all the perf functionality is _built in_, and for 
very good reasons. It is one of the main reasons for perf's success as well.

Now Qemu is trying to repeat that stupid mistake ...

So please either suggest a different transparent solution that is technically 
better than the one i suggested, or you should concede the point really.

Please try to think with the heads of our users and developers and don't suggest 
some weird ivory-tower design that is totally impractical ...

And no, you have to code none of this, we'll do all the coding. The only thing 
we are asking is for you to not stand in the way of good usability ...

Thanks,

	Ingo
--

From: Antoine Martin
Date: Sunday, March 21, 2010 - 12:35 pm

Absolutely. In most cases it is not desirable, and you'll find that in a 
lot of cases it is not even possible - for non-technical reasons.
One of the main benefits of virtualization is the ability to manage and 
Not to mention Heisenbugs and interference.

Cheers

--

From: Ingo Molnar
Date: Sunday, March 21, 2010 - 12:59 pm

Correct.

Frankly, i was surprised (and taken slightly off base) by both Avi and Anthony 
suggesting such a clearly inferior "add a daemon to the guest space" solution. 
It's a usability and deployment non-starter.

Furthermore, allowing a guest to integrate/mount its files into the host VFS 
space (which was my suggestion) has many other uses and advantages as well, 
beyond the instrumentation/symbol-lookup purpose.

So can we please have some resolution here and move on: the KVM maintainers 
should either suggest a different transparent approach, or should retract the 
NAK for the solution we suggested.

We very much want to make progress and want to write code, but obviously we 
cannot code against a maintainer NAK, nor can we code up an inferior solution 
either.

Thanks,

	Ingo
--

From: Avi Kivity
Date: Sunday, March 21, 2010 - 1:09 pm

It's only clearly inferior if you ignore every consideration against 
it.  It's definitely not a deployment non-starter, see the tons of 
daemons that come with any Linux system.  The basic ones are installed 


So long as you define 'transparent' as in 'only the guest kernel is 
involved' or even 'only the guest and host kernels are involved' we 
aren't going to make a lot of progress.  I oppose shoving random bits of 
functionality into the kernel, especially things that are in daily use.  
While us developers do and will use profiling extensively, it doesn't 

You haven't heard any NAKs, only objections.  If we discuss things 
perhaps we can achieve something that works for everyone.  If we keep 
turning the flames higher that's unlikely.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Ingo Molnar
Date: Sunday, March 21, 2010 - 2:00 pm

Avi, please don't put arguments into my mouth that i never made.

My (clearly expressed) argument was that:

    _a new guest-side daemon is a transparent instrumentation non-starter_

What is so hard to understand about that simple concept? Instrumentation is 
good if it's as transparent as possible.

Of course lots of other features can be done via a new user-space package ...

Thanks,

	Ingo
--

From: Avi Kivity
Date: Sunday, March 21, 2010 - 2:44 pm

Sorry, that was not the intent.  I meant that putting things into the 

I believe you can deploy this daemon via a (default) package, without 
any hassle to users.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Anthony Liguori
Date: Sunday, March 21, 2010 - 4:43 pm

FWIW, there's no reason you couldn't consume a vmchannel port from 
within the kernel.  I don't think the code needs to be in the kernel and 
from a security PoV, that suggests that it should be in userspace IMHO.

But if you want to make a kernel thread, knock yourself out.  I have no 
objection to that from a qemu perspective.  I can't see why Avi would 
mind either.  I think it's papering around another problem (the kernel 
should control initrds IMHO) but that's a different topic.

Regards,

Anthony Liguori

--

From: Avi Kivity
Date: Sunday, March 21, 2010 - 1:01 pm

The logical conclusion of that is that everything should be built into 
the kernel.  Where a failure brings the system down or worse.  Where you 
have to bear the memory footprint whether you ever use the functionality 
or not.  Where to update the functionality you need to deploy a new 
kernel (possibly introducing unrelated bugs) and reboot.

If userspace daemons are such a deployment and usability nightmare, 


inetd.d style 'drop a listener config here and it will be executed on 
connection' should work.  The listener could come with the kernel 
package, though I don't think it's a good idea.  module-init-tools 

Thanks.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Olivier Galibert
Date: Sunday, March 21, 2010 - 1:08 pm

Which userspace?  Deploying *anything* in the guest can be a
nightmare, including paravirt drivers if you don't have a natively
supported in the OS virtual hardware backoff.  Deploying things in the
host OTOH is business as usual.

And you're smart enough to know that.

  OG.
--

From: Avi Kivity
Date: Sunday, March 21, 2010 - 1:11 pm

That includes the guest kernel.  If you can deploy a new kernel in the 


Thanks.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Antoine Martin
Date: Sunday, March 21, 2010 - 1:18 pm

That's not always true.
The host admin can control the guest kernel via "kvm -kernel" easily 
enough, but he may or may not have access to the disk that is used in 
the guest. (think encrypted disks, service agreements, etc)


--

From: Avi Kivity
Date: Sunday, March 21, 2010 - 1:24 pm

There is a matching -initrd argument that you can use to launch a 
daemon.  I believe that -kernel use will be rare, though.  It's a lot 
easier to keep everything in one filesystem.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Antoine Martin
Date: Sunday, March 21, 2010 - 1:31 pm

I thought this discussion was about making it easy to deploy... and 
generating a custom initrd isn't easy by any means, and it requires 
Well, for what it's worth, I rarely ever use anything else. My virtual 
disks are raw so I can loop mount them easily, and I can also switch my 
guest kernels from outside... without ever needing to mount those disks.

--

From: Avi Kivity
Date: Sunday, March 21, 2010 - 2:03 pm

That's true.  You need to run mkinitrd anyway, though, unless your guest 

Curious, what do you use them for?

btw, if you build your kernel outside the guest, then you already have 
access to all its symbols, without needing anything further.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Ingo Molnar
Date: Sunday, March 21, 2010 - 2:20 pm

There are two errors with your argument:

1) you are assuming that it's only about kernel symbols

Look at this 'perf report' output:

# Samples: 7127509216
#
# Overhead     Command                  Shared Object  Symbol
# ........  ..........  .............................  ......
#
    19.14%         git  git                            [.] lookup_object
    15.16%        perf  git                            [.] lookup_object
     4.74%        perf  libz.so.1.2.3                  [.] inflate
     4.52%         git  libz.so.1.2.3                  [.] inflate
     4.21%        perf  libz.so.1.2.3                  [.] inflate_table
     3.94%         git  libz.so.1.2.3                  [.] inflate_table
     3.29%         git  git                            [.] find_pack_entry_one
     3.24%         git  libz.so.1.2.3                  [.] inflate_fast
     2.96%        perf  libz.so.1.2.3                  [.] inflate_fast
     2.96%         git  git                            [.] decode_tree_entry
     2.80%        perf  libc-2.11.90.so                [.] __strlen_sse42
     2.56%         git  libc-2.11.90.so                [.] __strlen_sse42
     1.98%        perf  libc-2.11.90.so                [.] __GI_memcpy
     1.71%        perf  git                            [.] decode_tree_entry
     1.53%         git  libc-2.11.90.so                [.] __GI_memcpy
     1.48%         git  git                            [.] lookup_blob
     1.30%         git  git                            [.] process_tree
     1.30%        perf  git                            [.] process_tree
     0.90%        perf  git                            [.] tree_entry
     0.82%        perf  git                            [.] lookup_blob
     0.78%         git  [kernel.kallsyms]              [k] kstat_irqs_cpu

kernel symbols are only a small portion of the symbols. (a single line in this 
case)
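
The proportion can be checked mechanically. The sketch below sums the Overhead column per symbol kind (`[.]` for user space, `[k]` for kernel). It assumes the whitespace-separated column layout shown above and shared-object names without spaces, and it is only an illustration, not part of perf.

```python
from collections import defaultdict


def share_by_kind(report_text):
    """Sum the Overhead column per symbol kind ('[.]' user, '[k]' kernel)."""
    totals = defaultdict(float)
    for line in report_text.splitlines():
        fields = line.split()
        # Expected columns: overhead%, command, shared object, [kind], symbol.
        # Header lines start with '#' and are skipped by the '%' check.
        if len(fields) < 5 or not fields[0].endswith("%"):
            continue
        totals[fields[3]] += float(fields[0].rstrip("%"))
    return dict(totals)


sample = """\
# Overhead     Command                  Shared Object  Symbol
    19.14%         git  git                            [.] lookup_object
     4.74%        perf  libz.so.1.2.3                  [.] inflate
     0.78%         git  [kernel.kallsyms]              [k] kstat_irqs_cpu
"""
print(share_by_kind(sample))
```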

To get to those other symbols we have to read the ELF symbols of those 
binaries in the guest filesystem, in ...
--

From: Avi Kivity
Date: Sunday, March 21, 2010 - 11:35 pm

Okay.  So a symbol server is necessary.  Still, I don't think -kernel is 
a good reason for including the symbol server in the kernel itself.  If 
someone uses it extensively together with perf, _and_ they can't put the 
symbol server in the guest for some reason, let them patch mkinitrd to 

What about line number information?  And the source?  Into the kernel 

I've read every one of your emails.  If I misunderstood or overlooked 
something, I apologize.  The thread is very long and at times 
antagonistic so it's hard to keep all the details straight.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Ingo Molnar
Date: Monday, March 22, 2010 - 4:48 am

Sigh. Please read the _very first_ suggestion i made, which solves all that. I 
rarely go into discussions without suggesting technical solutions - i'm not 
interested in flaming, i'm interested in real solutions.

Here it is, repeated for the Nth time:

Allow a guest to (optionally) integrate its VFS namespace with the host side 
as well. An example scheme would be:

   /guests/Fedora-G1/
   /guests/Fedora-G1/proc/
   /guests/Fedora-G1/usr/
   /guests/Fedora-G1/.../
   /guests/OpenSuse-G2/
   /guests/OpenSuse-G2/proc/
   /guests/OpenSuse-G2/usr/
   /guests/OpenSuse-G2/.../

  ( This feature would be configurable and would be default-off, to maintain 
    the current status quo. )

Line number information and the source (dwarf info) and ELF symbols are all 
provided and accessible via such an interface - no need to run any 'symbol 
daemon' on the guest side.
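
A sketch of how a host-side tool might use such a namespace to find a guest binary for symbol lookup. The `/guests` root follows the example scheme above; the function name and the library path are invented for the illustration, and symlinks inside the guest tree are not handled.

```python
import posixpath


def host_path(guest, guest_path, root="/guests"):
    """Map a path inside a guest to its location in the host VFS."""
    base = posixpath.join(root, guest)
    full = posixpath.normpath(posixpath.join(base, guest_path.lstrip("/")))
    # Refuse paths that climb out of the guest's subtree via '..'.
    if full != base and not full.startswith(base + "/"):
        raise ValueError("path escapes guest root: " + guest_path)
    return full


# Hypothetical DSO location inside the guest.
print(host_path("Fedora-G1", "/usr/lib/libz.so.1.2.3"))
```

With that mapping, the ELF symbols and debug info are read from ordinary host-visible files.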

And, obviously, having the guest VFS namespace (optionally) available on the 
host side also has far more uses than perf's symbol needs.

I was surprised no-one ever came up with such a suggestion - it is so obvious 
to allow the integration of the VFS namespaces. But given your explicit 
declaration of your KVM desktop usability indifference i'm kind of not 
surprised about that anymore.

Thanks,

	Ingo
--

From: Pekka Enberg
Date: Monday, March 22, 2010 - 5:31 am

Heh, funny. That would also solve my number one gripe with
virtualization these days: how to get files in and out of guests
without having to install extra packages on the guest side and
fiddling with mount points on every single guest image I want to play
with.

                        Pekka
--

From: Daniel P. Berrange
Date: Monday, March 22, 2010 - 5:37 am

FYI, for offline guests, you can use libguestfs[1] to access & change files
inside the guest, and it gives read-only access to running guests' files. It
provides access via an interactive shell, APIs in all major languages, and also
has a FUSE module to expose it directly in the host VFS.  It could probably be
made to work read-write for running guests too if its agent were installed
inside the guest & leveraged the new Virtio-Serial channel for comms (avoiding
any network setup requirements).

Regards,
Daniel

[1] http://libguestfs.org/
-- 
|: Red Hat, Engineering, London    -o-   http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :|
|: http://autobuild.org        -o-         http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505  -o-   F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|
--

From: Pekka Enberg
Date: Monday, March 22, 2010 - 5:44 am

Hi Daniel,

(I'm getting slightly off-topic, sorry about that.)


Right. Thanks for the pointer.

The use case I am thinking of is working on a userspace project and 
wanting to test a piece of code on multiple distributions before pushing 
it out. That pretty much means being able to pull from the host git 
repository (or push to the guest repo) while the guest is running, maybe 
changing the code a bit and then getting the changes back to the host 
for the final push.

What I do now is I push the changes on the host side to a (private) 
remote branch and do the work through that. But that's a pretty lame 
workaround in my opinion.

			Pekka
--

From: Ingo Molnar
Date: Monday, March 22, 2010 - 5:54 am

Yes, this is the kind of functionality i'm suggesting.

I'd suggest a different implementation for live guests: to drive this from 
within the live guest side of KVM, i.e. basically a paravirt driver for 
guestfs. You'd pass file API requests to the guest directly, via the KVM ioctl 
or so - and get responses from the guest.

That will give true read-write access and completely coherent (and still 
transparent) VFS integration, with no host-side knowledge needed for the 
guest's low level (raw) filesystem structure. That's a big advantage.

Yes, it needs an 'aware' guest kernel - but that is a one-off transition 
overhead whose cost is zero in the long run. (i.e. all KVM kernels beyond a 
given version would have this ability - otherwise it's guest side distribution 
transparent)

Even 'offline' read-only access could be implemented by booting a minimal 
kernel via qemu -kernel and using a 'ro' boot option. That way you could 
eliminate all lowlevel filesystem knowledge from libguestfs. You could run 
ext4 or btrfs guest filesystems and FAT ones as well - with no restriction.

This would allow 'offline' access to Windows images as well: a FAT or ntfs 
enabled mini-kernel could be booted in read-only mode.
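
As a sketch of what the 'offline' read-only boot might look like, the helper below assembles a qemu command line. The `-kernel`, `-initrd`, `-append` and `-drive` options are standard qemu options; the binary name, the virtio disk interface and all file names are assumptions for the example, and the mini-kernel and its initrd are hypothetical.

```python
def offline_ro_argv(image, kernel, initrd):
    """Command line to boot a mini-kernel against a guest image, read-only."""
    # qemu uses ',' as the option separator, so a literal comma in the
    # file name has to be doubled.
    drive = "file={},if=virtio,readonly=on".format(image.replace(",", ",,"))
    return [
        "qemu-system-x86_64",
        "-kernel", kernel,
        "-initrd", initrd,
        "-append", "ro",   # the 'ro' boot option mentioned above
        "-drive", drive,   # the disk itself is also attached read-only
    ]


# Hypothetical file names.
print(" ".join(offline_ro_argv("guest.img", "mini-vmlinuz", "mini-initrd.img")))
```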

Thanks,

	Ingo
--

From: Daniel P. Berrange
Date: Monday, March 22, 2010 - 6:05 am

This is close to the way libguestfs already works. It boots QEMU/KVM pointing
to a minimal stripped down appliance linux OS image, containing a small agent
it talks to over some form of vmchannel/serial/virtio-serial device. Thus the
kernel in the appliance it runs is the only thing that needs to know about the
filesystem/lvm/dm on-disk formats - libguestfs definitely does not want to be
duplicating this detailed knowledge of on disk format itself. It is doing
full read-write access to the guest filesystem in offline mode - one of the
major use cases is disaster recovery from an unbootable guest OS image.

Regards,
Daniel
-- 
|: Red Hat, Engineering, London    -o-   http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :|
|: http://autobuild.org        -o-         http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505  -o-   F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|
--

From: Richard W.M. Jones
Date: Monday, March 22, 2010 - 6:23 am

As Dan said, the 'daemon' part is separate and could be run as a
standard part of a guest install, talking over vmchannel to the host.
The only real issue I can see is adding access control to the daemon
(currently it doesn't need it and doesn't do any).  Doing it this way
you'd be leveraging the ~250,000 lines of existing libguestfs code,
bindings in multiple languages, tools etc.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
New in Fedora 11: Fedora Windows cross-compiler. Compile Windows
programs, test, and build Windows installers. Over 70 libraries supprt'd
http://fedoraproject.org/wiki/MinGW http://www.annexia.org/fedora_mingw
--

From: Ingo Molnar
Date: Monday, March 22, 2010 - 7:02 am

I think it would be a nice option to allow such guest-side "daemons" to be 
executed in the guest context without _any_ guest-side support.

This would be possible by building such minimal daemons that use vmchannel, 
and which are built for generic x86 (maybe even built for 32-bit x86 so that 
they can run on any x86 distro). They could execute as the init task of any 
guest kernel - Qemu could 'blend in / replace' the binary as the init task of 
the guest temporarily - and some simple bootstrap code could then start the 
daemon and start the real init binary (and turn off the 'blending' of the init 
task).

That way any guest could be extended via such Qemu functionality - even 
without any kernel changes. Has anyone thought about (or coded) such a 
solution perhaps?

	Ingo
--

From: Joerg Roedel
Date: Monday, March 22, 2010 - 7:20 am

I think we don't need per-guest-file access control. Probably we could
apply the image-file permissions to all guestfs files. This would cover
the usecases:

	* perf for reading symbol information (needs ro-access only
	  anyway)
	* Desktop like host<->guest file copy

I have not looked into libguestfs yet but I guess this approach is
easier to achieve.
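
A minimal sketch of that rule, assuming the only input is the guest's image file: whatever the calling user may do to the image decides what they may do to every file exposed through guestfs. The function name is made up for the example.

```python
import os


def guestfs_access(image_path):
    """Derive guestfs access for the current user from the image file."""
    readable = os.access(image_path, os.R_OK)
    writable = os.access(image_path, os.W_OK)
    if readable and writable:
        return "rw"    # e.g. desktop-like host<->guest file copy
    if readable:
        return "ro"    # enough for perf to read symbol information
    return "none"
```

For example, `guestfs_access("/nonexistent/guest.img")` yields `"none"`, since a missing file is neither readable nor writable.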

	Joerg

--

From: Ingo Molnar
Date: Monday, March 22, 2010 - 6:56 am

[ Oops, you are right - sorry for not looking more closely! I was confused by

Just curious: any plans to extend this to include live read/write access as 
well?

I.e. to have the 'agent' (guestfsd) running universally, so that tools such as
perf, and users as well, could rely on the VFS integration, not just disaster
recovery tools?

Without universal access to this feature it's not adequate for instrumentation 
purposes.

One option to achieve that would be to extend Qemu to allow 'qemu daemons' to 
run on the (Linux) guest side. These would be statically linked binaries that 
can run on any Linux system, and which could provide various built-in Qemu 
functionality from the guest side to the host side.

Thanks,

	Ingo
--

From: Ingo Molnar
Date: Monday, March 22, 2010 - 7:07 am

By default i'd suggest to put it into a maximally restricted mount point. I.e. 
restrict access to only the security context running libguestfs or so.

( Which in practice will be the user starting the guest, so there will be 
  proper protection from other users while still allowing easy access to the 
  user that has access already. )

	Ingo
--

From: Richard W.M. Jones
Date: Monday, March 22, 2010 - 7:01 am

Totally.  That's not to say there is a definite plan, but we're very
open to doing this.  We already wrote the daemon in such a way that it
doesn't require the appliance part, but could run inside any existing
guest (we've even ported bits of it to Windoze ...).

The only remaining issue is how access control would be handled.  You
obviously wouldn't want anything in the host that can get access to
the vmchannel socket to start sending destructive write commands into
guests.
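
One way to narrow that down on the host side, as a sketch only: keep the
socket in an owner-only directory and create it with owner-only permissions,
so that only the user who started the guest can connect. The helper name and
the socket file name below are hypothetical, and this is not how Qemu or
libguestfs actually set up vmchannel:

```python
import os
import socket


def create_private_socket(directory, name="vmchannel.sock"):
    """Create a listening Unix socket that only the owning user can reach."""
    # Owner-only directory: other users cannot even look up the socket path.
    os.chmod(directory, 0o700)
    path = os.path.join(directory, name)

    # bind() creates the socket file with mode 0777 & ~umask, so a umask of
    # 0177 yields 0600 (owner read/write only).
    old_umask = os.umask(0o177)
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.bind(path)
    except OSError:
        sock.close()
        raise
    finally:
        os.umask(old_umask)

    sock.listen(1)
    return sock, path
```

This only covers "who may connect"; it says nothing about which commands a
connected client may send, which would still need handling in the daemon.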

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://et.redhat.com/~rjones/virt-df/
--

From: Avi Kivity
Date: Monday, March 22, 2010 - 5:36 am

[...]

You're missing something.  This sub-thread is about someone launching a 
kernel with 'qemu -kernel', the kernel lives outside the guest disk 
image, they don't want a custom initrd because it's hard to make.

-- 
error compiling committee.c: too many arguments to function

--

From: Pekka Enberg
Date: Monday, March 22, 2010 - 5:50 am

Well, you know, I am missing your point here about initrd. Surely the
guest kernels need to use sys_mount() at some point at which time they
could just tell the host kernel where they can find the mount points?
But maybe we're not talking about that kind of scenario here?

                        Pekka
--

From: Zhang, Yanmin
Date: Sunday, March 21, 2010 - 11:59 pm

Above example shows perf could summarize both kernel and application hot functions.
If we collect guest os statistics from host side, we can't summarize detailed guest os
application info because we couldn't get guest os's application process id from host
side. So we could only get detailed kernel info and the total utilization percent of


--

From: Antoine Martin
Date: Monday, March 22, 2010 - 5:05 am

Various things, here is one use case which I think is under-used: 
read-only virtual disks with just one network application on them (no 
runlevels, sshd, user accounts, etc), a hell of a lot easier to maintain 
and secure than a full blown distro. Want a new kernel? boot a new VM 
and swap it for the old one with zero downtime (if your network app 
supports this sort of hot-swap - which a lot of cluster apps do)

Another reason for wanting to keep the kernel outside is to limit the 
potential points of failure: remove the partition table, remove the 
bootloader, remove even the ramdisk. Also makes it easier to switch to 
another solution (say UML) or another disk driver (as someone mentioned 
previously).
In virtualized environments I often prefer to remove the ability to load 
kernel modules too, for obvious reasons.

Hope this helps.

Antoine
--

From: Ingo Molnar
Date: Sunday, March 21, 2010 - 1:37 pm

Note that with perf we can instrument the guest with zero guest-kernel 
modifications as well.

We try to reduce the guest impact to a bare minimum, as the difficulties in 
deployment are a function of the cross section surface to the guest.

Also, note that the kernel is special with regards to instrumentation: since 
this is the kernel project, we are doing kernel space changes, as we are doing 
them _anyway_. So adding symbol resolution capabilities would be a minimal 
addition to that - while adding a whole new guest package for the daemon would
significantly increase the cross section surface.
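
For reference, the host-side lookup itself is small once the guest's symbol
table is available in /proc/kallsyms format (address, type and name on each
line). A sketch under that assumption; the sample addresses are invented and
the function names are illustrative, not perf's actual code:

```python
import bisect

# Invented sample in /proc/kallsyms format: "<hex address> <type> <name> [module]"
SAMPLE_KALLSYMS = """\
ffffffff81000000 T _text
ffffffff81001000 T do_one
ffffffff81002000 t do_two\t[mod]
"""


def parse_kallsyms(text):
    """Return a sorted list of (address, name) pairs from kallsyms text."""
    symbols = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        symbols.append((int(fields[0], 16), fields[2]))
    symbols.sort()
    return symbols


def resolve(symbols, address):
    """Return the name of the last symbol starting at or below 'address'.

    Returns None if the address lies below the first symbol. There is no
    upper bound check: anything past the last symbol is attributed to it.
    """
    addresses = [addr for addr, _ in symbols]
    index = bisect.bisect_right(addresses, address) - 1
    if index < 0:
        return None
    return symbols[index][1]
```

How the table gets from the guest to the host is exactly the open question
here; the lookup does not care.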

	Ingo
--

From: Avi Kivity
Date: Sunday, March 21, 2010 - 11:37 pm

It's true that for us, changing the kernel is easier than changing the 
rest of the guest.  IMO we should still resist the temptation to go the 
easy path and do the right thing (I understand we disagree about what 
the right thing is).

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Ingo Molnar
Date: Monday, March 22, 2010 - 4:39 am

It is not about the 'temptation to go the easy path'.

It is about finding the most pragmatic approach and realizing the cost of 
inaction: sucky Linux, sucky KVM.

Let me give you an example: Linus's commit in v2.6.30 that changed the 
user-space policy of the EXT3 filesystem to make it more desktop capable:

  bbae8bc: ext3: make default data ordering mode configurable

That change was opposed vehemently with your kind of arguments: "such changes
should be done by the distributions", "it should be done correctly", "the 
kernel should not implement policy", etc..

I can also tell you that this commit improved my desktop experience 
incredibly. Still, distros didnt do it for almost a decade of ext3 existence. 
Why?

Truth is that those kinds of "do it right" arguments are mistaken because they 
assume that we live in an ideal, 'perfect market' where all inefficiencies 
will get eliminated in the long run.

In reality the "market" for OSS software is imperfect:

 - there's marginal costs of action - a too small change has difficulty 
   getting over that

 - there's costs of modularization (which are both technical and social)

 - there's the power of the status quo acting against marginally good changes

 - there's the power of entropy ripping Linux distributions apart making
   all-distro changes harder 

So the solution to the "why dont the distributions do this" question you pose 
is exactly what i propose: _give a default, reference implementation of KVM 
tooling that has to be eclipsed_.

There's the unique position of the kernel that it can impose sanity in a more 
central way which acts as a reference implementation.

I.e. the kernel can very much improve quality all across the board by 
providing a sane default (in the ext3 case) - or, as in the case of perf, by 
providing a sane 'baseline' tooling. It should do the same for KVM as well.

If we dont do that, Linux will eventually stop mattering on the desktop - and 
some time after that, it will ...
--

From: Avi Kivity
Date: Monday, March 22, 2010 - 5:44 am

Yet Linux is gaining ground in the server and embedded space while 
struggling on the desktop.  Apple is gaining ground on the desktop but 
is invisible on the server side (despite having a nice product - Xserve).

It's true Windows achieved server dominance through its desktop power,
but I don't think that's what's keeping them there now.

In any case, I'm not going to write a kvm GUI.  It doesn't match my 
skills, interest, or my employer's interest.  If you wish to see a kvm 
GUI you have to write one yourself or convince someone to write it 
(perhaps convince Red Hat to fund such an effort beyond virt-manager).

-- 
error compiling committee.c: too many arguments to function

--

From: Daniel P. Berrange
Date: Monday, March 22, 2010 - 5:54 am

It is planned to add support for SPICE remote desktop to virt-manager
once that matures & is accepted into upstream KVM/QEMU. That will improve
the guest/desktop interaction in many ways compared to VNC or SDL, with
improved display resolution changing, copy+paste between host & guest,
much better graphics performance, etc. 

Development efforts aren't totally ignoring the desktop; rather, they
are focusing on remoting guest desktops instead of host desktop
interaction, since that's where a lot of the demand is. This benefits
single-host desktop scenarios too, since there's a lot of overlap in the
problems faced there.

Regards,
Daniel
-- 
|: Red Hat, Engineering, London    -o-   http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :|
|: http://autobuild.org        -o-         http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505  -o-   F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|
--

From: Ingo Molnar
Date: Monday, March 22, 2010 - 7:26 am

Frankly, Linux is mainly growing in the server space due to:

 1) the server space is technically much simpler than the desktop space. It
    is far easier to code up a server performance feature than to
    struggle through stupid (server-motivated) package boundaries and get
    something done on the desktop. It is far easier to code up a server app
    as that space is well standardized and servers tend to be compartmented.
    Integration between server apps is much less common than integration
    between desktop apps, hence our modularization idiocies cause less
    harm there.

 2) Linux's growth is still feeding on the remains of the destruction of Unix.

Linux is struggling on the desktop due to the desktop's inherent complexity, 
due to the lack of the Unix inertia and due to incompetence, insensitivity, 
intellectual arrogance and shortsightedness of server-centric thinking, like 

But the thing is, Apple doesnt really care about the server space, yet. It is 
lucrative but it is a side-show: it will fall automatically to the 'winner' of 
the desktop (or gadget) of tomorrow.

Has the quick fall of Banyan Vines or Netware (both excellent all-around 
server products) taught you nothing?

We need a lot more desktop focus in the kernel community. The best method to 
achieve this, that i know of currently, is to simply have kernel developers 
think outside the kernel box and to have them do bits of user-space coding as 
well - and in particular desktop coding. To eat our own dogfood in essence. 
Suffer through crap we cause to user-space. To face the _real_ difficulties of 


As a maintainer you certainly dont have to write a single line of code, if you 
dont want to. You 'just' need to care about the big picture and encourage/help 
the flow and balance of the whole project.

	Ingo
--

From: Avi Kivity
Date: Monday, March 22, 2010 - 10:29 am

Agreed (minus the 'package boundaries' stuff).  Also, Linux is cheaper 

It's struggling because it isn't competitive technically with other 
desktops, because there is no application base, because of a 
chicken-and-egg problem with some drivers, because lack of a stable ABI 
means you can't get a driver CD with your device so you need a 
yet-unreleased kernel, because the zillion binary incompatible 
distributions mean that application developers don't know what to code 
and test for, because of lack of documentation, to name a few.  At least 
it's improving all the time.

The incompetence, insensitivity, intellectual arrogance and 
shortsightedness of server-centric thinking of my arguments/position are 

It won't automatically fall to Apple, there's tons of middleware and 
server apps that need porting (the "ecosystem"), plus they need to work 
hard on improving their kernel which is desktop oriented.  Looks like 

Not familiar with Banyan, but wasn't Netware a cooperative multitasking 
command line only thing?  It couldn't compete with preemptive modern 

Try it yourself and report the experience.  Note: perf is not desktop 

Not at all.  They have excellent development tools and lots of 
middleware and other third party products that make it easy to pick 
Windows.  For example, Exchange is more or less standard for groupware, 
and they made C# and the technology around it easy to develop for, 

I haven't written that line of code, and no one else has either.  Don't 
tell me they're all scared of me.

-- 
error compiling committee.c: too many arguments to function

--

From: Ingo Molnar
Date: Sunday, March 21, 2010 - 1:31 pm

Only if you apply it as a totalitarian rule.

Furthermore, the logical conclusion of _your_ line of argument (applied in a 
totalitarian manner) is that 'nothing should be built into the kernel'.

I.e. you are arguing for microkernel Linux, while you see me as arguing for a 
monolithic kernel.

Reality is that we are somewhere in between, we are neither black nor white:
it's shades of grey.

If we want to do a good job with all this then we observe subsystems, we see 
how they relate to the physical world and decide about how to shape them. We 
identify long-term changes and re-design modularization boundaries in 
hindsight - when we got them wrong initially. We dont try to rationalize the 
status-quo.

Lets see one example of that thought process in action: Oprofile.

We saw that the modularization of oprofile was a total nightmare: a separate 
kernel-space and a separate user-space component, which was in constant 
version friction. The ABI between them was stifling: it was hard to change it
(you needed to trickle that through the tool as well which was on a different
release schedule, etc. etc.)

The result was sucky usability that never went beyond some basic 'you can do 
profiling' threshold. The subsystem worked well within that design box, and it 
was worked on by highly competent people - but it was still far, far away from 
the potential it could have achieved.

So we observed those problems and decided to do something about it:

 - We unified the two parts into a single maintenance domain. There's
   the kernel-side in kernel/perf_event.c and arch/*/*/perf_event.c,
   plus the user-side in tools/perf/. The two are connected by a very
   flexible, forwards and backwards compatible ABI.

 - We moved much more code into the kernel, realizing that transparent
   and robust instrumentation should be offered instead of punting
   abstractions into user-space (which is in a disadvantaged position
   to implement system-wide abstractions).

 - We created a ...
--

From: Avi Kivity
Date: Sunday, March 21, 2010 - 2:30 pm

I'm certainly a minimalist, but that doesn't follow.  Things that 
require privileged access, or access to the page cache, or that can't be 
made to perform otherwise should certainly be in the kernel.  That's why 
I submitted kvm for inclusion in the first place.

If it's something that can work just as well in userspace but we can't 
be bothered to fix any 'deployment nightmares', then they shouldn't be 
in the kernel.  Examples include lvm2 and mdadm (which truly are 
'deployment nightmares' - you need to start them before you have access 

No. I'm arguing for reducing bloat wherever possible.  Kernel code is 

I'm not for the status quo either - I'm for reducing the kernel code 

That's useful because perf is still small.  If it were a full fledged 
350KLOC GUI, then most of the development would concentrate on the GUI 
and very little (relatively) would have to do with the kernel.

Qemu is in that state today.  Please, please look at the recent commits 
and check how many have actually anything to do with kvm, and how many 

No argument.

I have a similar experience with kvm.  The user/kernel break is at the 
cpu virtualization level - that is kvm is solely responsible for 
emulating a cpu and userspace is responsible for emulating devices.  An 
exception was made for the PIC/IOAPIC/PIT due to performance 
considerations - they are emulated in the kernel as well.
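
That split can be pictured as a dispatch loop: the kernel side runs the vcpu
until it hits something it does not emulate, then hands an "exit" to
userspace, which looks up the device model and resumes. The following is a toy
model in Python, not the real interface (which is an ioctl-based run loop with
exit reasons such as KVM_EXIT_IO and KVM_EXIT_MMIO); every class and function
name here is made up for illustration:

```python
from dataclasses import dataclass
from enum import Enum, auto


class ExitReason(Enum):
    IO = auto()    # guest touched an I/O port
    MMIO = auto()  # guest touched memory-mapped device space
    HLT = auto()   # guest halted


@dataclass
class VcpuExit:
    reason: ExitReason
    address: int = 0
    value: int = 0


class SerialPort:
    """Userspace device model: collects bytes written to one I/O port."""

    def __init__(self, port):
        self.port = port
        self.output = bytearray()

    def write(self, value):
        self.output.append(value & 0xFF)


def run_guest(exits, devices):
    """Handle exits until the guest halts; return how many were handled.

    'exits' stands in for what the kernel would hand back after each run of
    the vcpu; 'devices' maps I/O port numbers to userspace device models.
    """
    handled = 0
    for vcpu_exit in exits:
        if vcpu_exit.reason is ExitReason.HLT:
            break
        if vcpu_exit.reason is ExitReason.IO:
            device = devices.get(vcpu_exit.address)
            if device is not None:
                device.write(vcpu_exit.value)
        # MMIO exits are counted but not emulated in this toy model.
        handled += 1
    return handled
```

In these terms, the PIC/IOAPIC/PIT exception means those devices never show up
as exits in the userspace loop at all.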

A common FAQ is why do we not emulate real-mode instructions in qemu.  
The answer is that the interface to kvm would be insane - it would
emulate a partial cpu.  All other users of that interface would have to 
implement an emulator (there is also a practical argument - the qemu 

Excellent.  However qemu is written by developers for their users, and 
their users are not worried about an eject button in the qemu SDL 
interface, or about running the qemu command line by hand.  They have 
complicated management interfaces that do everything, so we concentrate, 
for example, on a robust RPC interface ...
--

From: Ingo Molnar
Date: Sunday, March 21, 2010 - 2:52 pm

1)

One of the primary design arguments of the micro-kernel design as well was to 
push as much into user-space as possible without impacting performance too 
much so you very much seem to be arguing for a micro-kernel design for the 
kernel.

I think history has given us the answer for that fight between microkernels 
and monolithic kernels.

Furthermore, to not engage in hypotheticals about microkernels: by your 
argument the Oprofile design was perfect (it was minimalistic kernel-space, 
with all the complexity in user-space), while perf was over-complex (which 
does many things in the kernel that could have been done in user-space).

Practical results suggest the exact opposite happened - Oprofile is being 
replaced by perf. How do you explain that?

2)

In your analysis you again ignore the package boundary costs and artifacts as 
if they didnt exist.

That was my main argument, and that is what we saw with oprofile and perf: 
while maintaining more kernel-code may be more expensive, it sure pays off for 
getting us a much better solution in the end.

And getting a 'much better solution' to users is the goal of all this, isnt 
it?

I dont mind what you call 'bloat' per se if it's for a purpose that users find
to be a good deal. I have quite a bit of RAM in most of my systems, having 50K
more or less included in the kernel image is far less important than having a 
healthy and vibrant development model and having satisfied users ...

	Ingo
--

From: Avi Kivity
Date: Sunday, March 21, 2010 - 11:49 pm

I am not arguing for a microkernel.  Again: reduce bloat where possible, 

I did not say that the amount of kernel and userspace code is the only 
factor deciding the quality of software.  If that were so, microkernels 
would have won out long ago.

It may be that perf has too much kernel code, and won against
oprofile despite that because it was better in other areas.  Or it may 
be that perf has exactly the right user/kernel division.  Or maybe perf 
needs some of the code moved from userspace to the kernel.  I don't 
know, I haven't examined the code.

The user/kernel boundary is only one metric for code quality.  Nor is it 
always in favour of pushing things to userspace.  Narrowing or 
simplifying an interface is often an argument in favour of pushing 
things into the kernel.

IMO the reason perf is more usable than oprofile has less to do with the 
kernel/userspace boundary and more to do with effort and attention spent

Package costs are real.  We need to bear them.  I don't think that 
because maintaining another package (and the interface between two 

I'm not worried about 50K or so, I'm worried about a bug in those 50K 
taking down the guest.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Ingo Molnar
Date: Monday, March 22, 2010 - 4:23 am

If you are interested in the first-hand experience of the people who are doing 
the perf work then here it is: by far the biggest reason for perf success and 
perf usability is the integration of the user-space tooling with the 
kernel-space bits, into a single repository and project.

The very move you are opposing so vehemently for KVM.

Oprofile went the way you proposed, and it was a failure. It failed not 
because it was bad technology (it was pretty decent and people used it), it 
was not a failure because the wrong people worked on it (to the contrary, very 
capable people worked on it), it was a failure in hindsight because it simply
was incorrectly split into two projects which stifled the progress of each other.

Obviously 3 years ago you'd have seen a similar, big "Oprofile is NOT broken!" 
flamewar, had i posted the same observations about Oprofile that i expressed 
about KVM here. (In fact there was a similar, big flamewar about all this when 
perf was posted a year ago.)

And yes, (as you are aware of) i see very similar patterns of inefficiency in 
the KVM/Qemu tooling relationship as well, hence did i express my views about 
it.

Thanks,

	Ingo
--

From: Avi Kivity
Date: Monday, March 22, 2010 - 5:49 am

Please take a look at the kvm integration code in qemu as a fraction of 


Every project that has some kernel footprint, except perf, is split like 
that.  Are they all failures?

Seems like perf is also split, with sysprof being developed outside the 
kernel.  Will you bring sysprof into the kernel?  Will every feature be 
duplicated in perf and sysprof?

-- 
error compiling committee.c: too many arguments to function

--

From: Pekka Enberg
Date: Monday, March 22, 2010 - 6:01 am

Hi Avi,


I am glad you brought it up! Sysprof was historically outside of the
kernel (with its own kernel module, actually). While the GUI was
nice, it was much harder to set up compared to oprofile so it wasn't
all that popular. Things improved slightly when Ingo merged the custom
kernel module but the _userspace_ part of sysprof was lagging behind a
bit. I don't know what's the situation now that they've switched over
to perf syscalls but you probably get my point.

It would be nice if the two projects merged but I honestly don't see
any fundamental problem with two (or more) co-existing projects.
Friendly competition will ultimately benefit the users (think KDE and
Gnome here).

                        Pekka
--

From: Ingo Molnar
Date: Monday, March 22, 2010 - 7:54 am

See my previous mail - what i see as the most healthy project model is to have 
a full solution reference implementation, connected to a flexible halo of 
plugins or sub-apps.

Firefox does that, KDE does that, and Gnome as well to a certain degree.

The 'halo' provides a constant feedback of new features, and it also provides 
competition and pressure on the 'main' code to be top-notch.

The problem i see with KVM is that there's no reference implementation! There 
is _only_ the KVM kernel part which is not functional in itself. Surrounded by 
a 'halo' - where none of the entities is really 'the' reference implementation 
we call 'KVM'.

This causes constant quality problems as the developers of the main project 
dont have constant pressure towards good quality (it is not their 
responsibility to care about user-space bits after all), plus it causes a lack 
of focus as well: integration between (friendly) competing user-space 
components is a lot harder than integration within a single framework such as 
Firefox.

I hope this explains my points about modularization a bit better! I suggested 
KVM to grow a user-space tool component in the kernel repo in tools/kvm/, 
which would become the reference implementation for tooling. User-space 
projects can still provide alternative tooling or can plug into this tooling, 
just like they are doing it now. So the main effect isnt even on those 
projects but on the kernel developers. The ABI remains and all the user-space 
packages and projects remain.

Yes, i thought Qemu would be a prime candidate to be the baseline for 
tools/kvm/, but i guess that has become socially impossible now after this 
flamewar. It's not a big problem in the big scheme of things: tools/kvm/ is 
best grown up from a small towards larger size anyway ...

Thanks,
 
	Ingo
--

From: Avi Kivity
Date: Monday, March 22, 2010 - 12:04 pm

The reference implementation is qemu-kvm.git, in the future qemu.git.  
Like the reference implementation of device-mapper is 

The developers of the main project are very much aware that users don't 


Seems like wanton duplication of effort.  Can we throw so many 
developer-years away on duplicate projects?  Assuming not all are true 

Qemu is open source, you can cp it into tools/kvm.  Rewriting it from 
scratch is a mammoth effort, there's a reason kvm, Xen, and virtualbox 
all use qemu.  Qemu itself copied code from bochs.  Writing this stuff 
is hard, especially if there is something already working.

You'll probably get much better threading (the qemu device model is 
still single threaded), but it will take years to reach where qemu is 
already at.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Olivier Galibert
Date: Tuesday, March 23, 2010 - 2:46 am

I'm curious, where would you put the limit?  Let's imagine a tools/kvm
appears, be it qemu or not, that's outside the scope of my question.
Would you put the legacy PC bios in there (seabios I guess)?  The EFI
bios? The windows-compiled paravirtual drivers? The Xorg paravirtual
DDX ?  Mesa (which includes the pv gallium drivers)? The
libvirt-equivalent? The GUI?

That's not a rhetorical question btw, I really wonder where the limit
should be.

  OG.
--

From: Ingo Molnar
Date: Monday, March 22, 2010 - 7:47 am

You have to admit that much of Qemu's past 2-3 years of development was 
motivated by Linux/KVM (i'd say more than 50% of the code). As such it's one 
and the same code base - you just continue to define Qemu to be different from 
KVM.

I very much remember what Qemu looked like _before_ KVM: it was a struggling,

Would you accept (or at least not NAK) a new tools/kvm/ tool that builds 
tooling from the ground up, while leaving Qemu untouched? [assuming it's all
clean code, etc.]

Although i have doubts about how well that would work 'against' your opinion: 
such a tool would need lots of KVM-side features and a positive attitude from 

No. Did i ever claim KVM was a failure? I said it's hindered by this design 
aspect.


I'd prefer if sysprof merged into perf as 'perf view' - but its maintainer 
does not want that - which is perfectly OK. So we are building equivalent 
functionality into perf instead.

Think about it like Firefox plugins: the main Firefox project picks up the 
functionality of the most popular Firefox plugins all the time. Session Saver, 
Tab Mix Plus, etc. were all in essence 'merged' (in functionality, not in 
code) into the 'reference' Firefox project.

I think that's a fundamentally healthy model: it allows extensions and thus 
gives others an honest chance to show that you are potentially coding an
inferior piece of code - but also express a clear opinion about what you 
consider a full, usable, high-quality reference implementation and constantly 
improve this reference implementation.

I dont think that can be argued to be a bad model. Yes, it takes a bit of 
thinking outside the box to do tools/kvm/ but of all people i'd expect some of 
that from you.

	Ingo
--

From: Avi Kivity
Date: Monday, March 22, 2010 - 11:15 am

It's not the same code base.  kvm provides a cpu virtualization service, 
qemu uses it.  There could be other users.  qemu could go away one day 


I couldn't NAK tools/kvm any more than I could NAK a new project outside 
the kernel repository.  IMO it would be duplicated effort, but like I 
mentioned before, I can't tell volunteers what to do, only recommend 

Functionality that can be implemented in userspace will not be accepted 
into kvm unless there are very good reasons why it should be.  Things 



There's a difference between absorbing a small plugin and duplicating a 
project.

-- 
error compiling committee.c: too many arguments to function

--

From: Joerg Roedel
Date: Monday, March 22, 2010 - 4:10 am

Since you are talking so much about oProfile in this thread I think it
is important to mention that the problem with oProfile was not the
repository separation.

The problem was (and is) that the kernel and the user-space parts are
maintained by different people who dont talk to each other or have a
direction where they want to go with the project. Basically the reason
of the oProfile failure is a dysfunctional community. I told the
kernel-maintainer several times to also maintain user-space but he
didn't want that.

The situation with KVM is entirely different. Avi commits to kvm.git and
qemu-kvm.git so he maintains both. Anthony is working to integrate the
qemu-kvm changes into upstream qemu. Further these people work very
closely together and the community around KVM works well too. The
problems that oProfile has are not even in sight for KVM.

	Joerg

--

From: Ingo Molnar
Date: Monday, March 22, 2010 - 5:22 am

Caused by: repository separation and the inevitable code and social fork a 

Caused by: repository separation and the inevitable code and social fork a 

Caused by: repository separation and the inevitable code and social fork a 

What you fail to realise (or what you fail to know, you werent around when 
Oprofile was written, i was) is that Oprofile _did_ have a functional single 
community when it was written. The tooling and the kernel bits were written by
the same people.

But a decade is a long time and the drift happened due to the inevitability of 
the repository separation, and due to the _inability_ to reach a sane, usable 
solution within that framework of separation.

So i dont see much of a difference to the Oprofile situation really and i see 
many parallels. I also see similar kinds of desktop usability problems.

The difference is that we dont have KVM with a decade of history and we dont 
have a 'told you so' KVM reimplementation to show that proves the point. I 
guess it's a matter of time before that happens, because Qemu usability is so 
abysmal today - so i guess we should suspend any discussions until that
happens, no need to waste time on arguing hypotheticals.

I think you are rationalizing the status quo.

It's as if you argued in 1990 that the unification of East and West Germany 
wouldnt make much sense because despite clear problems and incompatibilities
and different styles westerners were still allowed to visit eastern relatives 
and they both spoke the same language after all ;-)

Thanks,

	Ingo
--

From: Joerg Roedel
Date: Monday, March 22, 2010 - 6:46 am

No, the split-repository situation was the smallest problem after all.
It was a community thing. If the community doesn't work a single-repo
project will also fail. Look at the state of the alpha arch in Linux
today, it is maintained in one repository but nobody really cares about
it. Thus it is miles behind most other archs Linux supports today in

Yes, this was probably the time when everybody was enthusiastic about
the feature and they could attract lots of developers. But situation

The difference is that KVM has a working community with good developers

We actually have lguest which is small. But it lacks functionality and

I see that there are issues with KVM today in some areas. You pointed
out the desktop usability already. I personally have trouble with the
qemu-kvm.git because it is unbisectable. But repository unification
doesn't solve the problem here.
The point for a single repository is that it simplifies the development
process. I agree with you here. But the current process of KVM is not
too difficult after all. I don't have to touch qemu sources for most of

Um, hmm. I don't think these situations have enough in common to compare
them ;-)

	Joerg



--

From: Ingo Molnar
Date: Monday, March 22, 2010 - 9:32 am

I dont know how you can find the situation of Alpha comparable, which is a 
legacy architecture for which no new CPU was manufactured in the past ~10
years.

The negative effects of physical obsolescence cannot be overcome even by the
very best of development models ...


So, what do you think creates code communities and keeps them alive? 
Developers and code. And the wellbeing of developers is primarily influenced
by the repository structure and by the development/maintenance process - i.e. 
by the 'fun' aspect. (i'm simplifying things there but that's the crux of it.)

So yes, i do claim that what stifled and eventually killed off the Oprofile
community was the split repository. None of the other Oprofile shortcomings 
were really unfixable, but this one was. It gave no way for the community to 
grow in a healthy way, after the initial phase. Features were more difficult 
and less fun to develop.

And yes, there were times when there was still active Oprofile development but 
the development process warning signs should have been noticed, and the 
community could have been kept alive by unification and similar measures. 
Instead what happened was a complete rewrite and a competitive replacement by 
perf. (Which isnt particularly nice to users btw. - they prefer more gradual 
transitions - but there was no other option, so many problems accumulated in 
Oprofile.)

I simply do not want to see KVM face the same fate, and yes i do see similar 


Oprofile certainly had good developers and maintainers as well. In the end it 
wasnt enough ...

Also, a project can easily still be 'alive' but not reach its full potential. 

Why do you assume that my argument means that KVM isnt viable today? It can 
very well still be viable and even healthy - just not _as healthy_ as it could 

I suggested long ago to merge lguest into KVM to cover non-VMX/non-SVM 

Why doesnt it solve the bisectability problem? The kernel repo is supposed to 

In my judgement you'd have to do ...
--

From: Frank Ch. Eigler
Date: Monday, March 22, 2010 - 10:17 am

In your very previous paragraphs, you enumerate two separate causes:
"repository structure" and "development/maintenance process" as being
sources of "fun".  Please simply accept that the former is considered
by many as absolutely trivial compared to the latter, and additional
verbose repetition of your thesis will not change this.

- FChE
--

From: Pekka Enberg
Date: Monday, March 22, 2010 - 10:27 am

Hi Frank,


I can accept that many people consider it trivial but the problem is
that we have _real data_ on kmemtrace and now perf that the number of
contributors is significantly smaller when your code is outside the
kernel repository. Now admittedly both of them are pretty intimate
with the kernel but Ingo's suggestion of putting kvm-qemu in tools/ is
an interesting idea nevertheless.

It's kinda funny to see people argue that having an external
repository is not a problem and that it's not a big deal if building
something from the repository is slightly painful as long as it
doesn't require a PhD when we have _real world_ experience that it
_does_ limit developer base in some cases. Whether or not that applies
to kvm remains to be seen but I've yet to see a convincing argument
why it doesn't.

                        Pekka
--

From: Avi Kivity
Date: Monday, March 22, 2010 - 10:32 am

qemu has non-Linux developers.  Not all of their contributions are 
relevant to kvm but some are.  If we pull qemu into tools/kvm, we lose them.

-- 
error compiling committee.c: too many arguments to function

--

From: Ingo Molnar
Date: Monday, March 22, 2010 - 10:39 am

Qemu had very few developers before KVM made use of it - i know it because i
followed the project prior to KVM.

So whatever development activity Qemu has today, it's 99% [WAG] attributable
to KVM. It might have non-Linux contributors, but they wouldnt be there if it 
wasnt for all the Linux contributors ...

Furthermore, those contributors wouldnt have to leave - they could simply use 
a different Git URI ...

	Ingo
--

From: Avi Kivity
Date: Monday, March 22, 2010 - 10:58 am

tools/kvm would drop support for non-Linux hosts, for tcg, and for 
architectures which kvm doesn't support ("clean and minimal").  That 
would be the real win, not sharing the repository.  But those other 
contributors would just stay with the original qemu.

-- 
error compiling committee.c: too many arguments to function

--

From: Pekka Enberg
Date: Monday, March 22, 2010 - 10:52 am

Hi Avi,


Yeah, you probably would but the hypothesis is that you'd end up with
a bigger net developer base for the _Linux_ version. Now you might not
think that's important but I certainly do and I think Ingo does as
well. ;-)

That said, pulling 400 KLOC of code into the kernel sounds really
excessive. Would we need all that if we just do native virtualization
and no actual emulation?

                       Pekka
--

From: Avi Kivity
Date: Monday, March 22, 2010 - 11:04 am

You're probably correct, but the point is that non-Linux developers also 

What is native virtualization and no actual emulation?

-- 
error compiling committee.c: too many arguments to function

--

From: Pekka Enberg
Date: Monday, March 22, 2010 - 11:10 am

What I meant with "actual emulation" was running architecture A code
on architecture B, which was qemu's traditional use case. So the
question was how much of the 400 KLOC do we need for just KVM on all
the architectures that it supports?
--

From: Avi Kivity
Date: Monday, March 22, 2010 - 11:55 am

qemu is 620 KLOC.  Without cpu emulation that drops to ~480 KLOC.  Much 
of that is device emulation that is not supported by kvm now (like ARM) 
but some might be needed again in the future (like ARM).

x86-only is perhaps 300 KLOC, but kvm is not x86 only.

And that is with a rudimentary GUI.  GUIs are heavy.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Ingo Molnar
Date: Monday, March 22, 2010 - 10:43 am

Yeah.

Also, if in fact the claim that the 'repository does not matter' is true then 
it doesnt matter that it's hosted in tools/kvm/ either, right?

I.e. it's a win-win situation. Worst-case nothing happens beyond a Git URI 
change. Best-case the project is propelled to never seen heights due to 
contribution advantages not contemplated and not experienced by the KVM guys 
before ...

	Ingo
--

From: Avi Kivity
Date: Monday, March 22, 2010 - 11:02 am

Again, the second it's moved to tools/kvm/ we strip it of anything that

You're exaggerating.  There were 773 commits into qemu.git (excluding 
qemu-kvm.git) in the past three months.  162 for the same period for 
tools/perf.  The pool is not that deep.
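
Counts like these can be reproduced from a local checkout. The following is a minimal sketch, assuming the figure is the number of commits reachable from HEAD within a time window; the thread does not say exactly how the numbers above were obtained, and `count_commits` is a hypothetical helper name.

```shell
#!/bin/sh
# count_commits REPO SINCE [PATH...]
# Print the number of commits reachable from HEAD in REPO that are newer
# than SINCE, optionally limited to commits touching the given PATHs.
# Assumption: this is only one plausible way to arrive at such a figure.
count_commits() {
    repo=$1
    since=$2
    shift 2
    git -C "$repo" rev-list --count --since="$since" HEAD -- "$@"
}
```

For example, `count_commits ~/src/qemu '3 months ago'` would count all recent commits in a qemu checkout, and `count_commits ~/src/linux '3 months ago' tools/perf` would limit the count to commits touching tools/perf (both paths are placeholders).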

-- 
error compiling committee.c: too many arguments to function

--

From: Avi Kivity
Date: Monday, March 22, 2010 - 10:44 am

There is nothing fun about having one repository or two.  Who cares 
about this anyway?

tools/kvm/ probably will draw developers, simply because of the glory 
associated with kernel work.  That's a bug, not a feature.  It means 
that effort is not distributed according to how it's needed, but because 

The number of kvm and qemu developers keeps increasing.

We're having a kvm forum in August where we all meet.  Come and see for 

Rusty posted some initial patches for pv-only kvm but he lost interest 
before they were completed.  No one followed up.

btw, lguest has a single repository, userspace and kernel in the same 

These days qemu-kvm.git is bisectable (though not always trivially).  

Something I've wanted for a long time is to port kvm_stat to use 
tracepoints instead of the home-grown instrumentation.  But that is 
unrelated to this new tracepoint.  Other than that we're satisfied with 

There are plenty of un-fun tasks (like fixing bugs and providing RAS 
features) that we're doing.  We don't do this for fun but to satisfy our 
users.

-- 
error compiling committee.c: too many arguments to function

--

From: Ingo Molnar
Date: Monday, March 22, 2010 - 12:10 pm

And yet your solution to that is to ... do all your work in the kernel space 

Despite it being another in-kernel subsystem that by your earlier arguments 

So which one is it, KVM developers are volunteers that do fun stuff and cannot 
be told about project priorities, or KVM developers are pros who do unfun 
stuff because they can be told about priorities?

I posit that it's both: and that priorities can be communicated - if only you 
try as a maintainer. All i'm suggesting is to add 'usable, unified user-space' 
to the list of unfun priorities, because it's possible and because it matters.

	Ingo
--

From: Anthony Liguori
Date: Monday, March 22, 2010 - 12:18 pm

I've spent the past few months dealing with customers using the 
libvirt/qemu/kvm stack.  Usability is a major problem and is a top 
priority for me.  That is definitely a shift but that occurred before 
you started your thread.

But I disagree with your analysis of what the root of the problem is.  
It's a very kernel centric view and doesn't consider the interactions 
between userspace.

Regards,


--

From: Avi Kivity
Date: Monday, March 22, 2010 - 12:23 pm

I have done plenty of userspace work in qemu.  I don't have a lack of 
interest in qemu, just in a desktop GUI.  I'm not a GUI person and my 
employer doesn't have a desktop-on-desktop virtualization product that I 

I'm satisfied with it as a user.  Architecturally, I'd have preferred it 
to be a userspace tool.  It might have improved usability as well to 
have something with --help instead of a set of debugfs files.  But I'm a 

From my point of view as maintainer, all contributors are volunteers, I
can't tell any of them what to do.  From the point of view of many of
these volunteers' employers, they are wage slaves who do as they're told
or else.

So: when someone sends me a patch I gratefully accept if it is good or 
point out the issues if not.  At the secret Red Hat headquarters and the 
kvm weekly conference call I participate in deciding priorities and task 

So: I require a volunteer to write some GUI code before I accept a 
patch.  Back at the Red Hat lair, we think of what features we drop from 
the product because the kvm maintainer has gone nuts.

The 'unified' part of your suggestion is not a requirement, but an 
implementation detail.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Andrea Arcangeli
Date: Monday, March 22, 2010 - 12:28 pm

IMHO blaming anybody for it but qemu maintainership is very
unfair. They intentionally reinvented a less self-contained,
inferior, underperforming, underfeatured wheel instead of doing the
right thing and just making sure that it is as self-contained as
possible to avoid risking destabilizing their existing codebase. What
can anybody (without qemu git commit access) do about it unless the qemu
git maintainer changes attitude, dumps its qemu/kvm-all.c nonsense for
good, and does the right thing so we can unify for real?

We need to move forward, including multithreading the qemu core and being
ready to include desktop virtualization protocols when they're ready
for submission without being suggested to extend vnc instead to gain a
similar speedup (i.e. yet another inferior wheel).

Unification means that _all_ qemu users, pure research, theoretical
interest, Xen, virtualbox, weird pure software architecture, will be
able to push their stuff in for the common good, but that also shall
apply to KVM! It has to become clear that reinventing inferior wheels
instead of merging the real thing, is absolutely time wasteful,
unnecessary, and it won't make any difference as far as KVM is
concerned, proof is that 0% of userbase runs qemu git to run KVM
(except the kvm-all.c developers to test it perhaps or somebody by
mistake not adding -kvm prefix to command line maybe). I don't pretend
to rate KVM as more important than all the rest of niche usages for
qemu but it shall be _as_ important as the rest and it'd be nice one
day to be able to install only qemu on a system and get something
actually usable in production.

I very much like that qemu gets contributions from everywhere, it's
also nice it can run without KVM (that is purely useful as a
debugging tool to me but still...). I think it can all happen and
unification should be the objective, for the gain of everyone in both
qemu/kvm and even xen and all the rest.
--

From: Joerg Roedel
Date: Monday, March 22, 2010 - 12:20 pm

The maintainers of that architecture could at least continue to maintain
it. But that is not the case. Most newer syscalls are not available and
overall stability on alpha sucks (kernel crashed when I tried to start
Xorg for example) but nobody cares about it. Hardware is still around

Right. A living community needs developers that write new code. And the
repository structure is one important thing. But in my opinion it is not
the most important one. In my 3-4 years of experience in the kernel
community I have found that the maintainers are the most
important factor. I find a maintainer not committing or caring about
patches or not releasing new versions much worse than the wrong
repository structure.
oProfile has this problem with its userspace part. I partly had this
bad experience with x86-64 before the architecture merge. KVM does not

The biggest problem oProfile has is that it does not support per-process
measuring. This is indeed not unfixable but it also doesn't fit well in

In fact, the development process in KVM has improved over time. In the
early days everything was kept in svn. Avi switched to git at some
point, but at the time when we had these kvm-XX releases both kernel- and
user-space together were unbisectable. This has improved to a point
where the kernel-part could be bisected. The KVM maintainers and
community have shown in the past that they can address problems with the


That would have been the best. Rusty already started this work and

Because Marcelo and Avi try to keep as close to upstream qemu as
possible. So the qemu repo is regularly merged in qemu-kvm and if you
want to bisect you may end up somewhere in the middle of the qemu
repository which has only very minimal kvm-support.
The problem here is that two qemu repositories exist. But the current
effort of Anthony is directed to create a single qemu repository. But
that's not done overnight.
Merging qemu into the kernel would make Linus in fact a qemu maintainer.

True. Tools for ...
--

From: Avi Kivity
Date: Monday, March 22, 2010 - 12:28 pm

It's in fact possible to bisect qemu-kvm.git.  If you end up in 
qemu.git, do a 'git bisect skip'.  If you end up in a merge, call the 
merge point A, bisect A^1..A^2, each time merging A^1 before compiling 
(the merge is always trivial due to the way we do it).

Not fun, but it works.  When we complete merging kvm integration into 
qemu.git, this problem will disappear.
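
The procedure above can be partly automated. What follows is a minimal sketch of the 'skip' half only, assuming a local ref that tracks upstream qemu.git (the name `qemu/master` is hypothetical) and a hypothetical `BUILD_AND_TEST` command. It relies on `git bisect run` treating exit status 125 as "skip this commit".

```shell
#!/bin/sh
# Helper functions for `git bisect run` in a qemu-kvm.git checkout.
# Assumption: UPSTREAM names a ref tracking upstream qemu.git; the default
# "qemu/master" is a placeholder, not a name taken from the thread.

# Print "skip" if COMMIT is part of plain upstream qemu history (it is an
# ancestor of the upstream ref), otherwise print "test". If the upstream
# ref does not exist, the commit is treated as "test".
classify_commit() {
    commit=$1
    upstream=${2:-${UPSTREAM:-qemu/master}}
    if git merge-base --is-ancestor "$commit" "$upstream" 2>/dev/null; then
        echo skip
    else
        echo test
    fi
}

# One bisect step: status 125 asks `git bisect run` to skip the commit,
# otherwise the status of the build-and-test command decides good/bad.
bisect_step() {
    if [ "$(classify_commit HEAD)" = skip ]; then
        return 125
    fi
    # BUILD_AND_TEST is a hypothetical command; plain `make` is the fallback.
    sh -c "${BUILD_AND_TEST:-make}"
}
```

With the functions saved in a file such as bisect-helper.sh (a placeholder name), a run might look like `git bisect start <bad> <good>` followed by `git bisect run sh -c '. ./bisect-helper.sh; bisect_step'`. The merge case (bisecting A^1..A^2 while merging A^1 before each build) is not covered and stays manual.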

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Ingo Molnar
Date: Monday, March 22, 2010 - 12:49 pm

You are arguing why maintainers do not act as you suggest, against the huge
negative effects of physical obsolescence?

Please use common sense: they dont act because ... there are huge negative
effects due to physical obsolescence?

No amount of development model engineering can offset that negative.

Thanks,

	Ingo
--

From: Anthony Liguori
Date: Sunday, March 21, 2010 - 4:35 pm

The solution should be a long lived piece of code that runs without 
kernel privileges.  How the code is delivered to the user is a separate 
problem.

If you want to argue that the kernel should build an initramfs that 
contains some things that always should be shipped with the kernel but 
don't need to be within the kernel, I think that's something that's long
overdue.

We could make it a kernel thread, but what's the point?  It's much safer 
for it to be a userspace thread and it doesn't need to interact with the 
kernel in an intimate way.

Regards,

Anthony Liguori

--

From: Avi Kivity
Date: Saturday, March 20, 2010 - 12:35 am

I did suggest a symbol server, and using a well-known location, though 
I'm unhappy with it.  Multiple guest management should be done by the 


I am comfortable with having someone I trust maintain qemu.  While 
sometimes Anthony overrides me on issues where I know I'm right and he's 
wrong, still I prefer that to having to do everything myself, I would 
surely do a worse job due to overload.

If you actually look at qemu patches, the vast majority have little to do
directly with kvm; and I (along with Marcelo) maintain the kvm 

That wouldn't change at all if I were to maintain it, since I wouldn't 
start writing a GUI for it and wouldn't force other contributors to do 

So, do you think a reply to a patch along the lines of

   NAK.  Improving scalability is pointless while we don't have a decent
   GUI.  I'll review your RCU patches _after_ you've contributed a usable
   GUI.


For a given area, yes.  It makes sense to clean up code before changing 
it, otherwise cruft accumulates rapidly.  What you're describing is 
completely different and amounts to total disregard of contributors' 

The general health of qemu in terms of code quality was indeed pretty 
bad and there was (and is) a massive effort to modernise it.  If you're 
interested look at qdev and qmp.  Both are efforts to improve the 
infrastructure rather than add features on rotten code, and very 
successful IMO.  There was no effort to write a GUI since no one appears 


If there were no capable maintainer I would reluctantly step in.  That 
is not the case.  If I were to displace Anthony then qemu quality would 
suffer, or I would have to drop kvm maintainership, or, if some false 

Neither do you.  At least I have spent enough time among real usability 
people to know this.  I don't have any pretences in this area and am 
happy to leave it to the experts.  As infrastructure projects kvm and 
qemu should concentrate on providing flexible capabilities to consumers, 
which then expose it to users.  ...
--

From: Ingo Molnar
Date: Sunday, March 21, 2010 - 12:06 pm

What does this have to do with RCU?

I'm talking about KVM, which is a Linux kernel feature that is useless without 
a proper, KVM-specific app making use of it.

RCU is a general kernel performance feature that works across the board. It 
helps KVM indirectly, and it helps many other kernel subsystems as well. It 
needs no user-space tool to be useful.

KVM on the other hand is useless without a user-space tool.

[ Theoretically you might have a fair point if it were a critical feature of 
  RCU for it to have a GUI, and if the main tool that made use of it sucked. 
  But it isnt and you should know that. ]

Had you suggested the following 'NAK', applied to a different, relevant 
subsystem:

  |   NAK.  Improving scalability is pointless while we don't have a usable
  | tool.  I'll review your perf patches _after_ you've contributed a usable
  | tool.

you would have a fair point. In fact, we are doing that, we are living by that.
It makes absolutely zero sense to improve the scalability of perf if its 
usability sucks.

So where you are trying to point out an inconsistency in my argument there is 

That is my precise point.

KVM is a specific subsystem or "area" that makes no sense without the 
user-space tooling it relates to. You seem to argue that you have no 'right' 
to insist on good quality of that tooling - and IMO you are fundamentally 
wrong with that.

Thanks,

	Ingo
--

From: Avi Kivity
Date: Sunday, March 21, 2010 - 1:22 pm

The example was rcuifying kvm which took place a bit ago.  Sorry, it 

Correct.  So should I tell someone that has sent a patch that rcu-ified
kvm in order to scale it, that I won't accept the patch unless they do
some usability userspace work?  Say, implementing an eject button?

That might hold, but the tool is usable at least for some people - it 
runs in production.  The people running it won't benefit from an eject 
button or any usability improvement since they run it through a 
centralized management tool that hides everything.  They will benefit 
from the scalability patches.  Should I still make those patches 

kvm contains many sub-areas.  I'm not going to tie unrelated things
together like the GUI and scalability, configuration file format and
emulator correctness, nested virtualization and qcow2 asynchronicity, or
other crazy combinations.  People either leave en masse or become
frustrated if they can't.  I do reject patches touching a sub-area that
I think needs to be done in userspace, for example.

That's not to say kvm development is random.  We have a weekly 
conference call where regular contributors and maintainers of both qemu 
and kvm participate and where we decide where to focus.  Sadly the issue 
of a qemu GUI is not raised often.  Perhaps you can participate and 
voice your concerns.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Ingo Molnar
Date: Sunday, March 21, 2010 - 1:55 pm

Of course you could say the following:

  ' Thanks, I'll mark this for v2.6.36 integration. Note that we are not
    able to add this to the v2.6.35 kernel queue anymore as the ongoing 
    usability work already takes up all of the project's maintainer and 
    testing bandwidth. If you want the feature to be merged sooner than that 
    then please help us cut down on the TODO and BUGS list that can be found 
    at XYZ. There's quite a few low hanging fruits there. '

Although this RCU example is the 'worst' possible example, as it's a pure speedup
change with no functionality effect.

Consider the _other_ examples that are a lot more clear:

   ' If you expose paravirt spinlocks via KVM please also make sure the KVM
     tooling can make use of it, has an option for it to configure it, and 
     that it has sufficient efficiency statistics displayed in the tool for 
     admins to monitor.'

   ' If you create this new paravirt driver then please also make sure it can
     be configured in the tooling. '

   ' Please also add a testcase for this bug to tools/kvm/testcases/ so we dont
     repeat this same mistake in the future. '

I'd say most of the high-level feature work in KVM has tooling impact.

And note the important argument that the 'eject button' thing would not occur
naturally in a project that is well designed and has a good quality balance. 
It would only occur in the transitionary period if a big lump of lower-quality 
code is unified with higher-quality code. Then indeed a lot of pressure gets 
created on the people working on the high-quality portion to go over and fix 
the low-quality portion.

Which, btw., is an unconditionally good thing ...

But even an RCU speedup can be fairly linked/ordered to more pressing needs of 
a project.

Really, the unification of two tightly related pieces of code has numerous 
clear advantages. Please give it some thought before rejecting it.

Thanks,

	Ingo
--

From: Avi Kivity
Date: Sunday, March 21, 2010 - 2:42 pm

That would be shooting at my own foot as well as the contributor's since 
I badly want that RCU stuff, and while a GUI would be nice, that itch 
isn't on my back.

You're asking a developer and a maintainer to put off the work they're 
interested in, in order to work on something someone else is interested 

All three happen quite commonly in qemu/kvm development.  Of course 
someone who develops a feature also develops a patch that exposes it in 

Usually, pretty low.  Plumbing down a feature is usually trivial.  There 
are exceptions, of course - smp is only supported in qemu-kvm.git, not 
in upstream qemu.git, for example.  In any case of course the work is 
done in both qemu and kvm - do you think people develop features to see 



I'm not blind to the advantages.  Dropping tcg would be the biggest of 
them by far (much more than moving the repository, IMO).  But there are 
disadvantages as well.

Around two years ago I seriously considered forking qemu; at this time I
do not think it is a good idea.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Ingo Molnar
Date: Sunday, March 21, 2010 - 2:54 pm

I think this sums up the root cause of all the problems i see with KVM pretty 
well.

Thanks,

	Ingo
--

From: Anthony Liguori
Date: Sunday, March 21, 2010 - 5:16 pm

A good maintainer has to strike a balance between asking more of people 
than what they initially volunteer and getting people to implement the 
less fun things that are nonetheless required.  The kernel can take this 
to an extreme because at the end of the day, it's the only game in town 
and there is an unending number of potential volunteers.  Most other 
projects are not quite as fortunate.

When someone submits a patch set to QEMU implementing a new network 
backend for raw sockets, we can push back about how it fits into the 
entire stack wrt security, usability, etc.  Ultimately, we can arrive at 
a different, more user friendly solution (networking helpers) and along 
with some time investment on my part, we can create a much nicer, more 
user friendly solution.  Still command line based though.

Responding to such a patch set with, replace the SDL front end with a 
GTK one that lets you graphically configure networking, is not 
reasonable and the result would be one less QEMU contributor in the long 
run.

Over time, we can, and are, pushing people to focus more on usability.
But that doesn't get you a first class GTK GUI overnight.  The only way
you're going to get that is by having a contributor be specifically
interested in building such a thing.

We simply haven't had that in the past 5 years that I've been involved 
in the project.  If someone stepped up to build this, I'd certainly 
support it in every way possible and there are probably some steps we 
could take to even further encourage this.

Regards,

Anthony Liguori

--

From: Ingo Molnar
Date: Monday, March 22, 2010 - 4:59 am

Sorry to be blunt, but i dont think there's a different way to say it: i am a 
user of the software you are maintaining (Qemu) and i dont think you have the 
basis to educate people about what a good maintainer should do to achieve a 
quality end result.

I think you could/should learn much from Linus and others who very much 
require quality to permeate the full dimension of a contribution (including 
usability), beyond the narrow, local scope of the contribution.

Thanks,

	Ingo
--

From: Avi Kivity
Date: Monday, March 22, 2010 - 12:13 am

I think we agree at last.  Neither I nor my employer are interested in 
running qemu as a desktop-on-desktop tool, therefore I don't invest any 
effort in that direction, or require it from volunteers.

If you think a good GUI is so badly needed, either write one yourself, 
or convince someone else to do it.

(btw, why are you interested in desktop-on-desktop?  one use case is
developers, who don't really need fancy GUIs; a second is people who
test out distributions, but that doesn't seem to be a huge population;
and a third is people running Windows for some application that doesn't
run on Linux - hopefully a small category as well.  Seems to be quite a
small target audience, compared to, say, video editing)

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Ingo Molnar
Date: Monday, March 22, 2010 - 4:14 am

Obviously your employer at least in part defers to you when it comes to KVM 
priorities.

So, just to make this really clear, _you_ are not interested in running qemu 
as a desktop-on-desktop tool, subsequently this kind of 
disinterest-for-desktop-usability trickled through the whole KVM stack and 
poisoned your attitude and your contributors' attitude.

Too sad really and it's doubly sad that you dont feel anything wrong about 

To a certain degree we are trying to do a small bit of that (see this very 
thread) - and you are NAK-ing and objecting the heck out of it via your 
unreasonable microkernelish and server-centric views.

With constant maintainer disinterest there's no wonder a non-desktop-oriented 
KVM becomes a self-fulfilling prophecy: you think the desktop does not matter, 
hence it becomes a reality in KVM space which you can constantly refer back to 
as a 'fact'.


I'm interested in desktop-on-desktop because i walk this world with open eyes 
and i care about Linux, and these days qemu-kvm is the first thing a new Linux 
user sees about Linux virtualization. I've observed several people i know in 
person to turn away from Linux and go back to Windows or go over to Apple 
because they had a much more mature solution.

I'd probably turn away from Linux myself if i were a newbie and if i were 
forced to use KVM on the desktop today.

Again, you dont seem to realize that you as a maintainer are at a central 
point where you have the ability to turn the self-fulfilling prophecy that 
'nobody cares about the Linux desktop' into a reality - or where you have the 
ability to prevent it from happening. It is a hugely harmful process, especially
as you seem to delude yourself that you have nothing to do with it.

Anyway, it's good you expressed your views about this as this will help the 
chances of a fresh restart. (which chances are still not too good though)

Thanks,

	Ingo
--

From: Alexander Graf
Date: Monday, March 22, 2010 - 4:23 am

Please, don't jump to unjust conclusions.

The whole point is that there's no money behind desktop-on-desktop
virtualization. Thus nobody pays people to work on it. Thus nothing
significant happens in that space.

If there was someone standing up to create a really decent desktop qemu
front-end I'm confident we'd even officially suggest using that. In fact,
that whole discussion did come up in the weekly Qemu/KVM community call
and everybody agreed heavily that we do need a desktop client.

The problem is just that there is nobody standing up. And I hope you don't
expect Avi to be the one creating a GUI.


Alex

--

From: Lukas Kolbe
Date: Monday, March 22, 2010 - 5:33 am

Besides, Ingo could just go ahead and use libvirt together with
virt-manager. It solves a few of the usability issues he came up with
somewhere in this thread, is available even in every current
distribution, and *actually* works quite well for the desktop usecase.
It just desperately needs more brainpower and manpower to make it a
competitor to VirtualBox & Co, because it's not as polished and
feature-complete yet. But I bet virt-manager's maintainers welcome patches
to fix and enhance usability. Most of the needed fixes probably wouldn't
touch qemu at all, let alone kvm.

Sorry to chime in with my opinion, but this whole thread is incredibly
boring and full of non-arguments yet really highly amusing.

-- 
Lukas


--

From: Avi Kivity
Date: Monday, March 22, 2010 - 5:29 am

I am also disinterested in ppc virtualization, yet it happened.  I am 
disinterested in ia64 virtualization, yet it happened.  I am 
disinterested in s390 virtualization, yet it happened.

Linus doesn't care about virtualization, yet it happened.

I don't tell my contributors what to be interested in, only whether their
patches are good or not.  I can tell you that Red Hat contributors don't
work on a desktop kvm GUI not because I discourage them, but because the
product we are working on does not contain a desktop kvm GUI.  Jan
Kiszka contributed a lot of debugger features, fixes, and improvements,
presumably he and/or his employer need that more than a kvm desktop GUI.

I can't see why you see anything wrong with this.  People write patches 

It would be lovely to have a desktop kvm GUI.  I don't feel I have to 

The perf bits have nothing to do with a GUI or usability for general 
users.  Calling them "unreasonable microkernelish server-centric views"

It's a fact that virtualization is happening in the data center, not on 
the desktop.  You think a kvm GUI can become a killer application? fine, 
write one.  You don't need any consent from me as kvm maintainer (if 
patches are needed to kvm that improve the desktop experience, I'll 
accept them, though they'll have to pass my unreasonable microkernelish 
filters).  If you're right then the desktop kvm GUI will be a huge hit 
with zillions of developers and people will drop Windows and switch to 
Linux just to use it.

But my opinion is that it will end up like virtualbox, a nice app that 

If you're going to use words like 'dishonest' then please don't send me 

Which distribution are they using?  Most people would see virt-manager 
as the first thing, not open gnome-terminal and start typing in the qemu 
command line.  While it's not perfect, it does have a shiny GUI with 

It doesn't have to be me.  Better to pick someone who has a clue about 
usability to design and guide this effort.  That someone can work ...

--

From: Ingo Molnar
Date: Monday, March 22, 2010 - 5:44 am

You should know the answer yourself: the difference is that usability is a 
core quality of any project.

I as a maintainer can be neutral towards a number of features and patch 
attributes that i dont consider key aspects. (although they can grow out to 
become key features in the future. SMP was a fringe thing 15 years ago.)

Usability is not an attribute you can ignore and i for sure am never neutral 
towards usability deficiencies in patches - i consider usability a key 

Whether a feature is usable or not is surely a metric of 'goodness'.

You have restricted your metric of goodness artificially to not include 
usability. You do that by claiming that the user-space tooling of KVM, while 
being functionally absolutely essential for any user to even try out KVM, is 
'separate' and has no quality connection with the kernel bits of KVM.

It is a convenient argument that allows you to do the kernel bits only. It is 
absolutely catastrophic to the user who'd like to see a usable solution and a 
single project who stands behind their tech.

Thus, _today_, after years of neglect, you can claim that none of the dozens 
of usability problems of KVM has anything to do with the features you are 
working on today. It's in a separate project (the so-called 'Qemu' package) 
after all - none of KVM's business.

In reality if you consider it a single project then those bugs were all 
usability problems introduced earlier on, years ago, when a piece of 
functionality was exposed via KVM. It adds up and now you claim they have 
nothing to do with current work.

This is why i consider that line of argument rather dishonest ...

	Ingo
--

From: Avi Kivity
Date: Monday, March 22, 2010 - 5:52 am

I am not going to reply to any more email from you on this thread.

-- 
error compiling committee.c: too many arguments to function

--

From: Ingo Molnar
Date: Monday, March 22, 2010 - 7:32 am

Because i pointed out that i consider a line of argument intellectually 
dishonest?

I did not say _you_ as a person are dishonest - doing that would be an ad 
hominem attack against your person. (In fact i dont think you are, to the 
contrary)

An argument can certainly be labeled dishonest in a fair discussion and it is 
not a personal attack against you to express my opinion about that.

Thanks,

	Ingo
--

From: Anthony Liguori
Date: Monday, March 22, 2010 - 7:43 am

You're being excessively rude in this thread.  That might be acceptable 
on LKML but it's not how the QEMU and KVM communities behave.  This 
thread is a good example of why LKML has the reputation it has.  Avi and 
I argue all of the time on qemu-devel and kvm-devel and it's never 
degraded into a series of personal attacks like this.

I've been trying very hard to turn this into a productive thread 
attempting to capture your feedback and give clear suggestions about how 
you can achieve your desired functionality.

What are you looking to achieve?  Do you just want to piss and moan 
about how terrible you think Avi and I are?  Or do you want to try to 
actually help make things better?

If you want to help make things better, please focus on making 
constructive suggestions and clarifying what you see as requirements.

Regards,

Anthony Liguori

--

From: Ingo Molnar
Date: Monday, March 22, 2010 - 8:55 am

I'm glad that we are at this more productive stage. I'm still trying to 
achieve the very same technological capabilities that i expressed in the first 
few mails when i reviewed the 'perf kvm' patch that was submitted by Yanmin.

The crux of the problem is very simple. To quote my earlier mail:

 |
 | - The inconvenience of having to type:
 |      perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms \
 |               --guestmodules=/home/ymzhang/guest/modules top
 |
 |
 |   is very obvious even with a single guest. Now multiply that by more guests ...
 |

For example we want 'perf kvm top' to do something useful by default: it 
should find the first guest running and it should report its profile.

The tool shouldnt have to guess about where the guests are, what their 
namespaces are and how to talk to them. We also want easy symbolic access to 
guests, for example:

  perf kvm -g OpenSuse-2 record sleep 1

I.e.:

 - Easy default reference to guest instances, and a way for tools to
   reference them symbolically as well in the multi-guest case. Preferably
   something trustable and kernel-provided - not some indirect information 
   like a PID file created by libvirt-manager or so.

 - Guest-transparent VFS integration into the host, to recover symbols and 
   debug info in binaries, etc.

There were a few responses to that but none really addressed those problems - 
they mostly tried to re-define the problem and suggested that i was wrong to 
want such capabilities and suggested various inferior approaches instead. See 
the thread for the details - i think i covered every technical suggestion that 
was made.

So we are still at an impasse as far as i can see. If i overlooked some 
suggestion that addresses these problems then please let me know ...

Thanks,

	Ingo
--

From: Anthony Liguori
Date: Monday, March 22, 2010 - 9:08 am

Two things are needed.  The first thing needed is to be able to 
enumerate running guests and identify a symbolic name.  I have a patch 
for this and it'll be posted this week or so.  perf will need to have a 
QMP client and it will need to look in ${HOME}/.qemu/qmp/ for sockets to 
connect to.

This is too much to expect from a client and we've got a GSoC idea 
posted to make a nice library for tools to use to simplify this.

The sockets are named based on UUID and you'll have to connect to a 
guest and ask it for its name.  Some guests don't have names so we'll 

A guest is not a KVM concept.  It's a qemu concept so it needs to be 
something provided by qemu.  The other caveat is that you won't see 
guests created by libvirt because we're implementing this in terms of a 
default QMP device and libvirt will disable defaults.  This is desired 
behaviour.  libvirt wants to be in complete control and doesn't want a 

The way I'd like to see this implemented is a guest userspace daemon.  I 
think having the guest userspace daemon be something that can be updated 
by the host is reasonable.

In terms of exposing that on the host, my preferred approach is QMP.  
I'd be happy with a QMP command that is essentially, 
guest_fs_read(filename) and guest_fd_readdir(path).

If desired, one could implement a fuse filesystem that interacted with 
all local qemu instances to expose this on the host.  There's a lot of 
ugly things about fuse though so I think sticking to QMP is best 
(particularly with respect to root access of a fuse filesystem).

With just those couple things in place, perf should be able to do 
exactly what you want it to do.
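
A minimal sketch of the enumeration flow described above, in Python. The
socket directory is the one proposed in this mail and is an assumption, not
something QEMU is known to provide by default; the function names
list_qmp_sockets and query_guest_name are made up for illustration. The
handshake (greeting, then qmp_capabilities, then query-name) follows the
QMP protocol as documented, and the sketch assumes one JSON message per line.

```python
import json
import os
import socket
import stat


def list_qmp_sockets(directory):
    """Return {uuid: path} for every Unix socket found in `directory`.

    Assumption: sockets are named after the guest UUID, as proposed in
    the mail; the directory itself is hypothetical.
    """
    found = {}
    if not os.path.isdir(directory):
        return found
    for entry in sorted(os.listdir(directory)):
        path = os.path.join(directory, entry)
        try:
            if stat.S_ISSOCK(os.stat(path).st_mode):
                found[entry] = path
        except OSError:
            continue  # entry vanished or is not accessible
    return found


def query_guest_name(path, timeout=1.0):
    """Ask one QEMU instance for its name over QMP; None if it has none."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        sock.connect(path)
        with sock.makefile("rw", encoding="utf-8", newline="\n") as stream:
            json.loads(stream.readline())  # greeting banner: {"QMP": ...}
            reply = {}
            for command in ("qmp_capabilities", "query-name"):
                stream.write(json.dumps({"execute": command}) + "\n")
                stream.flush()
                reply = json.loads(stream.readline())
                # Skip asynchronous events until the command's reply arrives.
                while "return" not in reply and "error" not in reply:
                    reply = json.loads(stream.readline())
            return reply.get("return", {}).get("name")
```

A caller would walk list_qmp_sockets(os.path.expanduser("~/.qemu/qmp")) and
fall back to the UUID for guests where query_guest_name returns None.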

Regards,


--

From: Ingo Molnar
Date: Monday, March 22, 2010 - 9:59 am

Ok, that sounds interesting! I'd rather see some raw mechanism that 'perf kvm' 
could use instead of having to require yet another library (which generally 
dampens adoption of a tool). So i think we can work from there.

Btw., have you considered using Qemu's command name (task->comm[]) as the 
symbolic name? That way we could see the guest name in 'top' on the host - a 

I think just exposing the UUID in that lazy case would be adequate? It creates 

Hm, this sucks for multiple reasons. Firstly, perf isnt a tool that 
'interacts', it's an observation tool: just like 'top' is an observation tool.

We want to enable developers to see all activities on the system - regardless 
of who started the VM or who started the process. Imagine if we had a way to 
hide tasks from 'top'. It would be rather awful.

Secondly, it tells us that the concept is fragile if it doesnt automatically 
enumerate all guests, regardless of how they were created.

Full system enumeration is generally best left to the kernel, as it can offer 
coherent access.

	Ingo
--

From: Anthony Liguori
Date: Monday, March 22, 2010 - 11:28 am

qemu-system-x86_64 -name Fedora,process=qemu-Fedora

Does exactly that.  We don't make this default based on the principle of 
least surprise.  Many users expect to be able to do killall 


Perf does interact with a guest though because it queries a guest to 
read its file system.

I understand the point you're making though.  If, instead of doing a pull 
interface where the host queries the guest for files, the guest 
pushed a small set of files at startup which the host cached, then you 
could potentially unconditionally expose a "read-only" socket that only 

I don't see why qemu can't offer coherent access.  The limitation today 
is intentional and if it's overly restrictive, we can figure out a means 
to change it.

Regards,


--

From: Ingo Molnar
Date: Monday, March 22, 2010 - 10:11 am

Well, in a sense a guest is a KVM concept too: it's in essence represented via 
the 'vcpu state attached to a struct mm' abstraction that is attached to the 
/dev/kvm file descriptor attached to a Linux process.

Multiple vcpus can be started by the same process to represent SMP, but the 
whole guest notion is present: a Linux MM that carries KVM state.

In that sense when we type 'perf kvm list' we'd like to get a list of all 
currently present guests that the developer has permission to profile: i.e. 
we'd like a list of all [debuggable] Linux tasks that have a KVM instance 
attached to them.

A convenient way to do that would be to use the Qemu process's ->comm[] name, 
and to have a KVM ioctl that gets us a list of all vcpus that the querying 
task has ptrace permission to. [the standard permission check we do for 
instrumentation]

No need for communication with Qemu for that - just an ioctl, and an 
always-guaranteed result that works fine on a whole-system and on a per user 
basis as well.
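
For comparison, a rough userspace approximation of such a listing, in
Python. This is not the ioctl proposed above (no such ioctl appears in this
thread); it only scans /proc, so it is a sketch under assumptions: that a
task holding a KVM VM descriptor shows an "anon_inode:kvm-vm" link in its
fd directory, and that the hypothetical helper name kvm_tasks is acceptable.
Reading another task's fd directory is subject to the kernel's usual access
checks, so unreadable tasks are simply skipped.

```python
import os

# Assumption: on Linux, readlink() on a KVM VM descriptor yields this text.
KVM_VM_LINK = "anon_inode:kvm-vm"


def kvm_tasks(proc_root="/proc"):
    """Return {pid: comm} for every inspectable task that holds a KVM VM fd."""
    guests = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        task_dir = os.path.join(proc_root, entry)
        fd_dir = os.path.join(task_dir, "fd")
        try:
            links = [os.readlink(os.path.join(fd_dir, fd))
                     for fd in os.listdir(fd_dir)]
            if KVM_VM_LINK not in links:
                continue
            # comm is the task's command name (task->comm[] in the kernel).
            with open(os.path.join(task_dir, "comm")) as comm:
                guests[int(entry)] = comm.read().strip()
        except OSError:
            # No permission to inspect this task, or it exited meanwhile.
            continue
    return guests
```

Unlike a kernel-provided list, this scan is racy and depends on the link
text staying stable, which is part of the argument being made here.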

Thanks,

	Ingo
--

From: Anthony Liguori
Date: Monday, March 22, 2010 - 11:30 am

You need a way to interact with the guest which means you need some type 
of device.  All of the interesting devices are implemented in qemu so 
you're going to have to interact with qemu if you want meaningful 
interaction with a guest.

Regards,


--

From: Avi Kivity
Date: Monday, March 22, 2010 - 9:12 am

No, you're not.  You're trying to fracture the qemu community with your 
tools/kvm proposal, you're explaining to me how I'm working on the wrong 
thing by concentrating on things that my employer needs rather than what 
you think kvm needs, and attaching various unsavoury labels to Anthony 
and myself.  Any wonder we aren't getting anything done?

If you can commit to a reasonable conversation we might be able to make 

Usually 'layering violation' is trotted out at such suggestions.  I 
don't like using the term, because sometimes the layers are incorrect 
and need to be violated.  But it should be done explicitly, not as a 
shortcut for a minor feature (and profiling is a minor feature, most 
users will never use it, especially guest-from-host).

The fact is we have well defined layers today, kvm virtualizes the cpu 
and memory, qemu emulates devices for a single guest, libvirt manages 
guests.  We break this sometimes but there has to be a good reason.  So 
perf needs to talk to libvirt if it wants names.  Could be done via 

You simply kept ignoring me when I said that if something can be kept 
out of the kernel without impacting performance, it should be.  I don't 
want emergency patches closing some security hole or oops in a kernel 
symbol server.

The usability argument is a red herring.  True, it takes time for things 
to trickle down to distributions and users.  Those who can't wait can 

The impasse is mostly due to you insisting on doing everything your way, 
in the kernel, and disregarding how libvirt/qemu/kvm does things.  Learn 
the kvm ecosystem, you'll find it is quite easy to contribute code.

-- 
error compiling committee.c: too many arguments to function

--

From: Avi Kivity
Date: Monday, March 22, 2010 - 9:16 am

Or rather, explained how I am a wicked microkernelist.  The herring were 
out in force today.

-- 
error compiling committee.c: too many arguments to function

--

From: Pekka Enberg
Date: Monday, March 22, 2010 - 9:40 am

Well, if it's not being a "wicked microkernelist" then what is it?
Performance is hardly the only motivation to put things into the
kernel. Think kernel mode-setting and devtmpfs (with the ironic twist
of original devfs being removed from the kernel) here, for example.

                        Pekka
--

From: Avi Kivity
Date: Monday, March 22, 2010 - 11:06 am

Motivations include privileged device access, needing to access physical 
memory, security, and keeping the userspace interface sane.  There are 
others.  I don't think any of them hold here.

-- 
error compiling committee.c: too many arguments to function

--

From: Ingo Molnar
Date: Monday, March 22, 2010 - 9:51 am

[ Sidenote: i still received no adequate suggestions about how to provide this

That's weird, how can a feature request be a 'layering violation'?

If something that users find straightforward and usable is a layering 
violation to you (such as easily being able to access their own files on the 
host as well ...) then i think you need to revisit the definition of that term 

I never suggested an "in kernel space symbol server" which could oops, why 
would i have suggested that? Please point me to an email where i suggested 

It's not just "download and compile", it's also "configure correctly for 
several separate major distributions" and "configure to per guest instance 
local rules".

It's far more fragile in practice than you make it appear to be, and since you 
yourself expressed that you are not interested much in the tooling side, how 
can you have adequate experience to judge such matters?

In fact for instrumentation it's beyond a critical threshold of fragility - 
instrumentation above all needs to be accessible, transparent and robust.

If you cannot see the advantages of a properly integrated solution then i 
suspect there's not much i can do to convince you.

And you ignored not just me but you ignored several people in this thread who 
thought the current status quo was inadequate and expressed interest in both 
the VFS integration and in the guest enumeration features.

Thanks,

	Ingo
--

From: Avi Kivity
Date: Monday, March 22, 2010 - 10:08 am

You need to integrate with libvirt to convert guest names to something that 

The 'something trustable and kernel-provided'.  The kernel knows nothing 


You insisted that it be in the kernel.  Later you relaxed that and said 
a daemon is fine.  I'm not going to reread this thread, once is more 

That's life in Linux-land.  Either you let distributions feed you cooked 
packages and relax, or you do the work yourself.  If we had 

People on kvm-devel manage to build and run release tarballs and even 
directly from git.  I build packages from source occasionally.  It isn't 

Integration in Linux happens at the desktop or distribution level.  You 
want to move it to the kernel level.  It works for perf, great, but that 
doesn't mean it will work for everything else.  Once perf grows a GUI, I 
expect it will stop working for perf as well (for example, if gtk breaks 

I'm sorry.  I don't reply to every email.  If you want my opinion on 
something, you can ask me again.

-- 
error compiling committee.c: too many arguments to function

--

From: Ingo Molnar
Date: Monday, March 22, 2010 - 10:34 am

The kernel certainly knows about other resources such as task names or network 

This is really just the much-discredited microkernel approach for keeping 
global enumeration data that should be kept by the kernel ...

Lets look at the ${HOME}/.qemu/qmp/ enumeration method suggested by Anthony. 
There are numerous ways that this can break:

 - Those special files can get corrupted, mis-setup, get out of sync, or can
   be hard to discover.

 - The ${HOME}/.qemu/qmp/ solution suggested by Anthony has a very obvious
   design flaw: it is per user. When i'm root i'd like to query _all_ current
   guest images, not just the ones started by root. A system might not even
   have a notion of '${HOME}'.

 - Apps might start KVM vcpu instances without adhering to the
   ${HOME}/.qemu/qmp/ access method.

 - There is no guarantee for the Qemu process to reply to a request - while
   the kernel can always guarantee an enumeration result. I dont want 'perf 
   kvm' to hang or misbehave just because Qemu has hung.

Really, for such reasons user-space is pretty poor at doing system-wide 
enumeration and resource management. Microkernels lost for a reason.

You are committing several grave design mistakes here.

Thanks,

	Ingo
--

From: Avi Kivity
Date: Monday, March 22, 2010 - 10:55 am

But it doesn't know about guest names.  You can't trust task names since 
any user can create a task with any name.  Network interfaces are root 
only so you can trust their names.

There are dozens or even hundreds of object classes the kernel does not 
know about and cannot enumerate.  User names, for instance. X sessions.  
Windows (the screen artifact, not the OS).  CIFS shares exported by this 
machine.  Currently running applications (not processes).

btw, network interfaces would have been much better off using 

I disagree it should be kept in the kernel.  Why introduce a new 
namespace, with APIs to query it, manage it, rules regarding conflicts, 




Take a look at your desktop, userspace is doing all of that everywhere, 
from enumerating users and groups, to deciding how your disks are 

I am committing on the shoulders of giants.

-- 
error compiling committee.c: too many arguments to function

--

From: Anthony Liguori
Date: Monday, March 22, 2010 - 12:15 pm

We're stuck in a rut with libvirt and I think a lot of the 
dissatisfaction with qemu is rooted in that.  It's not libvirt that's 
the problem, but the relationship between qemu and libvirt.

We add a feature to qemu and maybe after six months it gets exposed by 
libvirt.  Release time lines of the two projects complicate the 
situation further.  People that write GUIs are limited by libvirt 
because that's what they're told to use and when they need something 
simple, they're presented with first getting that feature implemented in 
qemu, then plumbed through libvirt.

It wouldn't be so bad if libvirt was basically a passthrough interface 
to qemu but it tries to model everything in a generic way which is more 
or less doomed to fail when you're adding lots of new features (as we are).

The list of things that libvirt doesn't support and won't any time soon 
is staggering.

libvirt serves an important purpose, but we need to do a better job in 
qemu with respect to usability.  We can't just punt to libvirt.

Regards,

Anthony Liguori

--

From: Daniel P. Berrange
Date: Monday, March 22, 2010 - 12:31 pm

That is somewhat unfair as a blanket statement! 

While some features have had a long time delay & others are not supported
at all, in many cases we have had zero delay. We have been supporting QMP,
qdev, vhost-net since before the patches for those features were even merged
in QEMU GIT! It varies depending on how closely QEMU & libvirt people have
been working together on a feature, and on how strongly end users are demanding

As previously discussed, we want to improve both the set of features
supported, and make it much easier to support new features promptly.
The QMP & qdev stuff has been a very good step forward in making it
easier to support QEMU management. There have been proposals from 
several people, yourself included, on how to improve libvirt's support
for the full range of QEMU features. We're committed to looking at this
and figuring out which proposals are practical to support, so we can
improve QEMU & libvirt interaction for everyone.

Regards,
Daniel
-- 
|: Red Hat, Engineering, London    -o-   http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :|
|: http://autobuild.org        -o-         http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505  -o-   F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|
--

From: Anthony Liguori
Date: Monday, March 22, 2010 - 12:33 pm

Sorry, you're certainly correct.  Some features appear quickly, but 

Regards,


--

From: Alexander Graf
Date: Monday, March 22, 2010 - 12:39 pm

Yes. I think the point was that every layer in between brings potential slowdown and loss of features.

Hopefully this will go away with QMP. By then people can decide if they want to be hypervisor agnostic (libvirt) or tightly coupled with qemu (QMP). The best of both worlds would of course be a QMP pass-through in libvirt. No idea if that's easily possible.

Either way, things are improving. What people see at the end is virt-manager though. And if you compare it feature-wise as well as looks-wise, vbox is simply superior. Several features are lacking in lower layers too (pv graphics, always working absolute pointers, clipboard sharing, ...).

That said it doesn't mean we should resign. It means we know which areas to work on :-). And we know that our problem is not the kernel/userspace interface, but the qemu and above interfaces.

Alex
--

From: Ingo Molnar
Date: Monday, March 22, 2010 - 12:54 pm

Exactly. The more 'fragmented' a project is into sub-projects, without a 
single, unified, functional reference implementation in the center of it, the 
longer it takes to fix 'unsexy' problems like trivial usability bugs.

Furthermore, another negative effect is that many times features are 
implemented not in their technically best way, but in a way to keep them local 
to the project that originates them. This is done to keep deployment latencies 
and general contribution overhead down to a minimum. The moment you have to 
work with yet another project, the overhead adds up.

So developers rather go for the quicker (yet inferior) hack within the 
sub-project they have best access to.

Tell me this isnt happening in this space ;-)

Thanks,

	Ingo
--

From: Alexander Graf
Date: Monday, March 22, 2010 - 12:58 pm

I disagree there. Keeping things local and self-contained has been the UNIX secret. It works really well as long as the boundaries are well defined.


Well - not necessarily hacks. It's more about project boundaries. Nothing is bad about that. You wouldn't want "ls" implemented in the Linux kernel either, right? :-)


Alex
--

From: Ingo Molnar
Date: Monday, March 22, 2010 - 1:21 pm

The 'UNIX secret' works for text driven pipelined commands where we are 
essentially programming via narrow ASCII input of mathematical logic.

It doesnt work for a GUI that is a 2D/3D environment of millions of pixels, 

Have you thought about why that might be so?

I think it's because of what i outlined above - that you are trying to apply 
the "UNIX secret" to GUIs - and that is a mistake.

A good GUI is almost at the _exact opposite end of the spectrum_ from a good command-line 
tool: tightly integrated, with 'layering violations' designed into it all over 
the place:

  look i can paste the text from an editor straight into a firefox form. I
  didnt go through any hierarchy of layers, i just took the shortest path 
  between the apps!

In other words: in a GUI the output controls the design, for command-line 
tools the design controls the output.

It is no wonder Unix always had its problems with creating good GUIs that are 
efficient to humans. A good GUI works like the human brain, and the human 
brain does not mind 'layering violations' when that gets it a more efficient 

I guess you are talking to the wrong person as i actually have implemented ls 
functionality in the kernel, using async IO concepts and extreme threading ;-) 
It was a bit crazy, but was also the fastest FTP server ever running on this 
planet.

	Ingo
--

From: Avi Kivity
Date: Monday, March 22, 2010 - 1:35 pm

Modularization is needed when a project exceeds the average developer's 
capacity.  For kvm,  it is logical to separate privileged cpu 
virtualization, from guest virtualization, from host management, from 

Nope.  You copied text from one application into the clipboard (or 
selection, or PRIMARY, or whatever) and pasted text from the clipboard to 
another application.  If firefox and your editor had to interact directly, 
all would be lost.

See - there was a global (for the session) third party, and it wasn't 


The problem is that only developers are involved, not people who 
understand human-computer interaction (in many cases, not human-human 
interaction either).  Another problem is that a good GUI takes a lot of 
work so you need a lot of committed resources.  A third problem is that 
it isn't a lot of fun, at least not the 20% of the work that takes 800% 
of the time.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Bernd Petrovitsch
Date: Tuesday, March 23, 2010 - 3:48 am

On Mon, 2010-03-22 at 21:21 +0100, Ingo Molnar wrote:

Yes.

Foreword: I assume that with "GUI" you mean "a user interface for the
classical desktop user with next to no interest in learning details or
basics".
That doesn't mean the classical desktop user is silly, stupid or

No, it's the very same mechanism. But you just have to start at the
correct point. In the kernel/device driver world, you start at the
device.
And in the GUI world, you better start at the GUI (and not some kernel
ACK, because you have to make the GUI understandable to the intended users.
If that means "hiding 90% of all possibilities and features", you just
hide them.
Of course, the user of such a UI is quite limited and doesn't use much of
the functionality - because s/he can't access it through the GUI - (but
presenting 100% - or even 40% - doesn't help either as s/he won't
ACK, because the user in this case (which is most of the time a
developer, sys-admin, or similar techie) *wants* a 1:1 picture of the
underlying model because s/he already *knows* the underlying model (and

ACK. The cliché Unix person doesn't come from the "GUI world". So most

If this is the case, the layering/structure/design of the GUI is (very)
badly defined/chosen (for whatever reason).

[ Most probably because some seasoned software developer designed the
GUI-app *without* designing (and testing!) the GUI (or more to the
point: the look - how does it look like - and feel - how does it behave,
what are the possible workflows, ... - of it) first. ]

	Bernd
-- 
Bernd Petrovitsch                  Email : bernd@petrovitsch.priv.at
                     LUGA : http://www.luga.at

--

From: Antoine Martin
Date: Monday, March 22, 2010 - 1:19 pm

Integration is hard, requires a wider set of technical skills and 
getting good test coverage becomes more difficult.
But I agree that it is worth the effort, kvm could reap large rewards 
from putting a greater emphasis on integration (ala vbox) - no matter 
how it is achieved (cowardly not taking sides on implementation 
decisions like repository locations).


--

From: Antoine Martin
Date: Monday, March 22, 2010 - 1:00 pm

+1
The obvious reason why so many people still use shell scripts rather 
than libvirt is that it just doesn't provide what they need.
Every time I've looked at it (and I've been looking for a better 
solution for many years), it seems that it would have provided most of 
the things I needed, but the remaining bits were unsolvable.

Shell scripts can be ugly, but you get total control.


--

From: Daniel P. Berrange
Date: Monday, March 22, 2010 - 1:58 pm

If you happen to remember what missing features prevented you choosing
libvirt, that would be invaluable information for us, to see if there
are quick wins that will help out. We got very useful feedback when
recently asking people this same question

http://rwmj.wordpress.com/2010/01/07/quick-quiz-what-stops-you-from-using-libvirt/

Allowing arbitrary passthrough of QEMU commands/args will solve some of
these issues, but certainly far from solving all of them. eg guest cut+
paste, host side control of guest screen resolution, easier x509/TLS 
configuration for remote management, soft reboot, Windows desktop support
for virt-manager, host network interface management/setup, etc

Regards,
Daniel
-- 
|: Red Hat, Engineering, London    -o-   http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :|
|: http://autobuild.org        -o-         http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505  -o-   F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|
--

From: Ingo Molnar
Date: Monday, March 22, 2010 - 12:20 pm

Which has pretty much the same problems as the ${HOME}/.qemu/qmp/ solution, 


Erm, but i'm talking about a dead tool here. There's a world of a difference 
between 'kvm top' not showing new entries (because the guest is dead), and 
'perf kvm top' hanging due to Qemu hanging.

So it's essentially 4 out of 4. Yet your reply isnt "Ingo you are right" but 

We dont do that for robust system instrumentation, for heaven's sake!

By your argument it would be perfectly fine to implement /proc purely via 

Really, this is getting outright ridiculous. You agree with me that Anthony 
suggested a technically inferior solution, yet you even seem to be proud of it 
and are joking about it?

And _you_ are complaining about lkml-style hard-talk discussions?

Thanks,

	Ingo
--

From: Avi Kivity
Date: Monday, March 22, 2010 - 12:44 pm

It doesn't follow.  The libvirt daemon could/should own guests from all 
users.  I don't know if it does so now, but nothing is preventing it 



My reply is "you are right" (phrased earlier as "I don't like it either" 
meaning I agree with your dislike).  One of your criticisms was invalid, 

If qemu fails, you lose your guest.  If libvirt forgets about a guest, 
you can't do anything with it any more.  These are more serious problems 
than 'perf kvm' not working.  Qemu and libvirt have to be robust anyway, 
we can rely on them.  Like we have to rely on init, X, sshd, and a 

I would have preferred /proc to be implemented via syscalls called 
directly from tools, and good tools written to expose the information in 
it.  When computers were slower 'top' would spend tons of time opening 
and closing all those tiny files and parsing them.  Of course the kernel 


In every Linux system userspace is doing or proxying much of the 
enumeration and resource management.  So if enumerating guests in 

There is a difference between joking and insulting people.  I enjoy 
jokes but I dislike being insulted.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Ingo Molnar
Date: Monday, March 22, 2010 - 1:06 pm

It's hard for me to argue against a hypothetical implementation, but all 
user-space driven solutions for resource enumeration i've seen so far had 

I think you didnt understand my point. I am talking about 'perf kvm top' 
hanging if Qemu hangs.

With a proper in-kernel enumeration the kernel would always guarantee the 
functionality, even if the vcpu does not make progress (i.e. it's "hung").

With this implemented in Qemu we lose that kind of robustness guarantee.

And especially during development (when developers use instrumentation the 
most) it is important to have robust instrumentation that does not hang along 

How on earth can you justify a bug ("perf kvm top" hanging) by pointing out 
that there are other bugs as well?

Basically you are arguing the equivalent of it being fine for a gdb session to 
become unresponsive if the debugged task hangs. Fortunately ptrace is 
kernel-based and it never 'hangs' if the user-space process hangs somewhere.

This is an essential property of good instrumentation.

So the enumeration method you suggested is a poor, sub-par solution, simple 

We can still profile any of those tools without the profiler breaking if the 

(Then you'll be pleased to hear that perf has enabled exactly that, and that we 
are working towards that precise usecase.)

	Ingo
--

From: Avi Kivity
Date: Monday, March 22, 2010 - 1:15 pm

Use non-blocking I/O, report that guest as dead.  No point in profiling 

If qemu has a bug in the resource enumeration code, you can't profile 
one guest.  If the kernel has a bug in the resource enumeration code, 

It's nice not to have kernel oopses either.  So when code can be in 

There's no reason for 'perf kvm top' to hang if some process is not 




Are you exporting /proc/pid data via the perf syscall?  If so, I think 
that's a good move.


--

From: Ingo Molnar
Date: Monday, March 22, 2010 - 1:29 pm

Erm, at what point do i decide that a guest is 'dead' versus 'just lagged due 
to lots of IO' ?

Also, do you realize that you increase complexity (the use of non-blocking 
IO), just to protect against something that wouldn't happen if the right 

This is really simple code, not rocket science. If there's a bug in it we'll 
fix it. On the other hand a 500KLOC+ piece of Qemu code has lots of places to 
hang, so that is a large cross section.

	Ingo
--

From: Avi Kivity
Date: Monday, March 22, 2010 - 1:40 pm

qemu shouldn't block due to I/O (it does now, but there is work to fix 
it).  Of course it could be swapping or other things.

Pick a timeout, everything we do has timeouts these days.  It's the 
price we pay for protection: if you put something where a failure can't 
hurt you, you have to be prepared for failure, and you might have false 
alarms.
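
For illustration, the timeout approach could look roughly like the Python sketch below. It assumes each guest's userspace answers a one-line query on a connected socket; the `PING` request and the `query_guest` helper are hypothetical and are not part of qemu or perf.

```python
import socket

def query_guest(sock, request=b"PING\n", timeout=0.5):
    """Send a request and wait at most `timeout` seconds for a reply.

    Returns the reply bytes, or None if the peer did not answer in time,
    so the caller can report the guest as unresponsive instead of hanging.
    """
    sock.settimeout(timeout)
    try:
        sock.sendall(request)
        return sock.recv(4096)
    except socket.timeout:
        return None

if __name__ == "__main__":
    # Simulate a hung peer: the other end never reads or replies.
    ours, theirs = socket.socketpair()
    reply = query_guest(ours, timeout=0.1)
    print("guest unresponsive" if reply is None else reply)
```

The price is the one named above: a guest that is merely slow is indistinguishable from a dead one once the timeout expires.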

Is it so horrible for 'perf kvm top'?  No user data loss will happen, 
surely?

On the other hand, if it's in the kernel and it fails, you will lose 

It's a tradeoff.  Increasing the kernel code size vs. increasing 

The kernel has tons of very simple code (and some very complex code as 
well), and tons of -stable updates as well.


--

From: Anthony Liguori
Date: Monday, March 22, 2010 - 11:35 am

Not all KVM vcpus are running operating systems.

Transitive had a product that was using a KVM context to run their 
binary translator which allowed them full access to the host process's 
virtual address space range.  In this case, there is no kernel and there 
are no devices.

That's what I mean by a guest being a userspace context.  KVM simply 
provides a new CPU mode to userspace in the same way that vm8086 mode does.

Regards,


--

From: Ingo Molnar
Date: Monday, March 22, 2010 - 12:22 pm

And your point is that such vcpus should be excluded from profiling just 
because they fall outside the Qemu/libvirt umbrella?

That is a ridiculous position.

	Ingo
--

From: Anthony Liguori
Date: Monday, March 22, 2010 - 12:29 pm

You don't instrument it the way you'd instrument an operating system so 
no, you don't want it to show up in perf kvm top.

Regards,


--

From: Ingo Molnar
Date: Monday, March 22, 2010 - 1:32 pm

Erm, why not? It's executing a virtualized CPU, so sure it makes sense to 
allow the profiling of it!

It might even not be the weird case you mentioned but some competing 
virtualization project to Qemu ...

So your argument is wrong on several technical levels, sorry.

Thanks,

	Ingo
--

From: Avi Kivity
Date: Monday, March 22, 2010 - 1:43 pm

It may not make sense to have symbol tables for it, for example it isn't 
generated from source code but from binary code for another architecture.

Of course, just showing addresses is fine, but you don't need qemu for that.


--

From: Avi Kivity
Date: Monday, March 22, 2010 - 12:45 pm

Non-guest vcpus will not be able to provide Linux-style symbols.


--

From: Ingo Molnar
Date: Monday, March 22, 2010 - 1:35 pm

And why do you say that it makes no sense to profile them?

Also, why do you define 'guest vcpus' to be 'Qemu started guest vcpus'? If 
some other KVM using project (which you encouraged just a few mails ago) 
starts a vcpu we still want to be able to profile them.

	Ingo
--

From: Avi Kivity
Date: Monday, March 22, 2010 - 1:45 pm

It makes sense to profile them, but you don't need to contact their 

Maybe it should provide a mechanism for libvirt to list it.


--

From: Anthony Liguori
Date: Monday, March 22, 2010 - 11:41 am

If your position basically boils down to, we can't trust userspace and 
we can always trust the kernel, I want to eliminate any userspace path, 
then I can't really help you out.

I believe we can come up with an infrastructure that satisfies your 
actual requirements within qemu but if you're also insisting upon the 
above implementation detail then there's nothing I can do.

Regards,

Anthony Liguori

--

From: Ingo Molnar
Date: Monday, March 22, 2010 - 12:27 pm

Why would you want to 'help me out'? I can tell a good solution from a bad one 
just fine.

You should instead read the long list of disadvantages above, invert them and 
list them as advantages for the kernel-based vcpu enumeration solution, apply 
common sense and go admit to yourself that indeed in this situation a kernel 
provided enumeration of vcpu contexts is the most robust solution.

It's really as simple as that :-)

Thanks,

	Ingo
--

From: Avi Kivity
Date: Monday, March 22, 2010 - 12:47 pm

You are basically making a kernel implementation a requirement, instead 

Having qemu enumerate guests one way or another is not a good idea IMO 
since it is focused on one guest and doesn't have a system-wide entity.  
A userspace system-wide entity will work just as well as kernel 
implementation, without its disadvantages.


--

From: Ingo Molnar
Date: Monday, March 22, 2010 - 1:46 pm

A system-wide user-space entity only solves one problem out of the 4 i listed, 
still leaving the other 3:

 - Those special files can get corrupted, mis-setup, get out of sync, or can
   be hard to discover.

 - Apps might start KVM vcpu instances without adhering to the
   system-wide access method.

 - There is no guarantee for the system-wide process to reply to a request -
   while the kernel can always guarantee an enumeration result. I don't want
   'perf kvm' to hang or misbehave just because the system-wide entity has 
   hung.

Really, i think i have to give up and not try to convince you guys about this 
anymore - i don't think you are arguing constructively anymore and i don't want 
yet another pointless flamewar about this.

Please consider 'perf kvm' scrapped indefinitely, due to lack of robust KVM 
instrumentation features: due to lack of robust+universal vcpu/guest 
enumeration and due to lack of robust+universal symbol access on the KVM side. 
It was a really promising feature IMO and i invested two days of arguments 
into it trying to find a workable solution, but it was not to be.

Thanks,

	Ingo
--

From: Avi Kivity
Date: Monday, March 22, 2010 - 1:53 pm

That's a hard requirement anyway.  If it happens, we get massive data 
loss.  Way more troubling than 'perf kvm top' doesn't work.  So consider 

Then you don't get their symbol tables.  That happens anyway if the 
symbol server is not installed, not running, handing out fake data.  So 

When you press a key there is no guarantee no component along the way 

I am not going to push libvirt or a subset thereof into the kernel for 
'perf kvm'.


--

From: Anthony Liguori
Date: Monday, March 22, 2010 - 3:06 pm

There always needs to be a system wide entity.  There are two ways to 
enumerate instances from that system wide entity.  You can centralize 
the creation of instances and thereby maintain a list of current 
instances.  You can also allow instances to be created in a 
decentralized manner and provide a standard mechanism for instances to 
register themselves with the system wide entity.

IOW, it's the difference between asking libvirtd to exec(qemu) vs 
allowing a user to exec(qemu) and having qemu connect to a well known 
unix domain socket for libvirt to tell libvirtd that it exists.

The latter approach has a number of advantages.  libvirt already supports 
both models.  The former is the '/system' uri and the latter is the 
'/session' uri.

What I'm proposing is to use the host file system as the system wide 
entity instead of libvirtd.  libvirtd can monitor the host file system 
to participate in these activities but ultimately, moving this 
functionality out of libvirtd means that it becomes the standard 
mechanism for all qemu instances regardless of how they're launched.
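
As a rough illustration of using the host file system as the registry, here is a Python sketch. The registry directory, the one-entry-per-pid naming and all function names are assumptions made for the example and do not describe what qemu or libvirt actually do; the liveness check is one possible way to ignore entries left behind by a process that exited without cleaning up.

```python
import os
import tempfile

def register_instance(registry_dir, pid):
    """Called by an instance at startup: create an entry named after its pid."""
    path = os.path.join(registry_dir, str(pid))
    with open(path, "w") as f:
        f.write("placeholder for a monitor socket\n")
    return path

def pid_alive(pid):
    """Best-effort liveness check; signal 0 only probes for existence."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True

def list_instances(registry_dir):
    """Return pids that have an entry and still correspond to a live process."""
    pids = []
    for name in os.listdir(registry_dir):
        if name.isdigit() and pid_alive(int(name)):
            pids.append(int(name))
    return sorted(pids)

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as registry:
        register_instance(registry, os.getpid())
        print(list_instances(registry))
```

Any monitoring tool, libvirtd included, would only need to read that directory, independent of who launched the instances.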

Regards,


--

From: Avi Kivity
Date: Tuesday, March 23, 2010 - 2:07 am

I don't like dropping sockets into the host filesystem, especially as 
they won't be cleaned up on abnormal exit.  I also think this breaks our 
'mechanism, not policy' policy.  Someone may want to do something weird 
with qemu that doesn't work well with this.

We could allow starting monitors from the global configuration file, so 
a distribution can do this if it wants, but I don't think we should do 
this ourselves by default.

-- 
error compiling committee.c: too many arguments to function

--

From: Anthony Liguori
Date: Tuesday, March 23, 2010 - 7:09 am

The approach I've taken (which I accidentally committed and reverted) 
was to set this up as the default qmp device much like we have a default 
monitor device.  A user is capable of overriding this by manually 

I've looked at making default devices globally configurable.  We'll get 
there but I think that's orthogonal to setting up a useful default qmp 
device.

Regards,

Anthony Liguori

--

From: Kevin Wolf
Date: Tuesday, March 23, 2010 - 3:13 am

I think the latter is exactly what I would want for myself. I do see the
advantages of having a central instance, but I really don't want to
bother with libvirt configuration files or even GUIs just to get an
ad-hoc VM up when I can simply run "qemu -hda hd.img -m 1024". Let alone
that I usually want to have full control over qemu, including monitor
access and small details available as command line options.

I know that I'm not the average user with these requirements, but still
I am one user and do have these requirements. If I could just install
libvirt, continue using qemu as I always did and libvirt picked my VMs
up for things like global enumeration, that would be more or less the
optimal thing for me.

Kevin
--

From: Antoine Martin
Date: Tuesday, March 23, 2010 - 3:28 am

+1
And it would also make it more likely that users like us would convert 
to libvirt in the long run, by providing an easy and integrated 
transition path.
I've had another look at libvirt, and one of the things that is holding 
me back is the cost of moving existing scripts to libvirt. If it could 
just pick up what I have (at least in part), then I don't have to.


--

From: Joerg Roedel
Date: Tuesday, March 23, 2010 - 7:06 am

And this system wide entity is the kvm module. It creates instances of
'struct kvm' and destroys them. I see no problem if we just attach a
name to every instance with a good default value like kvm0, kvm1 ... or
guest0, guest1 ... User-space can override the name if it wants. The kvm
module takes care about the names being unique.
This is very much the same as network card numbering is implemented in
the kernel.
Forcing perf to talk to qemu or even libvirt produces too much overhead
imho. Instrumentation only produces useful results with low overhead.
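
To make the naming idea concrete, here is a small Python model of such an allocator. It only illustrates the policy (default names kvm0, kvm1, ..., optional override, uniqueness enforced in one place); the `GuestNames` class is hypothetical and is not kernel code.

```python
class GuestNames:
    """Hands out unique guest names, similar in spirit to eth0, eth1, ..."""

    def __init__(self, prefix="kvm"):
        self.prefix = prefix
        self.in_use = set()

    def allocate(self, requested=None):
        """Return `requested` if it is free, else the lowest free default name."""
        if requested is not None:
            if requested in self.in_use:
                raise ValueError(f"name already in use: {requested}")
            self.in_use.add(requested)
            return requested
        n = 0
        while f"{self.prefix}{n}" in self.in_use:
            n += 1
        name = f"{self.prefix}{n}"
        self.in_use.add(name)
        return name

    def release(self, name):
        """Free a name when its guest is destroyed."""
        self.in_use.discard(name)

if __name__ == "__main__":
    names = GuestNames()
    print(names.allocate(), names.allocate(), names.allocate("MyGuest"))
```

Note that a single flat namespace like this rejects a second guest that asks for a name already taken.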

	Joerg

--

From: Avi Kivity
Date: Tuesday, March 23, 2010 - 9:39 am

So, two users can't have a guest named MyGuest each?  What about 
namespace support?  There's a lot of work in virtualizing all kernel 
namespaces, you're adding to that.  What about notifications when guests 

It's a setup cost only.


--

From: Joerg Roedel
Date: Tuesday, March 23, 2010 - 11:21 am

This enumeration is a very small and non-intrusive feature. Making it

Who would be the consumer of such notifications? A 'perf kvm list' can

My statement was not limited to enumeration, I should have been more
clear about that. The guest filesystem access-channel is another
affected part. The 'perf kvm top' command will access the guest
filesystem regularly and going over qemu would be more overhead here.
Providing this in the KVM module directly also has the benefit that it
would work out-of-the-box with different userspaces too.  Or do we want
to limit 'perf kvm' to the libvirt-qemu-kvm software stack?

Sidenote: I really think we should come to a conclusion about the
          concept. KVM integration into perf is a very useful feature for
          analyzing virtualization workloads.

Thanks,

	Joerg

--

From: Peter Zijlstra
Date: Tuesday, March 23, 2010 - 11:27 am

I always start my things with bare kvm. It would be very unwelcome to
mandate libvirt, or for that matter running a particular userspace in
the guest.
--

From: Javier Guerra Giraldez
Date: Tuesday, March 23, 2010 - 12:05 pm

an outsider's comment: this path leads to a filesystem... which could
be a very nice idea.  it could have a directory for each VM, with
pseudo-files with all the guest's status, and even the memory it's
using.  perf could simply watch those files.   in fact, such a
filesystem could be the main userlevel/kernel interface.

but i'm sure such a layout was considered (and rejected) very early in
the KVM design.  i don't think there's anything new to make it more
desirable than it was back then.


-- 
Javier
--

From: Avi Kivity
Date: Tuesday, March 23, 2010 - 9:57 pm

It's easier (and safer and all the other boring bits) not to do it at 

System-wide monitoring needs to work equally well for guests started 
before or after the monitor.  Even disregarding that, if you introduce 
an API, people will start using it and complaining if it's incomplete.


Why?  Also, the real cost would be accessing the filesystem, not copying 

Other userspaces can also provide this functionality, like they have to 
provide disk, network, and display emulation.  The kernel is not a huge 

Agreed.


--

From: Joerg Roedel
Date: Wednesday, March 24, 2010 - 4:59 am

For the KVM stack it doesn't matter where it is implemented. It is as
easy in qemu or libvirt as in the kernel. I also don't see big risks. On
the perf side and for its users it is a lot easier to have this in the
kernel.
I for example always use plain qemu when running kvm guests and never
used libvirt. The only central entity I have here is the kvm kernel

Could be easily done using notifier chains already in the kernel.

There is nothing wrong with that. We only need to define what this API
should be used for to prevent rank growth. It could be an

When measuring cache-misses any additional (and in this case

This has nothing to do with a library. It is about entity and resource
management which is what os kernels are about. The virtual machine is
the entity (similar to a process) and we want to add additional access
channels and names to it.

        Joerg

--

From: Avi Kivity
Date: Wednesday, March 24, 2010 - 5:08 am

You can always provide the kernel and module paths as command line 
parameters.  It just won't be transparently usable, but if you're using 


If we make an API, I'd like it to be generally useful.

It's a total headache.  For example, we'd need security module hooks to 
determine access permissions.  So far we managed to avoid that since kvm 
doesn't allow you to access any information beyond what you provided it 

Copying the objects is a one time cost.  If you run perf for more than a 
second or two, it would fetch and cache all of the data.  It's really 

kvm.ko has only a small subset of the information that is used to define 
a guest.


--

From: Joerg Roedel
Date: Wednesday, March 24, 2010 - 5:50 am

I don't want the tool for myself only. A typical perf user expects that

Not necessarily. The perf event is configured to measure systemwide kvm
by userspace. The kernel side of perf takes care that it stays
system-wide even with added vm instances. So in this case the consumer
for the notifier would be the perf kernel part. No userspace interface

That's hard to do at this point since we don't know what people will use
it for. We should keep it simple in the beginning and add new features

Depends on how it is designed. A filesystem approach was already
mentioned. We could create /sys/kvm/ for example to expose information
about virtual machines to userspace. This would not require any new

I don't think we can cache filesystem data of a running guest on the

If two userspaces run in parallel what is the single instance where perf

The subset is not small. It contains all guest vcpus, the complete
interrupt routing hardware emulation and even manages the guest's
memory.

	Joerg

--

From: Avi Kivity
Date: Wednesday, March 24, 2010 - 6:05 am

Someone needs to know about the new guest to fetch its symbols.  Or do 

IMO this use case is too rare to warrant its own API, especially as there 

Who would set the security context on those files?  Plus, we need cgroup 

I don't see any choice.  The guest can change its symbols at any time 


It doesn't contain most of the mmio and pio address space.  Integration 
with qemu would allow perf to tell us that the guest is hitting the 
interrupt status register of a virtio-blk device in pci slot 5 (the 
information is already available through the kvm_mmio trace event, but 
only qemu can decode it).


--

From: Joerg Roedel
Date: Wednesday, March 24, 2010 - 6:46 am

Someone who uses libvirt and virt-manager by default is probably not
interested in this feature at the same level a kvm developer is. And
developers tend not to use libvirt for low-level kvm development.  A
number of developers have stated in this thread already that they would
appreciate a solution for guest enumeration that would not involve

The samples will be tagged with the guest-name (and some additional
information perf needs). Perf userspace can access the symbols then

An approach like: "The files are owned and only readable by the same
user that started the vm." might be a good start. So a user can measure

cgroup support is an issue but we can solve that too. It's in general

Yeah that would be interesting information. But it is more related to
tracing than to pmu measurements.
The information which you mentioned above is probably better
captured by an extension of trace-events to userspace.

	Joerg

--

From: Avi Kivity
Date: Wednesday, March 24, 2010 - 6:57 am

So would I.  But when I weigh the benefit of truly transparent 
system-wide perf integration for users who don't use libvirt but do use 
perf, versus the cost of transforming kvm from a single-process API to a 
system-wide API with all the complications that I've listed, it comes 
out in favour of not adding the API.


I take that as a yes?  So we need a virtio-serial client in the kernel 
(which might be exploitable by a malicious guest if buggy) and a 

That's not how sVirt works.  sVirt isolates a user's VMs from each 
other, so if a guest breaks into qemu it can't break into other guests 
owned by the same user.

The users who need this API (!libvirt and perf) probably don't care 

It's a tradeoff.  IMO, going through qemu is the better way, and also 

It's all related.  You start with perf, see a problem with mmio, call up 
a histogram of mmio or interrupts or whatever, then zoom in on the 
misbehaving device.


--

From: Joerg Roedel
Date: Wednesday, March 24, 2010 - 8:01 am

It's not a transformation, it's an extension. The current per-process
/dev/kvm stays mostly untouched. It's all about having something like
this:

$ cd /sys/kvm/guest0
$ ls -l
-r-------- 1 root root 0 2009-08-17 12:05 name
dr-x------ 1 root root 0 2009-08-17 12:05 fs
$ cat name
guest0
$ # ...
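
A tool could walk such a tree as in the Python sketch below. The /sys/kvm layout is only the proposal from this mail and does not exist in any kernel I know of; `list_guests` is a hypothetical helper that takes the root as a parameter so it can be tried against a mock directory.

```python
import os
import tempfile

def list_guests(root="/sys/kvm"):
    """Map each guest directory under `root` to the contents of its `name` file."""
    guests = {}
    if not os.path.isdir(root):
        return guests  # the proposed interface is not present on this host
    for entry in sorted(os.listdir(root)):
        name_file = os.path.join(root, entry, "name")
        try:
            with open(name_file) as f:
                guests[entry] = f.read().strip()
        except OSError:
            continue  # guest went away, or is not readable by this user
    return guests

if __name__ == "__main__":
    # Build a throwaway tree that mimics the proposed layout.
    with tempfile.TemporaryDirectory() as root:
        os.mkdir(os.path.join(root, "guest0"))
        with open(os.path.join(root, "guest0", "name"), "w") as f:
            f.write("guest0\n")
        print(list_guests(root))
```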


What I meant was: perf-kernel puts the guest-name into every sample and
perf-userspace accesses /sys/kvm/guest_name/fs/ later to resolve the
symbols. I leave the question of how the guest-fs is exposed to the host

If a vm breaks into qemu it can access the host file system which is the
bigger problem. In this case there is no isolation anymore. From that
context it can even kill other VMs of the same user independent of a

Yes, but it's different from the implementation point-of-view. For the
user it surely all plays together.

	Joerg

--

From: Avi Kivity
Date: Wednesday, March 24, 2010 - 8:12 am

How I see it: perf-kernel puts the guest pid into every sample, and 
perf-userspace uses that to resolve to a mountpoint served by fuse, or 

It cannot.  sVirt labels the disk image and other files qemu needs with 
the appropriate label, and everything else is off limits.  Even if you 

We need qemu to cooperate for mmio tracing, and we can cooperate with 
qemu for symbol resolution.  If it prevents adding another kernel API, 
that's a win from my POV.


--

From: Joerg Roedel
Date: Wednesday, March 24, 2010 - 8:46 am

I am not tied to /sys/kvm. We could also use /proc/<pid>/kvm/ for
example. This would keep anything in the process space (except for the

We need a bit more information than just the qemu-pid, but yes, this


That's true. Probably qemu can inject this information in the
kvm-trace-events stream.

	Joerg

--

From: Avi Kivity
Date: Wednesday, March 24, 2010 - 8:49 am

How about ~/.qemu/guests/$pid?


--

From: Joerg Roedel
Date: Wednesday, March 24, 2010 - 8:59 am

That makes it hard for perf to find it and even harder to get a list of
all VMs. With /proc/<pid>/kvm/guest we could symlink all guest
directories to /proc/kvm/ and perf reads the list from there. Also perf
can easily derive the directory for a guest from its pid.
Last but not least, it's kernel-created and thus independent from the
userspace part being used.

	Joerg

--

From: Avi Kivity
Date: Wednesday, March 24, 2010 - 9:09 am

Doesn't perf already have a dependency on naming conventions for finding 
debug information?


--

From: Joerg Roedel
Date: Wednesday, March 24, 2010 - 9:40 am

Not so trivial and even more likely to break. Even if perf has the pid of
the process and wants to find the directory it has to do:

1. Get the uid of the process
2. Find the username for the uid
3. Use the username to find the home-directory

Steps 2. and 3. need nsswitch and/or pam access to get this information
from whatever source the admin has configured. And depending on what the
source is it may be temporarily unavailable causing nasty timeouts. In
short, there are many weak parts in that chain making it more likely to
break.
A kernel-based approach with /proc/<pid>/kvm does not have those issues
(and to repeat myself, it is independent from the userspace being used).
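
The lookup chain above, written out as a Python sketch. The ~/.qemu/guests/<pid> path is the one suggested earlier in the thread; the `qemu_guest_dir` helper is hypothetical. In Python, `pwd.getpwuid` covers steps 2 and 3 in a single name-service lookup, which is exactly the part that depends on how the admin configured the system.

```python
import os
import pwd

def qemu_guest_dir(pid):
    """Resolve ~/.qemu/guests/<pid> for the user owning process `pid`.

    Returns None if any link in the chain fails.
    """
    try:
        uid = os.stat(f"/proc/{pid}").st_uid   # 1. uid of the process
        entry = pwd.getpwuid(uid)              # 2. uid -> user entry (name-service lookup)
    except (OSError, KeyError):
        return None
    home = entry.pw_dir                        # 3. user entry -> home directory
    return os.path.join(home, ".qemu", "guests", str(pid))

if __name__ == "__main__":
    print(qemu_guest_dir(os.getpid()))
```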

	Joerg

--

From: Avi Kivity
Date: Wednesday, March 24, 2010 - 9:47 am

It's true.  If the kernel provides something, there are fewer things 
that can break.  But if your system is so broken that you can't resolve 
uids, fix that before running perf.  Must we design perf for that case?

After all, 'ls -l' will break under the same circumstances.  It's hard 

It has other issues, which are IMO more problematic.


--

From: Avi Kivity
Date: Wednesday, March 24, 2010 - 9:52 am

Also, perf itself will hang if it needs to access a file using autofs or 
nfs, and those are broken.


--

From: Antoine Martin
Date: Thursday, April 8, 2010 - 7:29 am

uid to username can fail when using chroots, or worse point to an
incorrect location (and yes, I do use this)

Sorry if this has been covered / discussion has moved on. Just catching
up with the 500+ messages in my inbox..

--

From: Arnaldo Carvalho de Melo
Date: Wednesday, March 24, 2010 - 10:47 am

It looks at several places, from most symbol rich (/usr/lib/debug/, aka
-debuginfo packages, where we have full symtabs) to poorest (the
packaged binary, where we may just have a .dynsym).

In an ideal world, it would just get the build-id (a SHA1 cookie that is
in an ELF section inserted in every binary (aka DSOs), kernel module,
kallsyms or vmlinux file) and use that to look first in a local cache
(implemented in perf for a long time already) or in some symbol server.

For instance, for a random perf.data file I collected here in my machine
I have:

[acme@doppio linux-2.6-tip]$ perf buildid-list | grep libpthread
5c68f7afeb33309c78037e374b0deee84dd441f6 /lib64/libpthread-2.10.2.so
[acme@doppio linux-2.6-tip]$

So I don't have to access /lib64/libpthread-2.10.2.so directly, nor some
convention to get a debuginfo in a local file like:

/usr/lib/debug/lib64/libpthread-2.10.2.so.debug

Instead the tools look at:

[acme@doppio linux-2.6-tip]$ l ~/.debug/.build-id/5c/68f7afeb33309c78037e374b0deee84dd441f6
lrwxrwxrwx 1 acme acme 73 2010-01-06 18:53 /home/acme/.debug/.build-id/5c/68f7afeb33309c78037e374b0deee84dd441f6 -> ../../lib64/libpthread-2.10.2.so/5c68f7afeb33309c78037e374b0deee84dd441f6*

To find the file for that specific build-id, rather than the one installed on my
machine (or on a different machine, of a different architecture), which may be
completely unrelated: a newer build, or one for a different arch.
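
The cache layout in the listing above can be expressed as a small Python helper. This is a sketch based only on the path shown in this mail (first two hex digits as a directory, the rest as the entry name); `buildid_cache_path` is a hypothetical name, and the 40-digit check assumes the SHA1 form of build-id mentioned above.

```python
import os

def buildid_cache_path(build_id, cache_dir="~/.debug"):
    """Return the cache entry path for a SHA1 build-id given as 40 hex digits."""
    build_id = build_id.lower()
    if len(build_id) != 40 or any(c not in "0123456789abcdef" for c in build_id):
        raise ValueError("expected a 40-digit hex SHA1 build-id")
    return os.path.join(os.path.expanduser(cache_dir), ".build-id",
                        build_id[:2], build_id[2:])

if __name__ == "__main__":
    print(buildid_cache_path("5c68f7afeb33309c78037e374b0deee84dd441f6"))
```

The lookup key is the content identifier, not the file name, which is why a perf.data file recorded on another machine can still be resolved.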

- Arnaldo
--

From: Avi Kivity
Date: Wednesday, March 24, 2010 - 11:20 am

Thanks.  I believe qemu could easily act as a symbol server for this use 
case.


--

From: Arnaldo Carvalho de Melo
Date: Wednesday, March 24, 2010 - 11:27 am

Agreed, but it doesn't even have to :-)

We just need to get the build-id in the PERF_RECORD_MMAP event somehow
and then get this symbol from elsewhere, say the same DVD/RHN
channel/Debian Repository/embedded developer toolkit image not
stripped/whatever.

Or it may already be in the local cache from last week's perf report
session :-)

- Arnaldo
--

From: Zhang, Yanmin
Date: Thursday, March 25, 2010 - 2:00 am

I spent a couple of days investigating why sshfs/fuse doesn't work well with
procfs and sysfs. Just as my patch against fuse was almost ready, I found that
fuse already supports such access via direct I/O. With the parameter -o direct_io,
it works well.

Here is an example of mounting / from a guest os.
#sshfs -p 5551 -o direct_io localhost:/ guestmount

We can read files and write files if permission is ok.

I will go ahead and add support for parsing statistics from multiple guest os instances.

Yanmin


--

From: Daniel P. Berrange
Date: Wednesday, March 24, 2010 - 8:26 am

No it can't. With sVirt every single VM has a custom security label and
the policy only allows it access to disks / files with a matching label,
and prevents it attacking any other VMs or processes on the host. This
confines the scope of any exploit in QEMU to those resources the admin
has explicitly assigned to the guest.

Regards,
Daniel
-- 
|: Red Hat, Engineering, London    -o-   http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :|
|: http://autobuild.org        -o-         http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505  -o-   F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|
--

From: Joerg Roedel
Date: Wednesday, March 24, 2010 - 8:37 am

Even better. So a guest which breaks out can't even access its own
/sys/kvm/ directory. Perfect, it doesn't need that access anyway.

	Joerg

--

From: Avi Kivity
Date: Wednesday, March 24, 2010 - 8:43 am

But what security label does that directory have?  How can we make sure 
that whoever needs access to those files, gets them?

Automatically created objects don't work well with that model.  They're 
simply missing information.


--

From: Joerg Roedel
Date: Wednesday, March 24, 2010 - 8:50 am

If we go the /proc/<pid>/kvm way then the directory should probably
inherit the label from /proc/<pid>/?
Same could be applied to /sys/kvm/guest/ if we decide for it. The VM is
still bound to a single process with a /proc/<pid> after all.

	Joerg

--

From: Avi Kivity
Date: Wednesday, March 24, 2010 - 8:52 am

That's a security policy.  The security people like their policies 
outside the kernel.

For example, they may want a label that allows a trace context to read 

Ditto.


--

From: Joerg Roedel
Date: Wednesday, March 24, 2010 - 9:17 am

Hm, I am not a security expert. But isn't this just one more entity for
sVirt to handle? I would leave that decision to the sVirt developers.
Does attaching the same label as for the VM resources mean that root
could not access it anymore?

	Joerg

--

From: Avi Kivity
Date: Wednesday, March 24, 2010 - 9:20 am

IIUC processes run under a context, and there's a policy somewhere that 
tells you which context can access which label (and with what 
permissions).  There was a server on the Internet once that gave you 
root access and invited you to attack it.  No idea if anyone succeeded 
or not (I got bored after about a minute).

So it depends on the policy.  If you attach the same label, that means 
all files with the same label have the same access permissions.  I think.


--

From: Joerg Roedel
Date: Wednesday, March 24, 2010 - 9:31 am

So if this is true we can introduce a 'trace' label and add all contexts
that should be allowed to trace to it.
But we probably should leave the details to the security experts ;-)

	Joerg

--

From: Avi Kivity
Date: Wednesday, March 24, 2010 - 9:32 am

That's just what I want to do.  Leave it in userspace and then they can 
deal with it without telling us about it.


--

From: Joerg Roedel
Date: Wednesday, March 24, 2010 - 9:45 am

They can't do that with a directory in /proc?

--

From: Avi Kivity
Date: Wednesday, March 24, 2010 - 9:48 am

I don't know.


--

From: Peter Zijlstra
Date: Wednesday, March 24, 2010 - 9:03 am

I'd much prefer a pid, as suggested later; it keeps the samples smaller.

But that said, we need guest kernel events like mmap and context
switches too, otherwise we simply can't make sense of guest userspace
addresses, we need to know the guest address space layout.

So aside from a filesystem content, we first need mmap and context
switch events to find the files we need to access.

And while I appreciate all the security talk, it's basically pointless:
the host can access it anyway, everybody agrees on that, but
still you're arguing the case..
--

From: Avi Kivity
Date: Wednesday, March 24, 2010 - 9:16 am

This only works for the guest kernel, we don't know anything about guest 

root can access anything, but we're not talking about root.  The idea is 
to protect against a guest that has exploited its qemu and is now 
attacking the host and its fellow guests.   uid protection is no good 
since we want to isolate the guest from host processes belonging to the 
same uid and from other guests running under the same uid.

[1] We can find out guest pids if we teach the kernel what to 
dereference, i.e. gs:offset1->offset2->offset3.  Of course this varies 
from kernel to kernel, so we need some kind of bytecode that we can run 
in perf nmi context.  Kind of what we need to run an unwinder for 
-fomit-frame-pointer.
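
The gs:offset1->offset2->offset3 idea can be sketched as a tiny "program" consisting of a list of offsets that is interpreted against guest memory. This is only a user-space simulation under stated assumptions: `read_word`, the fake memory contents and the offsets are all hypothetical, and nothing here reflects how such bytecode would actually run in perf NMI context.

```python
def follow_chain(read_word, base, offsets):
    """Dereference base+offsets[0], then result+offsets[1], and so on.

    read_word: callable returning the word stored at a given address.
    Returns the value read by the final dereference.
    """
    value = base
    for off in offsets:
        value = read_word(value + off)
    return value


# Fake guest memory: address -> word. Entirely made up for the example.
memory = {
    0x1000 + 0x10: 0x2000,  # gs:offset1 -> pointer to current task
    0x2000 + 0x08: 0x3000,  # task + offset2 -> pointer to another struct
    0x3000 + 0x04: 1234,    # struct + offset3 -> the pid itself
}

gs_base = 0x1000
pid_program = [0x10, 0x08, 0x04]  # per-kernel-version "bytecode"
print(follow_chain(memory.__getitem__, gs_base, pid_program))
```

Because the offsets vary from kernel to kernel, only the offset list would need to change per guest, not the interpreter.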

-- 
error compiling committee.c: too many arguments to function

--

From: Joerg Roedel
Date: Wednesday, March 24, 2010 - 9:23 am

With the filesystem approach all we need is the pid of the guest
process. Then we can access /proc/<pid>/maps of the guest and read out the
address space layout, no?
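
For reference, a minimal sketch of reading an address space layout out of maps-format text. It assumes the usual Linux /proc/<pid>/maps line layout (range, perms, offset, dev, inode, optional path); the sample text is invented, and how the guest's file would reach the host is left open.

```python
def parse_maps(text):
    """Parse /proc/<pid>/maps-style text into a list of mappings.

    Each mapping is (start, end, perms, file_offset, path); path is ""
    for anonymous mappings.
    """
    mappings = []
    for line in text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        start, end = (int(x, 16) for x in fields[0].split("-"))
        path = fields[5].strip() if len(fields) > 5 else ""
        mappings.append((start, end, fields[1], int(fields[2], 16), path))
    return mappings


def lookup(mappings, addr):
    """Return (path, file offset) for `addr`, or None if it is unmapped."""
    for start, end, _perms, offset, path in mappings:
        if start <= addr < end:
            return path, offset + (addr - start)
    return None


SAMPLE = """\
00400000-0040b000 r-xp 00000000 08:01 1234                 /bin/cat
7f0000000000-7f0000020000 r-xp 00001000 08:01 5678         /lib/libc.so.6
7ffd00000000-7ffd00021000 rw-p 00000000 00:00 0            [stack]
"""

maps = parse_maps(SAMPLE)
print(lookup(maps, 0x400123))
```

This gives only a snapshot of the layout at the moment the file was read.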

	Joerg

--

From: Peter Zijlstra
Date: Wednesday, March 24, 2010 - 9:45 am

No, what if it maps new things after you read it? But still getting the
pid of the guest process seems non trivial without guest kernel support.
--

From: Alexander Graf
Date: Wednesday, March 24, 2010 - 6:53 am

How about we add a virtio "guest file system access" device? The guest
would then expose its own file system using that device.

On the host side this would simply be a -virtioguestfs
unix:/tmp/guest.fs and you'd get a unix socket that gives you full
access to the guest file system by using commands. I envision something
like:

SEND: GET /proc/version
RECV: Linux version 2.6.27.37-0.1-default (geeko@buildhost) (gcc version
4.3.2 [gcc-4_3-branch revision 141291] (SUSE Linux) ) #1 SMP 2009-10-15
14:56:58 +0200

Now all we need is integration in perf to enumerate virtual machines
based on libvirt. If you want to run qemu-kvm directly, just go with
--guestfs=/tmp/guest.fs and perf could fetch all required information
automatically.
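
A host-side client for such a socket could look roughly like the sketch below. The protocol details are assumptions layered on the SEND/RECV example above: a single newline-terminated "GET <path>" request, with the reply read until the peer closes the connection. No such device or framing exists; `request` and `guestfs_get` are hypothetical names.

```python
import socket


def request(sock, guest_path):
    """Send one GET request on a connected socket and return the reply text."""
    sock.sendall(f"GET {guest_path}\n".encode())
    sock.shutdown(socket.SHUT_WR)  # signal that the request is complete
    chunks = []
    while True:
        data = sock.recv(4096)
        if not data:  # peer closed the connection: reply is complete
            break
        chunks.append(data)
    return b"".join(chunks).decode()


def guestfs_get(socket_path, guest_path):
    """Connect to the (hypothetical) guest fs unix socket and fetch a file."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        return request(sock, guest_path)
```

With the example above, `guestfs_get("/tmp/guest.fs", "/proc/version")` would return the guest's version string, provided something on the other end speaks this made-up framing.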

This should solve all issues while staying 100% in user space, right?


Alex

--

From: Avi Kivity
Date: Wednesday, March 24, 2010 - 6:59 am

The idea is to use a dedicated channel over virtio-serial.  If the 

Yeah, needs a fuse filesystem to populate the host namespace (kind of 
sshfs over virtio-serial).

-- 
error compiling committee.c: too many arguments to function

--

From: Alexander Graf
Date: Wednesday, March 24, 2010 - 7:24 am

The file server being a kernel module inside the guest? We want to be
able to serve things as early and hassle free as possible, so in this

I don't see why we need a fuse filesystem. We can of course create one
later on. But for now all you need is a user connecting to that socket.


Alex


--

From: Avi Kivity
Date: Wednesday, March 24, 2010 - 8:06 am

No, just a daemon.  If it's important enough we can get distributions to 
package it by default, and then it will be hassle free.  If "early 
enough" is also so important, we can get it to start up on initrd.  If 

If the perf app knows the protocol, no problem.  But leave perf with 
pure filesystem access and hide the details in fuse.

-- 
error compiling committee.c: too many arguments to function

--

From: Andi Kleen
Date: Tuesday, March 23, 2010 - 10:09 pm

Agreed. I especially would like to see instruction/branch tracing
working this way.  This would give a lot of the benefits of a simulator on
a real CPU.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Avi Kivity
Date: Tuesday, March 23, 2010 - 11:42 pm

If you're profiling a single guest it makes more sense to do this from 
inside the guest - you can profile userspace as well as the kernel.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Andi Kleen
Date: Wednesday, March 24, 2010 - 12:38 am

I'm interested in debugging the guest without guest cooperation.

In many cases qemu's new gdb stub works for that, but in some cases
I would prefer instruction/branch traces over standard gdb style
debugging.

I used to use that very successfully with simulators in the past
for some hard bugs.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Avi Kivity
Date: Wednesday, March 24, 2010 - 1:59 am

Isn't gdb supposed to be able to use branch traces?  It makes sense to 
expose them via the gdb stub then.  Not to say an external tool doesn't 
make sense.


-- 
error compiling committee.c: too many arguments to function

--

From: Andi Kleen
Date: Wednesday, March 24, 2010 - 2:31 am

AFAIK not. The ptrace interface is only used by idb I believe.
I might be wrong on that.

Not sure if there is even a remote protocol command for 
branch traces either.

There's a concept of "tracepoints" in the protocol, but it 

Ok that would work for me too. As long as I can set start/stop
triggers and pipe the log somewhere it's fine for me.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Avi Kivity
Date: Monday, March 22, 2010 - 7:46 am

Sigh, why am I drawn into this.

A person who uses dishonest arguments is a dishonest person.  When you 
say I use a dishonest argument you are implying I am dishonest.  Why do 
you argue with me at all if you think I am trying to cheat?

If you disagree with me, tell me I am wrong, not dishonest (or that my 
arguments are dishonest).  And this is just one example in this thread.  
Seriously, tools/kvm would cause a loss of developers, not a gain, 
simply because of the style of argument of some people on this list.  
Maybe qemu/kernels is a better idea.

Again, if you want to talk to me, use the same language you'd like to 
hear yourself.  Or maybe years of lkml made you so thick skinned you no 
longer understand how to interact with people.

-- 
error compiling committee.c: too many arguments to function

--

From: Ingo Molnar
Date: Monday, March 22, 2010 - 9:08 am

That's not how i understood that phrase - and i did not mean to suggest that 
you are dishonest and i do not think that you are dishonest (to the contrary).

Thanks,

	Ingo
--

From: Avi Kivity
Date: Monday, March 22, 2010 - 9:13 am

Word games.

-- 
error compiling committee.c: too many arguments to function

--

From: Paolo Bonzini
Date: Wednesday, March 24, 2010 - 5:06 am

This third category is pretty well served by virt-manager.  It has its 
quirks and shortcomings, but at least it exists.

Paolo
--

From: Ingo Molnar
Date: Sunday, March 21, 2010 - 3:00 pm

If that is the theory then it has failed to trickle through in practice. As 
you know i have reported a long list of usability problems with hardly a look. 
That list could be created by pretty much anyone spending a few minutes of 
getting a first impression with qemu-kvm.

So something is seriously wrong in KVM land, to pretty much anyone trying it 
for the first time. I have explained how i see the root cause of that, while 
you seem to suggest that there's nothing wrong to begin with. I guess we'll 
have to agree to disagree on that.

Thanks,

	Ingo
--

From: Anthony Liguori
Date: Sunday, March 21, 2010 - 4:50 pm

I think the point you're missing is that your list was from the 
perspective of someone looking at a desktop virtualization solution that 
was graphically oriented.

As Avi has repeatedly mentioned, so far, that has not been the target 
audience of QEMU.  The target audience tends to be: 1) people looking to 
do server virtualization and 2) people looking to do command line based 
development.

Usually, both (1) and (2) are working on machines that are remotely 
located.  What's important to these users is that VMs be easily 
launchable from the command line, that there is a lot of flexibility in 
defining machine types, and that there is a programmatic way to interact 
with a given instance of QEMU.  Those are the things that we've been 
focusing on recently.

The reason we don't have better desktop virtualization support is 
simple.  No one is volunteering to do it and no company is funding 
development for it.

When you look at something like VirtualBox, what you're looking at is a 
long ago forked version of QEMU with a GUI added focusing on desktop 
virtualization.

There is no magic behind adding a better, more usable GUI to QEMU.  It 
just takes resources.

I understand that you're trying to make the point that without catering 
to the desktop virtualization use case, we won't get as many developers 
as we could.  Personally, I don't think that argument is accurate.  If 
you look at VirtualBox, its performance is terrible.  Having a nice GUI 
hasn't gotten them the type of developers that can improve their 
performance.

No one is arguing that we wouldn't like to have a nicer UI.  I'd love to 
merge any patch like that.

Regards,

Anthony Liguori

--

From: Anthony Liguori
Date: Sunday, March 21, 2010 - 5:25 pm

Can you transfer your list to the following wiki page:

http://wiki.qemu.org/Features/Usability

This thread is so large that I can't find your note that contained the 
initial list.

I want to make sure this input doesn't die once this thread settles down.

Regards,

Anthony Liguori

--

From: Avi Kivity
Date: Monday, March 22, 2010 - 12:18 am

It does happen in practice, just not in the GUI areas, since no one is 
working on them.  I am not going to condition a qcow2 reliability fix to 

Not anyone trying it for the first time.  RHEV-M users will see a 
polished GUI that can be used to manage thousands of guests and hosts.  
I presume IBM and Siemens (and all other contributors) users will also 
enjoy a good user experience with their respective products.  Qemu is 
not the only GUI for kvm.

So far only one company was interested in a qemu GUI - the makers of 
virtualbox.  Unfortunately they chose not to contribute that back to 
qemu, and no one was sufficiently motivated to pick out the bits and try 
to merge them.

Again, if you are interested in a qemu GUI, you either have to write it 
yourself or convince someone else to do it.  My own plate is full and my 
priorities are clear.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Anthony Liguori
Date: Thursday, March 18, 2010 - 9:11 am

Using DRM doesn't help very much.  You still need an X driver and most 
of the operations you care about (video rendering, window movement, etc) 
are not operations that need to go through DRM.

3D graphics virtualization is extremely difficult in the non-passthrough 
case.  It really requires hardware support that isn't widely available 

It doesn't provide the things we need to a good user experience.  You 
need things like an absolute input device, host driven display resize, 
RGBA hardware cursors.  None of these go through DRI and it's those 

I don't know why you keep saying this.  The people who are in these 
"separate communities" keep claiming that they don't feel this way.

I'm not just saying this to be argumentative.  Many of the people in the 
community have thought this same thing, and tried it themselves, and 
we've all come to the same conclusion.

It's certainly possible that we just missed the obvious thing to do but 

If this is true, please demonstrate it.  Prove your point with patches 

Nah, instead we can just have a few-hundred-mail thread on the list.  
Otherwise we'd have to write patches and do other kinds of productive work.

Regards,


--

From: Ingo Molnar
Date: Thursday, March 18, 2010 - 9:28 am

For the full-screen case (which is a very common mode of using a guest OS on 
the desktop) there's not much window management needed. You need to 


With KMS the display resize is in the kernel. Cursor management is not. Yet: i 
think it would be a nice feature as the cursor could move even if Xorg is 

If you are not two separate communities but one community, then why do you go 
through the (somewhat masochistic) self-punishing exercise of keeping the 
project in two different pieces?

In a distant past Qemu was a separate project and KVM was just a newcomer who 
used it for fancy stuff. Today as you say(?) the two communities are one and 

I'm not aware of anyone in the past having attempted to move qemu to 
tools/kvm/ in the upstream kernel repo, and having reported on the experiences 
with such a contribution setup. (obviously it's not possible at all without 
heavy cooperation and acceptance from you and Avi, so this will probably 
remain a thought experiment forever)

If so, then you must be referring to previous attempts to 'strip down' Qemu, right? Those 
attempts didn't really solve the fundamental problem of project code base 
separation.

	Ingo
--

From: Paul Mundt
Date: Friday, March 19, 2010 - 2:19 am

Implementing a virtualized DRM/KMS driver would at least get you the
framebuffer interface more or less for free, while allowing you to deal
with the userspace side of things incrementally (ie, running a dummy xorg
on top of the virtualized fbdev until the DRI side catches up). It would
None of these things negate the benefit one would get from a virtualized
DRM/KMS driver either. There are multiple problems that need solving in
this area, and it's a bit disingenuous to discount a valid suggestion out
of hand due to the fact it doesn't solve all of the outstanding issues.
--

From: Olivier Galibert
Date: Friday, March 19, 2010 - 2:52 am

Guys, have a look at Gallium.  In many ways it's a pile of crap, but
at least it's a pile of crap designed by vmware for *exactly* your
problem space.

  OG.
--

From: Konrad Rzeszutek Wilk
Date: Friday, March 19, 2010 - 6:56 am

Or perhaps Chromium, which was designed years ago and can pass-through
OpenGL commands via a pipe. VirtualBox uses it for their PV drivers.
Naturally it is not a FB, just an OpenGL command pass-through interface.
--

From: Anthony Liguori
Date: Thursday, March 18, 2010 - 7:53 am

Why does Linux AIO still suck?  Why do we not have a proper interface in 
userspace for doing asynchronous file system operations?

Why don't we have an interface in userspace to do zero-copy transmit and 
receive of raw network packets?

The lack of a decent userspace API for asynchronous file system 
operations is a huge usability problem for us.  Take a look at the 
complexity of our -drive option.  It's all because the kernel gives us 
sucky interfaces.

Regards,

Anthony Liguori
--

From: Avi Kivity
Date: Thursday, March 18, 2010 - 9:54 am

I think you're increasing the height of that wall by arguing that a 
userspace project cannot be successful because its development process 
sucks and the only way to fix it is to put it into the kernel where 
people know so much better.  Instead we kernel developers should listen 
to requirements from users, even if their code isn't in tools/.

-- 
error compiling committee.c: too many arguments to function

--

From: Ingo Molnar
Date: Thursday, March 18, 2010 - 10:11 am

No, it's tearing down that wall because finally, instead of providing rather 
abstract system calls that are designed perfectly, the kernel can operate by 
providing useful libraries and apps.

At least in the context i've worked in, it has torn down walls and has improved 
the efficiency of working on ABIs towards user-space. (sysprof is an example 
of that)

Kernel developers are finally faced with user-space development directly, in 
the same repository, using the same rules of contribution.

Non-kernel-hosted apps win from that process too, as even if they don't 
integrate (because they don't want to or cannot for license reasons) they can 
participate in a more direct (and more practical) exchange with kernel 
developers. They can contribute a new system call and create a library 
function for it straight away.

	Ingo
--

From: Ingo Molnar
Date: Thursday, March 18, 2010 - 9:13 am

Good that you mention it, i think it's an excellent example.

The suckage of kernel async IO is for similar reasons: there's an ugly package 
separation problem between the kernel and between glibc - and between the apps 
that would make use of it.

( With the separated libaio it was made worse: there were 3 libraries to
  work with, and even fewer applications that could make use of it ... )

So IMO klibc is an arguably good idea - eventually hpa will get around to posting 
it for upstream merging again. Then we could offer both new libraries much 
faster, and could offer things like comprehensive AIO used pervasively within 

If you had your bits in tools/kvm/ you could make a strong case for a good 
kaio implementation - coupled with an actual, working use-case. ( You could 
use the raw syscall even without klibc. )

We could see the arguments on lkml turn from:

   'do we want this and it will take years to propagate this into apps'

into something like:

   ' Exactly how much faster does kvm go? and I'd get it straight away with my
     next kernel update tomorrow? Wow! '

Ok, i exaggerated a bit - but you get the idea. It's a much different picture 
when kernel developers and maintainers see an actual use-case, _right in the 
kernel repo they work with every day_.

Currently there's a wall between kernel developers and user-space developers, 
and there's somewhat of an element of fear and arrogance on both sides. For 
efficient technology such walls need to be torn down and people need a bit more 
experience with each other's areas.

	Ingo
--

From: Anthony Liguori
Date: Thursday, March 18, 2010 - 11:20 am

And why wouldn't the kernel developers produce posix-aio within klibc?

posix-aio is also a really terrible interface (although not as bad as 
linux-aio).

The reason boils down to the fact that these interfaces are designed 
without interacting with the consumers.  Part of the reason for that is 
the attitude of the community.

You approached this discussion with, "QEMU/KVM sucks, you should move 
into the kernel because we're awesome and we'd fix everything in a heart 
beat".  That attitude does not result in any useful collaboration.

Had you started trying to understand what the problems that we face are 
and whether there's anything that can be done in the kernel to improve 
it, it would have been an entirely different discussion.

The sad thing is, QEMU is probably one of the most demanding free 
software applications out there today with respect to performance.  We 
consume IO interfaces and things like large pages in a deeper 
way than just about any application out there.

We've been trying for years to improve Linux interfaces 
but we've not had many people in the kernel community be receptive.

We've failed to improve the userspace networking interfaces.  Compare 
Rusty's posting of vringfd to vhost-net.  They are the same interface 
except we tried to do something more generally useful with vringfd and 
it was shot down because it was "yet another kernel/userspace data 
transfer interface".  Unfortunately, we're learning that if we claim 
something is virtualization specific, we avoid a lot of the kernel 
bureaucracy.  My concern is that over time, we'll have more things like 
vhost and that's bad for everyone.

Regards,

Anthony Liguori

--

From: Gabor Gombas
Date: Sunday, March 21, 2010 - 6:27 am

No, kernel async IO sucks because it still does not play well with
buffered I/O. Last time I checked (about a year ago or so), AIO syscall
latencies were much worse when buffered I/O was used compared to direct
I/O. Unfortunately, to achieve good performance with direct I/O, you
need a HW RAID card with lots of on-board cache.

Gabor
--

From: Zachary Amsden
Date: Thursday, March 18, 2010 - 2:02 pm

Ingo, what you miss is that this is not a bad thing.  Fact of the matter 
is, it's not just painful, it downright sucks.

This is actually a Good Thing (tm).  It means you have to get your 
feature and its interfaces well defined and able to version forwards and 
backwards independently from each other.  And that introduces some 
complexity and time and testing, but in the end it's what you want.  You 
don't introduce a requirement to have the feature, but take advantage of 
it if it is there.
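
One common way to "take advantage of it if it is there" without requiring it is runtime feature detection with a fallback. The sketch below is only an illustration of that pattern, not something proposed in this thread: it uses `os.sendfile` when the platform provides it and it works for the given files, and otherwise falls back to a plain read/write loop. `copy_file` is a hypothetical helper name.

```python
import os


def copy_file(src_path, dst_path):
    """Copy a file, using os.sendfile opportunistically.

    Returns "sendfile" or "fallback" to show which path was taken.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        size = os.fstat(src.fileno()).st_size
        if hasattr(os, "sendfile"):  # feature may be absent on this platform
            try:
                offset = 0
                while offset < size:
                    sent = os.sendfile(dst.fileno(), src.fileno(),
                                       offset, size - offset)
                    if sent == 0:
                        break
                    offset += sent
                return "sendfile"
            except OSError:
                # Present but unusable for these files: start over cleanly.
                src.seek(0)
                dst.seek(0)
                dst.truncate()
        while chunk := src.read(65536):
            dst.write(chunk)
        return "fallback"
```

The caller gets the same result either way; only the speed differs, so the newer interface never becomes a hard requirement.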

It may take everyone else a couple years to upgrade the compilers, 
tools, libraries and kernel, and by that time any bugs introduced by 
interacting with this feature will have been ironed out and their 
patterns well known.

If you haven't well defined and carefully thought out the feature ahead 
of time, you end up creating a giant mess, possibly the need for nasty 
backwards compatibility (case in point: COMPAT_VDSO).  But in the end, 
you would have made those same mistakes on your internal tree anyway, 
and then you (or likely, some other hapless project maintainer for the 
project you forked) would have to go add the features, fixes and 
workarounds back to the original project(s).  However, since you 
developed in an insulated sheltered environment, those fixes and 
workarounds would not be robust and independently versionable from each 
other.

The result is you've kept your codebase version-neutral, forked in 
outside code, enhanced it, and left the hard work of backporting those 
changes and keeping them version-safe to the original package 
maintainers you forked from.  What you've created is no longer a single 
project, it is called a distro, and you're being short-sighted and 
anti-social to think you can garner more support than all of those 
individual packages you forked.  This is why most developers work 
upstream and let the goodness propagate down from the top like molten 
sugar of each granular package on a flan where it is collected from the 
rich custard ...
--

From: Ingo Molnar
Date: Thursday, March 18, 2010 - 2:15 pm

Our experience is the opposite, and we tried both variants and report about 
our experience with both models honestly.

You only have experience about one variant - the one you advocate.


Sorry, but this is plain not true. The 2.4->2.6 kernel cycle debacle has taught 
us that waiting long to 'iron out' the details has the following effects:

 - developer pain
 - user pain
 - distro pain
 - disconnect
 - loss of developers, testers and users
 - grave bugs discovered months (years ...) down the line
 - untested features
 - developer exhaustion

It didn't work, trust me - and i've been around long enough to have suffered 
through the whole 2.5.x misery. Some of our worst ABIs come from that cycle as 
well.

So we first created the 2.6.x process, then as we saw that it worked much 
better we _sped up_ the kernel development process some more, to what many 
claimed was an impossible, crazy pace: two weeks merge window, 2.5 months 
stabilization and a stable release every 3 months.

And you can also see the countless examples of carefully drafted, well thought 
out, committee written computer standards that were honed for years, which are 
not worth the paper they are written on.

'extra time' and 'extra bureaucratic overhead to think things through' is about 
the worst thing you can inject into a development process.

You should think about the human brain as a cache - the 'closer' things are 
both in time and physically, the better they end up being. Also, the more 
gradual, the more concentrated a thing is, the better it works out in general. 
This is part of the basic human nature.

Sorry, but i really think you are trying to rationalize a disadvantage 
here ...

	Ingo
--

From: Zachary Amsden
Date: Thursday, March 18, 2010 - 3:19 pm

You're talking about a single project and comparing it to my argument 
about multiple independent projects.  In that case, I see no point in 
the discussion.  If you want to win the argument by strawman, you are 

This could very well be true, but until someone comes forward with 
compelling numbers (as in, developers committed to working on the 
project, number of patches and total amount of code contribution), there 
is no point in having an argument, there really isn't anything to 
discuss other than opinion.  My opinion is you need a really strong 
justification to have a successful fork and I don't see that justification.

Zach
--

From: Ingo Molnar
Date: Thursday, March 18, 2010 - 3:44 pm

The kernel is a very complex project with many ABI issues, so all those 
arguments apply to it as well. The description you gave:

 | This is actually a Good Thing (tm).  It means you have to get your feature 
 | and its interfaces well defined and able to version forwards and backwards 
 | independently from each other.  And that introduces some complexity and 
 | time and testing, but in the end it's what you want.  You don't introduce a 
 | requirement to have the feature, but take advantage of it if it is there.

matches the kernel too. We have many such situations. (Furthermore, the 
tools/perf/ situation, which relates to ABIs and user-space/kernel-space 
interactions is similar as well.)


I can give you rough numbers for tools/perf - if that counts for you.

For the first four months of its existence, when it was a separate project, i 
had a single external contributor IIRC.

The moment it went into the kernel repo the number of contributors and 
contributions skyrocketed and basically all contributions were top-notch. We 
are at 60+ separate contributors now (after about 8 months upstream) - which 
is still small compared to the kernel or to Qemu, but huge for a relatively 
isolated project like instrumentation.

So in my estimation tools/kvm/ would certainly be popular. Whether it would be 
more popular than current Qemu is hard to tell - it would be pure speculation.

Any reliable numbers for the other aspect, whether a split project creates a 
more fragile and less developed ABI would be extremely hard to get. I believe 
it to be true, but that's my opinion based on my experience with other 
projects, extrapolated to KVM/Qemu.

Anyway, the issue is moot as there's clear opposition to the unification idea. 

Too bad - there was heavy initial opposition to the arch/x86 unification as 
well [and heavy opposition to tools/perf/ as well], still both worked out 
extremely well :-)

	Ingo
--

From: Avi Kivity
Date: Friday, March 19, 2010 - 12:21 am

Did you forget that arch/x86 was a merging of a code fork that happened 
several years previously?  Maybe that fork shouldn't have been done to 
begin with.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Andrea Arcangeli
Date: Saturday, March 20, 2010 - 7:59 am

We discussed and probably timidly tried to share the sharable
initially but we realized it was too wasteful of time. In addition to
having to adapt the code to 64bit we would also have had to constantly
solve another problem on top of it (see the various splits on _32/_64,
those take time to achieve, maybe not a huge time but still definitely
some time and effort). Even in retrospect I am quite sure the way
x86-64 happened was optimal and if we went back we would do it
again the exact same way even if the final objective was to have a common
arch/x86 (and thankfully Linus is flexible and smart enough to realize
that code that isn't risking to destabilize anything shouldn't be
forced out just because it's not to a totally
theoretical-perfect-nitpicking-clean-state yet). It's still a lot of
work to do the unification later as a separate task, but it's not as if, had
we did it immediately it would have been a lot less work. It's about
the same amount of effort and we were able to defer it for later and
decrease the time to market which surely has contributed to the
success of x86-64.

Problem of qemu is not some lack of GUI or that it's not included in
the linux kernel git tree, the definitive problem is how to merge
qemu-kvm/kvm and qlx into it. If you (Avi) were the qemu maintainer I
am sure there wouldn't be two trees so as a developer I would totally
love it, and I am sure that with you as maintainer it would have a
chance to move forward with qlx on desktop virtualization without
proposing to extend vnc instead to achieve a "similar" result (imagine
if btrfs were published on a website and people started to discuss whether it
should ever be merged because reinventing some part of btrfs
inside ext5 might achieve ""similar"" results).

About a GUI for KVM to use on desktop distributions, that is an
irrelevant concern compared to the lack of a protocol more efficient
than rdesktop/rdp/vnc for desktop virtualization. I've people asking
me to migrate hundreds of desktops to desktop virtualization on ...
--

From: Avi Kivity
Date: Sunday, March 21, 2010 - 3:03 am

In hindsight decisions are much easier.  I agree it was less risky to 
fork than to share.  But if another instruction set forks out a 64-bit 
not-exactly-compatible variant, I'm sure we'll start out shared and not 

The qemu/qemu-kvm fork is definitely hurting.  Some history: when kvm 
started out I pulled qemu for fast hacking and, much like arch/x86_64, I 
couldn't destabilize qemu for something that was completely experimental 
(and closed source at the time).  Moreover, it wasn't clear if the qemu 
community would be interested.

The qemu-kvm fork was designed for minimal intrusion so I could merge 
upstream qemu regularly.  This resulted in kvm integration that was 
fairly ugly.  Later Anthony merged a well-integrated alternative 
implementation (in retrospect this was a mistake IMO - we were left with 
a well tested high performing ugly implementation and a clean, slow, 
untested, and unfeatured implementation, and no one who wants to merge 
the two).  So now it is pretty confusing to read the code which has the 


Anyone can focus on what interests them, if someone has an interest in a 
good desktop-on-desktop experience they should start hacking and sending 
patches.

-- 
error compiling committee.c: too many arguments to function

--

From: Ingo Molnar
Date: Thursday, March 18, 2010 - 2:22 am

To the contrary, experience shows that repository location, and in particular 
a shared repository for closely related bits is very much material!

It matters because when there are two separate projects, even a "serious 
developer" finds it doubly or triply difficult to contribute even 
trivial changes.

It becomes literally a nightmare if you have to touch 3 packages: kernel, a 
library and an app codebase. It takes _forever_ to get anything useful done.

Also, 'focus on a single thing' is a very basic aspect of humans, especially 
those who do computer programming. Working on two code bases in two 
repositories at once can be very challenging physically and psychically.

So what i've seen is that OSS programmers tend to pick a side, pretty much 
randomly, and then rationalize it in hindsight why they prefer that side ;-)

Most of them become either a kernel developer or a user-space package 
developer - and then they specialize on that field and shy away from changes 
that involve both. It's a basic human thing to avoid the hassle that comes 
with multi-package changes. (One really has to be outright stupid, fanatic or 
desperate to even attempt such changes these days - such are the difficulties 
for a comparatively low return.)

The solution is to tear down such artificial walls of contribution where 
possible. And tearing down the wall between KVM and qemu-kvm seems very much 
possible and the advantages would be numerous.

Unless by "serious developer" you meant: "developer willing to [or forced to] 

Then you'll be surprised to hear that it's happening as we speak and the 
commits are there in linux-2.6.git. Both a TUI and a GUI are in the works.

Furthermore, the numbers show that half of the usability fixes to tools/perf/ 
came not from regular perf contributors but from random kernel developers and 
testers who when they build the latest kernel and try out perf at the same 
time (it's very easy because you already have it in the kernel repository - no ...
--

From: Avi Kivity
Date: Thursday, March 18, 2010 - 3:32 am

You can't be serious.  I find that the difficulty in contributing a 
patch has mostly to do with writing the patch, and less with figuring 

Indeed, working simultaneously on two different projects is difficult.  
I usually work for a while on one, and then 'cd', physically and 
psychically, to the other.  Then switch back.  Sort of like the 

We have a large number of such stupid, fanatic, desperate developers in 

By "serious developer" I mean

  - someone who is interested in contributing, not in getting their name 
into the kernel commits list
  - someone who is willing to read the wiki page and find out where the 
repository and mailing list for a project is
  - someone who will spend enough time on the project so that the time 
to clone two repositories will not be a factor in their contributions
  - someone who will work on the uncool stuff like fixing bugs and 

Let's wait and see then.  If the tools/perf/ experience has really good 
results, we can reconsider this at a later date.

-- 
error compiling committee.c: too many arguments to function

--

From: Ingo Molnar
Date: Thursday, March 18, 2010 - 4:19 am

My own experience, and that of everyone i've talked to about such topics 
(developers and distro people), says the exact opposite about feature 
contribution: it's much harder to contribute features to multiple packages 
than to a single project.

kernel+library+app features take forever to propagate, and there's constant 
fear of version friction, productization deadlines are uncertain and ABI 
messups are frequent as well due to disjoint testing. Also, each component has 
essential veto power: so if the proposed API or approach is opposed or changed 
in a later stage then that affects (sometimes already committed) changes. If 
you've ever done it you'll know how tedious it is.

This very thread and recent threads about KVM usability demonstrate the same 
complications.

Thanks,

	Ingo
--

From: Frederic Weisbecker
Date: Thursday, March 18, 2010 - 11:20 am

I'm not going to argue about the Qemu merging here.
But your above assessment is incomplete.

It is not because developers don't want to clone two different
trees that tools/perf is a success. Or maybe it's a factor but
I suspect it to be very minimal. I can script git commands if
needed. It is actually because both kernel and user side are


I think it has already really good results.

--

From: Frank Ch. Eigler
Date: Thursday, March 18, 2010 - 12:50 pm

This argues that co-evolution of an interface is easiest on the
developers if they own both sides of that interface.  No quarrel.

This does not argue that the preservation of a stable ABI is best
done this way.  If anything, it makes it too easy to change both the
provider and the preferred user of the interface without noticing
unintentional breakage to forlorn out-of-your-tree clients.


- FChE
--

From: Ingo Molnar
Date: Thursday, March 18, 2010 - 1:47 pm

Your concern is valid, and this issue has been raised in the past as one of 
the main counter-arguments against tools/perf/. (there was a big flamewar 
about it on lkml when it was introduced)

Our roughly 1 year experience with perf is that, somewhat paradoxically, this 
scheme not only works as well as classic ABI schemes but actually brings a 
_better_ ABI than the classic "let the kernel define an ABI" single-sided 
solution.

I know the difference first hand, i've written various syscalls ABIs in the 
past 10+ years before perf and know how they interact with their user space 
counterparts.

Why did it work out better with tools/perf/? It turns out that there's an 
immediate, direct, actionable test feedback effect on the ABI, and much closer 
relation to the ABI. Typically the same developer implements the kernel bits 
and the user-space bits (because it's so easy to do co-development), so the 
ABI aspects are ingrained in the developer much more deeply. Once you see the 
kind of havoc ABI breakage can cause during development you avoid it in the 
future.

So developers find that a good, stable ABI helps development. It turns out 
that developers dont actually _want_ to break the ABI and are careful about it 
- and having the app next to the kernel ABI and co-developing it makes sure 
there's never any true mismatch.

Also, we can do ABI improvements at a far higher rate than any other kernel 
subsystem. I checked the git logs, we've done over three dozen ABI extensions 
since the first version, and all were forwards _and_ backwards compatible.
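
(To illustrate what "forwards _and_ backwards compatible" can mean for a struct-based ABI, here is a minimal sketch of one common technique: the caller records the struct size it was built against, new fields are only ever appended, and the provider accepts older and newer callers alike. Every name below (demo_attr_v1, demo_attr_v2, demo_copy_attr, DEMO_MAX_SIZE) is hypothetical and made up for illustration; this is not code from the thread and it is not claimed to be how tools/perf/ does it.)

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAX_SIZE 4096u   /* hypothetical upper bound on a caller's struct */

/* First published revision of a hypothetical ABI struct. */
struct demo_attr_v1 {
	uint32_t size;            /* sizeof the struct as the caller knows it */
	uint32_t type;
	uint64_t config;
};

/* A later revision only appends fields; existing offsets never move. */
struct demo_attr_v2 {
	uint32_t size;
	uint32_t type;
	uint64_t config;
	uint64_t sample_period;   /* new in v2; 0 means "not set" */
};

/*
 * Copy a caller-supplied struct of any revision into the provider's
 * current layout. Returns 0 on success, -1 if the request cannot be
 * honoured by this provider.
 */
int demo_copy_attr(const void *user, struct demo_attr_v2 *out)
{
	const unsigned char *bytes = user;
	uint32_t size;
	size_t i;

	memcpy(&size, user, sizeof(size));
	if (size < sizeof(struct demo_attr_v1) || size > DEMO_MAX_SIZE)
		return -1;

	/* Newer caller: fine as long as every field we don't know is zero. */
	for (i = sizeof(*out); i < size; i++)
		if (bytes[i] != 0)
			return -1;

	/* Older caller: fields it doesn't know about default to zero. */
	memset(out, 0, sizeof(*out));
	memcpy(out, user, size < sizeof(*out) ? size : sizeof(*out));
	out->size = sizeof(*out);
	return 0;
}
```

With this shape an old binary keeps working against a newer provider, and a new binary works against an older provider as long as it leaves the newer fields at zero.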

A higher rate of change gives developers more experience and lets them do a 
better ABI, and makes them more ABI-conscious. I think if all kernel ABIs had 
such a healthy rate of change we'd fill in all the missing kernel features 
very quickly.

With detached packages ABI features are often done by a kernel developer (who 
is familiar with the kernel subsystem in question) and a separate user-space 
developer (who is ...
From: Jes Sorensen
Date: Thursday, March 18, 2010 - 1:44 am

Ingo,

What made KVM so successful was that the core kernel of the hypervisor
was designed the right way, as a kernel module where it belonged. It was
obvious to anyone who had been exposed to the main competition at the
time, Xen, that this was the right approach. What has ended up killing
Xen in the end is the not-invented-here approach of copying everything
over, reformatting it, and rewriting half of it, which made it
impossible to maintain and support as a single codebase. At my previous
employer we ended up dropping all Xen efforts exactly because it was
like maintaining two separate operating system kernels. The key to KVM

Well there are two ways to go about this. Either you base the KVM
userland on top of an existing project, like QEMU, _or_ you rewrite it
all from scratch. However, there is far more to it than just a couple of
ioctls, for example the stack of reverse device-drivers is a pretty
significant code base, rewriting that and maintaining it is not a
trivial task. It is certainly my belief that the benefit we get from
sharing that with QEMU by far outweighs the cost of forking it and
keeping our own fork in the kernel tree. In fact it would result in

With this you have just thrown away all the benefits of having the QEMU
repository shared with other developers who will actively fix bugs in


Now that would be interesting, next we'll have to include things like
libxml in the kernel git tree as well, to make sure libvirt doesn't get

So far your argument would justify pulling all of gdb into the kernel
git tree as well, to support the kgdb efforts, or gcc so we can get rid
of the gcc version quirks in the kernel header files, e2fsprogs and
equivalent for _all_ file systems included in the kernel so we can make
sure our fs tools never get out of sync with what's supported in the

The user components for perf vs oprofile are _tiny_ projects compared to
the portions of QEMU that are actually used by KVM.

Oh and you completely forgot SeaBIOS. KVM+QEMU rely ...
From: Ingo Molnar
Date: Thursday, March 18, 2010 - 2:54 am

Yes, exactly.


Yes. Please realize that what is behind it is a strikingly simple argument:


Btw., i made similar arguments to Avi about 3 years ago when it was going 
upstream, that qemu should be unified with KVM. This is more true today than 

I do not suggest forking Qemu at all, i suggest using the most natural 


My experience as an external observer of the end result contradicts this.

Seemingly trivial usability changes to the KVM+Qemu combo are not being done 
often because they involve cross-discipline changes.

( _In this very thread_ there has been a somewhat self-defeating argument by 
  Anthony that multi-package scenario would 'significantly complicate' 
  matters. What more proof do we need to state the obvious? Keeping what
  has become one piece of technology over the years in two separate halves is

The way we have gone about this in tools/perf/ is similar to the route picked 
by Git: we only use very lowlevel libraries available everywhere, and we 
provide optional wrappers to the rest.

We are also using the kernel's libraries so we rarely need to go outside to 
get some functionality.

I.e. it's a non-issue in practice and despite perf having an (optional) 
dependency on xmlto and docbook we dont include those packages nor do we force 

gdb and gcc are clearly extrinsic to the kernel so why would we move them 
there?

I was talking about tools that are closely related to the kernel - where much 
of the development and actual use is in combination with the Linux kernel.

90%+ of the Qemu usecases are combined with Linux. (Yes, i know that you can 
run Qemu without KVM, and no, i dont think it matters in the grand scheme of 
things and most investment into Qemu comes from the KVM angle these days. In 
particular it for sure does not justify handicapping future KVM evolution so 

I know the size and scope of Qemu, i even hacked it - still my points remain. 

SeaBIOS is in essence a firmware, so it could either be loaded as such.

Just look ...
From: Jes Sorensen
Date: Thursday, March 18, 2010 - 3:40 am

That's a very glorified statement but it's not reality, sorry. You can do
that with something like perf because it's so small and development of

If you are not suggesting to fork QEMU, what are you suggesting then?
You don't seriously expect that the KVM community will be able to
mandate that the QEMU community switch to the Linux kernel repository?
That would be like telling the openssl developers that they should merge
with glibc and start working out of the glibc tree.

What you are suggesting is *only* going to happen if we fork QEMU, there
is zero chance to move the main QEMU repository into the Linux kernel
tree. And trust me, you don't want to have Linus having to deal with

You still haven't explained how you expect to create a unified KVM+QEMU

What I have seen you complain about here is the lack of a good end user
GUI for KVM. However that is a different thing. So far no vendor has put
significant effort into it, but that is nothing new in Linux. We have a
great kernel, but our user applications are still lacking. We have 217
CD players for GNOME, but we have no usable calendaring application.

A good GUI for virtualization is a big task, and whoever designs it will
base their design upon their preferences for what's important. A lot of
spare time developers would clearly care most about a gui installation
and fancy icons to click on, whereas server users would be much more
interested in automation and remote access to the systems. For a good
example of an incomplete solution, try installing Fedora over a serial
line, you cannot do half the things without launching VNC :( Getting a
comprehensive solution for this that would satisfy the bulk of the users
would be a huge chunk of code in the kernel tree. Imagine the screaming
that would result in? How often have we not had the moaning from x86
users who wanted to rip out all the non x86 code to reduce the size of


Did you ever look at what libvirt actually does and what it offers? Or
how about the various libraries ...
From: Ingo Molnar
Date: Thursday, March 18, 2010 - 3:58 am

I was not talking about just perf: i am also talking about the arch/x86/ 
unification which is 200+ KLOC of highly non-trivial kernel code with hundreds 
of contributors and with 8000+ commits in the past two years.

Also, it applies to perf as well: people said exactly that a year ago: 'perf 
has it easy to be clean as it is small, once it gets as large as Oprofile 
tooling it will be in the same messy situation'.

Today perf has more features than Oprofile, has a larger and more complex code 
base, has more contributors, and no, it's not in the same messy situation at 
all.

So whatever you think of large, unified projects, you are quite clearly 
mistaken. I have done and maintained through two different types of 
unifications and the experience was very similar: both developers and users 
(and maintainers) are much better off.

	Ingo
--

From: Jes Sorensen
Date: Thursday, March 18, 2010 - 6:23 am

Sorry but you cannot compare merging two chunks of kernel code that
originated from the same base, with the efforts of mixing a large

Both perf and oprofile are still relatively small projects in comparison

You believe that I am wrong in my assessment of unified projects, and I
obviously think you are mistaken and underestimating the cost and
effects of trying to merge the two.

Well I think we are just going to agree to disagree on this one. I am
not against merging projects where it makes sense, but in this
particular case I am strongly convinced the loss would be much greater
than the gain.

Cheers,
Jes
--

From: Ingo Molnar
Date: Thursday, March 18, 2010 - 7:22 am

That's true to a certain degree, but combined with the perf experience it's 
all rather clear.

Similar arguments were made against the x86 unification and against perf. 
Similar arguments were made against KVM and in favor of Xen years ago - back 
when few of you knew about it ;-)

These are all repeating patterns in my experience.

You could fairly contrast that with a _failed_ unification perhaps - but i'm 
not aware of any such failed unification. (please educate me if you are)

The thing is, unifications are rare in the OSS space not because they dont 
make sense technically (to the contrary), they are rare due to blind inertia 
(why change if we managed to muddle through with the current scheme?) and to a 
certain degree due to the egos involved ;-)

As such we have a proliferation of packages in Linux, and we'd be much better 
off in a more focused fashion. And whenever i see that in the kernel's context 

So is your argument that the unification does not make sense due to size? 

I wish you said that based on first hand negative experience with 
unifications, not based on just pure speculation.

(and yes, i speculate too, but at least with some basis)

	Ingo
--

From: Jes Sorensen
Date: Thursday, March 18, 2010 - 7:45 am

As I have stated repeatedly in this discussion, a unification would hurt
the QEMU development process because it would alienate a large number of
QEMU developers who are *not* Linux kernel users.


You still haven't given us a *single* example of unification of
something that wasn't purely linked to the Linux kernel. perf/
oprofile is 100% linked to the Linux kernel, QEMU is not. I wish
you would actually look at what users use QEMU for. As long as you
continue to purely speculate on this, to use your own words, your
arguments are not holding up.

And you are not being consistent either. You have conveniently
continued to ignore my questions about why the file system tools are not
to be merged into the Linux kernel source tree?

Jes
--

From: Ingo Molnar
Date: Thursday, March 18, 2010 - 9:54 am

I took a quick look at the qemu.git log and more than half of all recent 
contributions came from Linux distributors.

So without KVM Qemu would be a much, much smaller project. It would be similar 


The stats show that the huge increase in Qemu contributions over the past few 
years was mainly due to KVM. Do you claim it wasnt? What other projects make 

Sorry, i didnt comment on it because the answer is obvious: the file system 
tools and pretty much any Linux-exclusive tool (such as udev) should be moved 
there. The difference is that there's not much active development done in most 
of those tools so the benefits are probably marginal. Both Qemu and KVM are 
being developed very actively though, so development model inefficiencies show 
up.

Anyway, i didnt think i'd step into such a hornet's nest by explaining what i 
see as KVM's biggest weakness today and how i suggest it to be fixed :-)

If you dont agree with me, then dont do it - no need to get emotional about 
it.

Thanks,

	Ingo
--

From: Anthony Liguori
Date: Thursday, March 18, 2010 - 11:10 am

I don't know what you're looking at, but in the past month, there's been 
56 unique contributors, with 411 changesets.  I count 16 people employed 

I'm not saying that KVM isn't significant.  I'm employed to work on QEMU 
because of KVM.

I'm just saying that KVM users aren't 99% of the community and that we 
can't neglect the rest of the community.

Regards,

Anthony Liguori
--

From: Andrea Arcangeli
Date: Friday, March 19, 2010 - 7:53 am

Hi there,

not really trying to get into the CC list of this discussion ;) but
for what it's worth I'd like to share my opinion on the matter.


Full agreement with that. CVS/git/patches and development model is
next to irrelevant compared to the basic design of the code.

qemu (and especially qemu-kvm) is surely much closer to perf, than a
firefox or openoffice, because there is some tight interconnect with
the kernel API. And the skills required to produce useful patches in
qemu are similar to the skills required to produce useful patches for
the kernel, more often than not a new feature in kvm also requires
some merging of a qemu-kvm side patch (it always happened to me so far
;). But clearly we've to draw a barrier somewhere and while I could
see things like systemtap and util-linux included into the kernel and
perf already is, I have a hard time seeing userland code supporting
kernels other than linux into the kernel.

I think that's probably where I'd draw the line. Let's say somebody
creates a pure paravirt userland for kvm without full driver emulation
that only runs on a linux kernel and no other OS, maybe that thing
wouldn't be so controversial to include into the kernel as qemu
is. qemu is clearly beyond the "only-running-on-a-linux-kernel"
barrier...

I'd definitely start with systemtap, which I think is even more
suitable than perf to be merged into the kernel. Things useful only
for developers like perf/systemtap make even more sense to fetch
silently hidden in a single pull. Those projects are so ideal to fetch
together because you run your own compiled userland binary and not an
rpm, and you need very latest kernel and userland package and sometime
new userland might not work so well with older kernel too and the
other way around. They're tools for developers and no developer cares
about API as they rebuild latest userland code anyway, they almost

It also boils down to the maintainer, where the code is, defines the
maintainer who pushes/commits it to the ...
From: Anthony Liguori
Date: Thursday, March 18, 2010 - 7:38 am

Ok.  Then apply this to the kernel.  I'm then happy to take patches.

Regards,

Anthony Liguori

From: Anthony Liguori
Date: Thursday, March 18, 2010 - 7:44 am

QEMU is about 600k LOC.  We have a mechanism to compile out portions of 
the code but a lot things are tied together in an intimate way.  In the 
long run, we're working on adding stronger interfaces such that we can 
split components out into libraries that are consumable by other 
applications.

Simply forking the device model won't work.  Well more than half of 
our contributors are not coming from KVM developers/users.  If you just 
fork the device models, you start to lose a ton of fixes (look at Xen 
and VirtualBox).

So feel free to either 1) apply my previous patch and then start working 
on a "clean (and minimal)" QEMU or 2) wait to commit my previous patch 
and start sending patches to clean up QEMU.

Absolutely none of this is going to give you a VirtualBox like GUI for QEMU.

Regards,

Anthony Liguori
--

From: Joerg Roedel
Date: Tuesday, March 16, 2010 - 3:30 pm

Since we want to implement a pmu usable for the guest anyway, why don't we
just use the guest's perf to get all the information we want? If we get a
pmu-nmi from the guest we just re-inject it to the guest, and perf in the
guest gives us all the information we want, including kernel and userspace
symbols, stack traces, and so on.

In the previous thread we discussed about a direct trace channel between
guest and host kernel (which can be used for ftrace events for example).
This channel could be used to transport this information to the host
kernel.

The only additional feature needed is a way for the host to start a perf
instance in the guest.

Opinions?


	Joerg

--

From: Masami Hiramatsu
Date: Tuesday, March 16, 2010 - 4:01 pm

I guess this aims to get information from old environments running on

Interesting! I know the people who are trying to do that with systemtap.

# ssh localguest perf record --host-channel ... ? B-)


-- 
Masami Hiramatsu
e-mail: mhiramat@redhat.com
--

From: Ingo Molnar
Date: Wednesday, March 17, 2010 - 12:27 am

Look at the previous posting of this patch, this is something new and rather 
unique. The main power in the 'perf kvm' kind of instrumentation is to profile 
_both_ the host and the guest on the host, using the same tool (often using 
the same kernel) and using similar workloads, and do profile comparisons using 
'perf diff'.

Note that KVM's in-kernel design makes it easy to offer this kind of 
host/guest shared implementation that Yanmin has created. Other virtualization 
solutions with a poorer design (for example where the hypervisor code base is 
split away from the guest implementation) will have it much harder to create 
something similar.

That kind of integrated approach can result in very interesting finds straight 
away, see:

  http://lkml.indiana.edu/hypermail/linux/kernel/1003.0/00613.html

( the profile there demoes the need for spinlock accelerators for example - 
  there's clearly asymmetrically large overhead in guest spinlock code. Guess 
  how much else we'll be able to find with a full 'perf kvm' implementation. )

One of the main goals of a virtualization implementation is to eliminate as 
many performance differences to the host kernel as possible. From the first 
day KVM was released the overriding question from users was always: 'how much 
slower is it than native, and which workloads are hit worst, and why, and 
could you pretty please speed up important workload XYZ'.

'perf kvm' helps exactly that kind of development workflow.

Note that with oprofile you can already do separate guest space and host space 
profiling (with the timer driven fallback in the guest). One idea with 'perf 
kvm' is to change that paradigm of forced separation and forced duplication 
and to support the workflow that most developers employ: use the host space for 
development and unify instrumentation in an intuitive framework. Yanmin's 
'perf kvm' patch is a very good step towards that goal.

Anyway ... look at the patches, try them and see it for yourself. Back in the ...
From: Zhang, Yanmin
Date: Tuesday, March 16, 2010 - 12:48 am

With the patch, 'perf kvm report --sort pid' could show
summary statistics for all guest os instances. Then, use
Right, but there is a scope between kvm_guest_enter and really running
in guest os, where a perf event might overflow. Anyway, the scope is very
Right. I discussed with Yangsheng. I will move above data structures and
callbacks to file arch/x86/kvm/x86.c, and add get_ip, a new callback to
kvm_x86_ops.

Yanmin


--

From: Zhang, Yanmin
Date: Tuesday, March 16, 2010 - 2:28 am

Sorry. I found currently --pid isn't process but a thread (main thread).

Ingo,

Is it possible to support a new parameter or extend --inherit, so 'perf record' and
'perf top' could collect data on all threads of a process when the process is running?

If not, I need to add a new ugly parameter which is similar to --pid to filter out process
data in userspace.

Yanmin


--

From: Avi Kivity
Date: Tuesday, March 16, 2010 - 2:33 am

That seems like a worthwhile addition regardless of this thread.  
Profile all current threads and any new ones.  It probably makes sense 
to call this --pid and rename the existing --pid to --thread.

-- 
error compiling committee.c: too many arguments to function

--

From: Ingo Molnar
Date: Tuesday, March 16, 2010 - 2:47 am

Yeah. For maximum utility i'd suggest to extend --pid to include this, and 
introduce --tid for the previous, limited-to-a-single-task functionality.

Most users would expect --pid to work like a 'late attach' - i.e. to work like 
strace -f or like a gdb attach.
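
(For reference, a minimal sketch of the "all current threads" half of that late attach: a tool can find the threads of a running process by listing /proc/<pid>/task. The helper name list_threads is made up for illustration and this is not the code from any patch in this thread. Threads created after the listing are not covered here; they would still need something like inherit or a re-scan.)

```c
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/*
 * Collect the thread ids of process `pid` by listing /proc/<pid>/task.
 * On success stores a malloc()ed array in *tids (caller frees it) and
 * returns the number of threads found; returns -1 on error.
 */
int list_threads(pid_t pid, pid_t **tids)
{
	char path[64];
	struct dirent *ent;
	pid_t *list = NULL;
	int n = 0, cap = 0;
	DIR *dir;

	snprintf(path, sizeof(path), "/proc/%d/task", (int)pid);
	dir = opendir(path);
	if (!dir)
		return -1;

	while ((ent = readdir(dir)) != NULL) {
		char *end;
		long tid = strtol(ent->d_name, &end, 10);

		/* Skip "." and ".." and anything else that is not a number. */
		if (end == ent->d_name || *end != '\0')
			continue;

		if (n == cap) {
			int ncap = cap ? cap * 2 : 8;
			pid_t *tmp = realloc(list, ncap * sizeof(*list));

			if (!tmp) {
				free(list);
				closedir(dir);
				return -1;
			}
			list = tmp;
			cap = ncap;
		}
		list[n++] = (pid_t)tid;
	}
	closedir(dir);

	*tids = list;
	return n;
}
```

A --pid style option would then open one counter per returned tid, while a --tid style option would skip the listing and use the given id directly.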

	Ingo
--

From: Zhang, Yanmin
Date: Wednesday, March 17, 2010 - 2:26 am

Thanks Ingo, Avi.

I worked out below patch against tip/master of March 15th.

Subject: [PATCH] Change perf's parameter --pid to process-wide collection
From: Zhang, Yanmin <yanmin_zhang@linux.intel.com>

Change parameter -p (--pid) to real process pid and add -t (--tid) meaning
thread id. Now, --pid means perf collects the statistics of all threads of
the process, while --tid means perf just collects the statistics of that thread.

BTW, the patch fixes a bug of 'perf stat -p'. 'perf stat' always configures
attr->disabled=1 if it isn't a system-wide collection. If there is a '-p'
and no forks, 'perf stat -p' doesn't collect any data. In addition, the
while(!done) in run_perf_stat consumes 100% single cpu time which has bad impact
on running workload. I added a sleep(1) in the loop.

Signed-off-by: Zhang Yanmin <yanmin_zhang@linux.intel.com>

---

diff -Nraup linux-2.6_tipmaster0315/tools/perf/builtin-record.c linux-2.6_tipmaster0315_perfpid/tools/perf/builtin-record.c
--- linux-2.6_tipmaster0315/tools/perf/builtin-record.c	2010-03-16 08:59:54.896488489 +0800
+++ linux-2.6_tipmaster0315_perfpid/tools/perf/builtin-record.c	2010-03-17 16:30:17.755551706 +0800
@@ -27,7 +27,7 @@
 #include <unistd.h>
 #include <sched.h>
 
-static int			fd[MAX_NR_CPUS][MAX_COUNTERS];
+static int			*fd[MAX_NR_CPUS][MAX_COUNTERS];
 
 static long			default_interval		=      0;
 
@@ -43,6 +43,9 @@ static int			raw_samples			=      0;
 static int			system_wide			=      0;
 static int			profile_cpu			=     -1;
 static pid_t			target_pid			=     -1;
+static pid_t			target_tid			=     -1;
+static int			*all_tids			=      NULL;
+static int			thread_num			=      0;
 static pid_t			child_pid			=     -1;
 static int			inherit				=      1;
 static int			force				=      0;
@@ -60,7 +63,7 @@ static struct timeval		this_read;
 
 static u64			bytes_written			=      0;
 
-static struct pollfd		event_array[MAX_NR_CPUS * MAX_COUNTERS];
+static struct pollfd		*event_array;
 
 static int			nr_poll				=    ...
From: Zhang, Yanmin
Date: Wednesday, March 17, 2010 - 7:45 pm

Ingo,

Sorry, the patch has bugs.  I need to do a better job and will work out 2
separate patches against the 2 issues.

Yanmin


--

From: Zhang, Yanmin
Date: Thursday, March 18, 2010 - 12:49 am

I worked out 3 new patches against tip/master tree of Mar. 17th.

1) Patch perf_stat: Fix the issue that perf doesn't enable counters when
target_pid != -1. Change the condition to fork/exec subcommand. If there
is a subcommand parameter, perf always fork/exec it. The usage example is:
#perf stat -a sleep 10
So this command could collect statistics for 10 seconds precisely. User
still could stop it by CTRL+C.

2) Patch perf_record: Fix the issue that when perf forks/exec a subcommand,
it should enable all counters after the new process is execing. Change the
condition to fork/exec subcommand. If there is a subcommand parameter,
perf always fork/exec it. The usage example is:
#perf record -f -a sleep 10
So this command could collect statistics for 10 seconds precisely. User
still could stop it by CTRL+C.

3) perf_pid: Change parameter --pid to process-wide collection. Add --tid
which means collecting thread-wide statistics. Usage example is:
#perf top -p 8888
#perf record -p 8888 -f sleep 10
#perf stat -p 8888 -f sleep 10
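
(A minimal sketch of one common way to make sure a forked subcommand does not start running before the parent has finished its setup: the child blocks on a pipe until the parent writes a "go" byte, and only then calls exec. The names run_gated and setup are hypothetical, and this is an illustration of the general fork/exec ordering problem, not the code in the attached patches.)

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that waits on a pipe until the parent has finished its
 * setup, then execs argv. `setup` may be NULL. Returns the child's exit
 * status, or -1 on error or abnormal termination.
 */
int run_gated(char *const argv[], void (*setup)(pid_t child))
{
	int go[2], status;
	char token = 0;
	ssize_t written;
	pid_t pid;

	if (pipe(go) < 0)
		return -1;

	pid = fork();
	if (pid < 0) {
		close(go[0]);
		close(go[1]);
		return -1;
	}

	if (pid == 0) {
		close(go[1]);
		/* Block here until the parent writes the go byte. */
		if (read(go[0], &token, 1) != 1)
			_exit(126);
		close(go[0]);
		execvp(argv[0], argv);
		_exit(127);		/* exec failed */
	}

	close(go[0]);
	if (setup)
		setup(pid);		/* e.g. open and enable counters for pid */

	/* If this write fails, the close below gives the child EOF and it exits. */
	written = write(go[1], &token, 1);
	(void)written;
	close(go[1]);

	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The point of the gate is only ordering: whatever the parent does in setup is guaranteed to be in place before the first instruction of the new program runs.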

Arnaldo,

Pls. apply the 3 attached patches.

Yanmin

From: Ingo Molnar
Date: Thursday, March 18, 2010 - 1:03 am

Cool! Mind sending them as a series of patches instead of attachment? That 
makes it easier to review them. Also, the Signed-off-by lines seem to be 
missing plus we need a per patch changelog as well.

Thanks,

	Ingo
--

From: Arnaldo Carvalho de Melo
Date: Thursday, March 18, 2010 - 6:03 am

Yeah, please, and I hadn't merged them, so the resend was the best thing to do.

- Arnaldo
--

From: Avi Kivity
Date: Tuesday, March 16, 2010 - 2:32 am

That certainly works, though automatic association of guest data with 

There is also a window between setting the flag and calling 'int $2' 
where an NMI might happen and be accounted incorrectly.

Perhaps separate the 'int $2' into a direct call into perf and another 
call for the rest of NMI handling.  I don't see how it would work on svm 
though - AFAICT the NMI is held whereas vmx swallows it.  I guess NMIs 

You will need access to the vcpu pointer (kvm_rip_read() needs it), you 
can put it in a percpu variable.  I guess if it's not null, you know 
you're in a guest, so no need for PF_VCPU.
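
(A rough user-space sketch of that idea, assuming a per-CPU pointer that is set around guest execution: a thread-local variable stands in for the kernel's percpu storage, and every name here (demo_vcpu, demo_guest_enter, demo_sample_ip, ...) is hypothetical. It does not address the window mentioned above; it only shows how a non-null pointer can replace a separate "in guest" flag.)

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the vcpu state; only the guest RIP matters here. */
struct demo_vcpu {
	uint64_t guest_rip;
};

/* One slot per CPU in the kernel; a thread-local is the user-space analogue. */
static _Thread_local struct demo_vcpu *current_vcpu;

/* Called right before entering the guest on this CPU. */
void demo_guest_enter(struct demo_vcpu *vcpu)
{
	current_vcpu = vcpu;
}

/* Called once guest-related interrupt handling on this CPU is done. */
void demo_guest_exit(void)
{
	current_vcpu = NULL;
}

/* A non-null pointer means the sample interrupted guest execution. */
int demo_in_guest(void)
{
	return current_vcpu != NULL;
}

/* Pick the instruction pointer a sample should be attributed to. */
uint64_t demo_sample_ip(uint64_t host_ip)
{
	return current_vcpu ? current_vcpu->guest_rip : host_ip;
}
```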

-- 
error compiling committee.c: too many arguments to function

--

From: Zhang, Yanmin
Date: Tuesday, March 16, 2010 - 7:34 pm

Thanks. Originally, I planned to add a -G parameter to perf, such as
-G 8888:/XXX/XXX/guestkallsyms:/XXX/XXX/modules,8889:/XXX/XXX/guestkallsyms:/XXX/XXX/modules
8888 and 8889 are just qemu guest pid.

So we could define multiple guest os symbol files. But it seems ugly,
and 'perf kvm report --sort pid' and 'perf kvm top --pid' could provide
I'm not sure if vmexit does break NMI context or not. Hardware NMI context
Good suggestion.

Thanks.


--

From: Sheng Yang
Date: Wednesday, March 17, 2010 - 2:28 am

After more checking, I think VMX won't retain the NMI-blocked state for the host. 
That means, if an NMI happens while the processor is in VMX non-root mode, it 
would only result in a VMExit, with a reason indicating that it's due to an NMI, 
but no more state change in the host.

So in that sense, there _is_ a window between the VMExit and KVM handling the NMI. 
Moreover, I think we _can't_ stop re-entrance of the NMI handling code because 
"int $2" has no effect on blocking following NMIs.

And if the NMI sequence is not important (I think so), then we need to generate 
a real NMI in the current after-vmexit code. Letting the APIC send an NMI IPI to 
itself seems like a good idea.

I am debugging a patch based on apic->send_IPI_self(NMI_VECTOR) to replace 
"int $2". Something unexpected is happening...

-- 
regards
Yang, Sheng
--

From: Avi Kivity
Date: Wednesday, March 17, 2010 - 2:41 am

That's pretty bad, as NMI runs on a separate stack (via IST).  So if 
another NMI happens while our int $2 is running, the stack will be 

I think you need DM_NMI for that to work correctly.

An alternative is to call the NMI handler directly.

-- 
error compiling committee.c: too many arguments to function

--

From: Sheng Yang
Date: Wednesday, March 17, 2010 - 2:51 am

Though hardware didn't provide this kind of block, software at least would 
warn about it... nmi_enter() still would be executed by "int $2", and result 
in BUG() if we are already in NMI context(OK, it is a little better than 

apic_send_IPI_self() already took care of APIC_DM_NMI.

And NMI handler would block the following NMI?

-- 
regards
Yang, Sheng
--

From: Avi Kivity
Date: Wednesday, March 17, 2010 - 3:06 am

It wouldn't - won't work without extensive changes.

-- 
error compiling committee.c: too many arguments to function

--

From: Zachary Amsden
Date: Wednesday, March 17, 2010 - 2:14 pm

You can't use the APIC to send vectors 0x00-0x1f, or at least, aren't 
supposed to be able to.

Zach
--

From: Sheng Yang
Date: Wednesday, March 17, 2010 - 6:19 pm

Um? Why?

Especially kernel is already using it to deliver NMI.

-- 
regards
Yang, Sheng
--

From: Zachary Amsden
Date: Wednesday, March 17, 2010 - 9:50 pm

That's the only defined case, and it is defined because the vector field 
is ignored for DM_NMI.  Vol 3A (exact section numbers may vary depending 
on your version).

8.5.1 / 8.6.1

'100 (NMI) Delivers an NMI interrupt to the target processor or 
processors.  The vector information is ignored'

8.5.2  Valid Interrupt Vectors

'Local and I/O APICs support 240 of these vectors (in the range of 16 to 
255) as valid interrupts.'

8.8.4 Interrupt Acceptance for Fixed Interrupts

'...; vectors 0 through 15 are reserved by the APIC (see also: Section 
8.5.2, "Valid Interrupt Vectors")'

So I misremembered, apparently you can deliver interrupts 0x10-0x1f, but 
vectors 0x00-0x0f are not valid to send via APIC or I/O APIC.

Zach
--

From: Sheng Yang
Date: Wednesday, March 17, 2010 - 10:22 pm

As you pointed out, NMI is not "Fixed interrupt". If we want to send NMI, it 
would need a specific delivery mode rather than vector number. 

And if you look at the code, if we specify NMI_VECTOR, the delivery mode would be 
set to NMI.

So what's wrong here?

-- 
regards
Yang, Sheng
--

From: Sheng Yang
Date: Wednesday, March 17, 2010 - 10:41 pm

OK, I think I understand your points now. You meant that these vectors can't 
be filled in the vector field directly, right? But NMI is an exception due to 
DM_NMI. Is that your point? I think we agree on this.

-- 
regards
Yang, Sheng
--

From: Zachary Amsden
Date: Thursday, March 18, 2010 - 1:47 am

Yes, I think we agree.  NMI is the only vector in 0x0-0xf which can be 
sent via self-IPI because the vector itself does not matter for NMI.
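A minimal sketch of the rule agreed on above, assuming the usual ICR bit layout (delivery mode in bits 8-10, destination shorthand in bits 18-19). The helper name prepare_icr is hypothetical and only models how the low ICR word would be composed; it is not the kernel's code.

```python
# Illustrative constants following the local APIC ICR layout.
APIC_DM_FIXED = 0x000      # delivery mode 000b (bits 8-10)
APIC_DM_NMI = 0x400        # delivery mode 100b (bits 8-10)
APIC_DEST_SELF = 0x40000   # destination shorthand 01b (bits 18-19)
NMI_VECTOR = 2


def prepare_icr(shortcut: int, vector: int) -> int:
    """Compose the low ICR word for an IPI (hypothetical helper)."""
    if vector == NMI_VECTOR:
        # With NMI delivery mode the vector field is ignored,
        # so it is simply left as zero.
        return shortcut | APIC_DM_NMI
    if not 16 <= vector <= 255:
        # Per the manual text quoted in this thread, only 16..255
        # are valid vectors for fixed delivery.
        raise ValueError(f"vector {vector:#x} cannot be sent via the APIC")
    return shortcut | APIC_DM_FIXED | vector
```

So a self-IPI with NMI_VECTOR never places 2 in the vector field; it only selects the NMI delivery mode.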

Zach
--

From: Zhang, Yanmin
Date: Thursday, March 18, 2010 - 8:38 pm

Here is the new patch of V2 against tip/master of March 17th
if anyone wants to try it.


ChangeLog V2:
	1) Based on Avi's suggestion, I moved callback functions
	to generic code area. So the kernel part of the patch is
	clearer.
	2) Add 'perf kvm stat'.


From: Zhang, Yanmin <yanmin_zhang@linux.intel.com>

Based on the discussion in KVM community, I worked out the patch to support
perf to collect guest os statistics from host side. This patch is implemented
with Ingo, Peter and some other guys' kind help. Yang Sheng pointed out a
critical bug and provided good suggestions with other guys. I really appreciate
their kind help.

The patch adds new subcommand kvm to perf.

  perf kvm top
  perf kvm record
  perf kvm report
  perf kvm diff
  perf kvm stat

The new perf could profile guest os kernel except guest os user space, but it
could summarize guest os user space utilization per guest os.

Below are some examples.
1) perf kvm top
[root@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms
--guestmodules=/home/ymzhang/guest/modules top

--------------------------------------------------------------------------------------------------------------------------
   PerfTop:   16010 irqs/sec  kernel:59.1% us: 1.5% guest kernel:31.9% guest us: 7.5% exact:  0.0% [1000Hz cycles],  (all, 16 CPUs)
--------------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                  DSO
             _______ _____ _________________________ _______________________

            38770.00 20.4% __ticket_spin_lock        [guest.kernel.kallsyms]
            22560.00 11.9% ftrace_likely_update      [kernel.kallsyms]
             9208.00  4.8% __lock_acquire            [kernel.kallsyms]
             5473.00  2.9% trace_hardirqs_off_caller [kernel.kallsyms]
             5222.00  2.7% copy_user_generic_string  [guest.kernel.kallsyms]
             4450.00  2.3% ...
From: Ingo Molnar
Date: Friday, March 19, 2010 - 1:21 am

Nice progress!


Will really be painful to developers - to enter that long line while we 
have these things called 'computers' that ought to reduce human work. Also, 
it's incomplete, we need access to the guest system's binaries to do ELF 
symbol resolution and dwarf decoding.

So we really need some good, automatic way to get to the guest symbol space, 
so that if a developer types:

   perf kvm top

Then the obvious thing happens by default. (which is to show the guest 
overhead)

There's no technical barrier on the perf tooling side to implement all that: 
perf supports build-ids extensively and can deal with multiple symbol spaces - 
as long as it has access to it. The guest kernel could be ID-ed based on its 
/sys/kernel/notes and /sys/module/*/notes/.note.gnu.build-id build-ids.
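As a rough illustration of ID-ing a kernel by its notes, here is a sketch that extracts a GNU build-id from raw ELF note data such as the contents of /sys/kernel/notes. It assumes little-endian note headers and the standard note layout (namesz, descsz, type, then 4-byte padded name and descriptor); the function name find_build_id is hypothetical.

```python
import struct

NT_GNU_BUILD_ID = 3  # note type used for GNU build-ids


def _align4(n: int) -> int:
    """Round n up to the next multiple of 4 (note fields are padded)."""
    return (n + 3) & ~3


def find_build_id(notes: bytes):
    """Return the build-id as a hex string, or None if no such note exists."""
    off = 0
    while off + 12 <= len(notes):
        namesz, descsz, ntype = struct.unpack_from("<III", notes, off)
        off += 12
        name = notes[off:off + namesz]
        off += _align4(namesz)
        desc = notes[off:off + descsz]
        off += _align4(descsz)
        if ntype == NT_GNU_BUILD_ID and name == b"GNU\0":
            return desc.hex()
    return None
```

On a host this could be tried as find_build_id(open("/sys/kernel/notes", "rb").read()); for a guest the same file would have to be reachable through whatever guest mount ends up being agreed on.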

So some sort of --guestmount option would be the natural solution, which 
points to the guest system's root: and a Qemu enumeration of guest mounts 
(which would be off by default and configurable) from which perf can pick up 
the target guest all automatically. (obviously only under allowed permissions 
so that such access is secure)

This would allow not just kallsyms access via $guest/proc/kallsyms but also 
gives us the full space of symbol features: access to the guest binaries for 
annotation and general symbol resolution, command/binary name identification, 
etc.

Such a mount would obviously not broaden existing privileges - and as an 
additional control a guest would also have a way to indicate that it does not 
wish a guest mount at all.

Unfortunately, in a previous thread the Qemu maintainer has indicated that he 
will essentially NAK any attempt to enhance Qemu to provide an easily 
discoverable, self-contained, transparent guest mount on the host side.

No technical justification was given for that NAK, despite my repeated 
requests to articulate the exact security problems that such an approach 
would cause.

If that NAK does not stand in that form then i'd like ...
From: Joerg Roedel
Date: Friday, March 19, 2010 - 10:29 am

I still think it is the best and most generic way to let the guest do
the symbol resolution. This has several advantages:

	1. The guest knows best about its symbol space. So this would be
	   extensible to other guest operating systems.  A brave
	   developer may even implement symbol passing for Windows or
	   the BSDs ;-)

	2. The guest can decide on its own if it wants to pass this
	   information to the host-perf. No security issues at all.

	3. The guest can also pass us the call-chain and we don't need
	   to care about the complications of fetching it from the guest
	   ourselves.

	4. This way is extensible to nested virtualization too.

How we speak to the guest was already discussed in this thread. My
personal opinion is that going through qemu is an unnecessary step and
we can solve that in a more clever and transparent way for perf.

	Joerg

--

From: Ingo Molnar
Date: Sunday, March 21, 2010 - 11:43 am

Having access to the actual executable files that include the symbols achieves 
precisely that - with the additional robustness that all this functionality is 
concentrated into the host, while the guest side is kept minimal (and 

It can decide whether it exposes the files. Nor are there any "security 

You need to be aware of the fact that symbol resolution is a separate step 
from call chain generation.

I.e. call-chains are a (entirely) separate issue, and could reasonably be done 
in the guest or in the host.


Nested virtualization is actually already taken care of by the filesystem 
solution via an existing method called 'subdirectories'. If the guest offers 
sub-guests then those symbols will be exposed in a similar way via its own 
'guest files' directory hierarchy.

I.e. if we have 'Guest-2' nested inside the 'Guest-Fedora-1' instance, we get:

 /guests/
 /guests/Guest-Fedora-1/etc/
 /guests/Guest-Fedora-1/usr/

we'd also have:

 /guests/Guest-Fedora-1/guests/Guest-2/

So this is taken care of automatically.
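The 'subdirectories' scheme can be sketched as a small path builder. This is only an illustration of the layout shown above; the function name guest_root and the fixed directory name "guests" are assumptions taken from the example paths.

```python
def guest_root(chain, guests_dir="guests"):
    """Build the host-side root path for a (possibly nested) guest.

    chain lists guest names from the outermost guest inwards,
    e.g. ["Guest-Fedora-1", "Guest-2"].
    """
    parts = []
    for name in chain:
        # Reject names that would escape or break the hierarchy.
        if "/" in name or name in ("", ".", ".."):
            raise ValueError(f"invalid guest name: {name!r}")
        parts += [guests_dir, name]
    return "/" + "/".join(parts)
```

For example, guest_root(["Guest-Fedora-1", "Guest-2"]) gives /guests/Guest-Fedora-1/guests/Guest-2, matching the nested layout above.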

I.e. none of the four 'advantages' listed here are actually advantages over my 

Meaning exactly what?

Thanks,

	Ingo
--

From: Joerg Roedel
Date: Monday, March 22, 2010 - 3:14 am

If you want to access the guest's file-system you need a piece of
software running in the guest which gives you this access. But when you
get an event this piece of software may not be runnable (if the guest is
in an interrupt handler or any other non-preemptible code path). When the
host finally gets access to the guest's filesystem again the source of
that event may already be gone (process has exited, module unloaded...).
The only way to solve that is to pass the event information to the guest

I am not talking about security. Security was sufficiently flamed about


Avi was against that but I think it would make sense to give names to
virtual machines (with a default, similar to network interface names).
Then we can create a directory in /dev/ with that name (e.g.
/dev/vm/fedora/). Inside the guest a (privileged) process can create
some kind of named virt-pipe which results in a device file created in
the guest's directory (perf could create /dev/vm/fedora/perf for
example). This file is used for guest-host communication.

Thanks,

	Joerg

--

From: Ingo Molnar
Date: Monday, March 22, 2010 - 3:37 am

You were talking about security, in the portion of your mail that you snipped 

I understood that portion to mean what it says: that your claim that your 

All i saw was my suggestion to allow a guest to securely (and scalably and 
conveniently) integrate/mount its filesystems to the host if both sides (both 
the host and the guest) permit it, to make it easier for instrumentation to 
pick up symbol details.

I.e. if a guest runs then its filesystem may be present on the host side as:

   /guests/Fedora-G1/
   /guests/Fedora-G1/proc/
   /guests/Fedora-G1/usr/
   /guests/Fedora-G1/.../

( This feature would be configurable and would be default-off, to maintain the 
  current status quo. )

i.e. it's a bit like sshfs or NFS or loopback block mounts, just in an 
integrated and working fashion (sshfs doesn't work well with /proc for example) 
and more guest transparent (obviously sshfs or NFS exports need per guest 
configuration), and lower overhead than sshfs/NFS - i.e. without the 
(unnecessary) networking overhead.

That suggestion was 'countered' by an unsubstantiated claim by Anthony that 
this kind of usability feature would somehow be a 'security nightmare'.

In reality it is just an incremental, more usable, faster and more 
guest-transparent form of what is already possible today via:

  - loopback mounts on host
  - NFS exports
  - SMB exports
  - sshfs
  - (and other mechanisms)

I wish there was at least flaming about it - as flames tend to have at least 
some specifics in them.

What i saw instead was a claim about a 'security nightmare', which, when i 
asked for specifics, was followed by deafening silence. And you appear to have 
repeated that claim here, unwilling to back it up with specifics.

Thanks,

	Ingo
--

From: Ingo Molnar
Date: Monday, March 22, 2010 - 3:59 am

The very same is true of profiling in the host space as well (KVM is nothing 
special here, other than its unreasonable insistence on not enumerating 
readily available information in a more usable way).

So are you suggesting a solution to a perf problem we already solved 
differently? (and which i argue we solved in a better way)

We have solved that in the host space already (and quite elaborately so), and 
not via your suggestion of moving symbol resolution to a different stage, but 
by properly generating the right events to allow the post-processing stage to 
see processes that have already exited, to robustly handle files that have 
been rebuilt, etc.

From an instrumentation POV it is fundamentally better to acquire the right 
data and delay any complexities to the analysis stage (the perf model) than to 
complicate sampling (the oprofile dcookies model).

Your proposal of 'doing the symbol resolution in the guest context' is in 
essence re-arguing that very similar point that oprofile lost. Did you really 
intend to re-argue that point as well? If yes then please propose an 
alternative implementation for everything that perf does wrt. symbol lookups.

What we propose for 'perf kvm' right now is simply a straight-forward 
extension of the existing (and well working) symbol handling code to 

Best would be if you demonstrated any problems of the perf symbol lookup code 
you are aware of on the host side, as it has that exact design you are 
criticising here. We are eager to fix any bugs in it.

If you claim that it's buggy then that should very much be demonstrable - no 
need to go into theoretical arguments about it.

( You should be aware of the fact that perf currently works with 'processes
  exiting prematurely' and similar scenarios just fine, so if you want to

That is kind of half of my suggestion - the built-in enumeration of guests and a 
guaranteed channel to them accessible to tools. (KVM already has its own 
special channel so it's not like channels ...
From: Joerg Roedel
Date: Monday, March 22, 2010 - 4:47 am

I am not claiming anything. I just tried to imagine what your proposal
will look like in practice and forgot that symbol resolution is done at
a later point.
But even with deferred symbol resolution we need more information from
the guest than just the rip falling out of KVM. The guest needs to tell
us about the process where the event happened (information that the host
has about itself without any hassle) and which executable-files it was

Probably. At least it is the solution that fits best into the current
design of perf. But we should think about how this will be done. Raw
disk access is no solution because we need to access virtual
file-systems of the guest too. Network filesystems may be a solution but
then we come back to the 'deployment-nightmare'.

	Joerg

--

From: Ingo Molnar
Date: Monday, March 22, 2010 - 5:26 am

Correct - for full information we need a good paravirt perf integration of the 
kernel bits to pass that through. (I.e. we want to 'integrate' the PID space 
as well, at least within the perf notion of PIDs.)


I never said anything about 'raw disk access'. Have you seen my proposal of 
(optional) VFS namespace integration? (It can be found repeated the Nth time 
in my mail you replied to)

Thanks,

	Ingo
--

From: Soeren Sandmann
Date: Tuesday, March 23, 2010 - 6:18 am

Slightly tangential, but there is another case that has some of the
same problems: profiling other language runtimes than C and C++, say
Python. At the moment profilers will generally tell you what is going
on inside the python runtime, but not what the python program itself
is doing.

To fix that problem, it seems like we need some way to have python
export what is going on. Maybe the same mechanism could be used to
both access what is going on in qemu and python.


Soren
--

From: Andi Kleen
Date: Tuesday, March 23, 2010 - 6:49 am

oprofile already has an interface to let JITs export
information about the JITed code. C Python is not a JIT,
but presumably one of the python JITs could do it.

http://oprofile.sourceforge.net/doc/devel/index.html

I know it's not en vogue anymore and you won't be an approved 
cool kid if you do, but you could just use oprofile? 

Ok presumably one would need to do a python interface for this
first. I believe it's currently only implemented for Java and
Mono. I presume it might work today with IronPython on Mono.

IMHO it doesn't make sense to invent another interface for this,
although I'm sure someone will propose just that.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Soeren Sandmann
Date: Tuesday, March 23, 2010 - 7:04 am

It's not that I personally want to profile a particular python
program. I'm interested in the more general problem of extracting more
information from profiled user space programs than just stack traces.

Examples:

        - What is going on inside QEMU? 

        - Which client is the X server servicing?

        - Which parts of a python/shell/scheme/javascript program are
          taking the most CPU time?

I don't think the oprofile JIT interface solves any of these
problems. (In fact, I don't see why the JIT problem is even hard. The
JIT compiler can just generate a little ELF file with symbols in it,
and the profiler can pick it up through the mmap events that you get

I am bringing this up because I want to extend sysprof to be more
useful. 


Soren
--

From: Andi Kleen
Date: Tuesday, March 23, 2010 - 7:20 am

I suspect for those you rather need event based tracers of some sort,
similar to kernel trace points. Otherwise you would need own
separate stacks and other complications.

systemtap has some effort to use the dtrace instrumentation
that crops up in more and more user programs for this.  It wouldn't
surprise me if that was already in python and other programs
you're interested in.

I presume right now it only works if you apply the utrace monstrosity
though, but perhaps the new uprobes patches floating around 
will come to rescue.

There also was some effort to have a pure user space
daemon based approach for LTT, but I believe that currently
needs own trace points.

Again I fully expect someone to reinvent the wheel here

That would require keeping those temporary ELF files for
potentially unlimited time around (profilers today look at the ELF
files at the final analysis phase, which might be weeks away)

Also that would be a lot of overhead for the JIT and most likely
be a larger scale rewrite for a given JIT code base.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Arnaldo Carvalho de Melo
Date: Tuesday, March 23, 2010 - 7:29 am

'perf record' will traverse the perf.data file just collected and, if the
binaries have build-ids, will stash them in ~/.debug/, keyed by build-id
just like the -debuginfo packages do.

So only the binaries with hits. Also one can use 'perf archive' to
create a tar.bz2 file with the files with hits for the specified
perf.data file, that can then be transferred to another machine, whatever
arch, untarred at ~/.debug and then the report can be done there.

As it is done by build-id, multiple 'perf record' sessions share files
in the cache.
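A small sketch of what 'keyed by build-id' could look like on disk. The .build-id/xx/yyyy... split mirrors the convention used by -debuginfo packages; that perf's cache uses exactly this layout is an assumption here, and cache_path is a hypothetical helper.

```python
import os
import posixpath


def cache_path(build_id_hex: str, cache_dir: str = "~/.debug") -> str:
    """Map a 20-byte build-id (40 hex digits) to a cache file path."""
    bid = build_id_hex.lower()
    if len(bid) != 40:
        raise ValueError("expected 40 hex digits (a 20-byte build-id)")
    bytes.fromhex(bid)  # raises ValueError if not valid hex
    # The first byte names the directory, the remaining 19 bytes the file.
    return posixpath.join(os.path.expanduser(cache_dir),
                          ".build-id", bid[:2], bid[2:])
```

Because the path depends only on the build-id, two 'perf record' sessions that hit the same binary would resolve to the same cache entry.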

Right now the whole ELF file (or /proc/kallsyms copy) is stored if
collected from the DSO directly, or the bits that are stored in
-debuginfo files if we find it installed (so smaller). We could strip
that down further by storing just the ELF sections needed to make sense
of the symtab.

- Arnaldo
--

From: Frank Ch. Eigler
Date: Tuesday, March 23, 2010 - 7:46 am

These kinds of questions usually require navigation through internal
data of the user-space process ("Where in this linked list is this
pointer?"), and often also correlating them with history ("which
socket/fd was most recently serviced?").

Systemtap excels at letting one express such things.

- FChE
--

From: Arnaldo Carvalho de Melo
Date: Tuesday, March 23, 2010 - 7:10 am

perf also has support for this and Pekka Enberg's jato uses it:

http://penberg.blogspot.com/2009/06/jato-has-profiler.html
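For illustration, one mechanism perf offers for JITed code is a per-process map file, /tmp/perf-<pid>.map, with one 'START SIZE symbolname' line per function (START and SIZE in hex without a 0x prefix). Whether that is exactly the interface meant in this message is an assumption; the helper names and the sample symbol below are hypothetical.

```python
def format_map_line(start: int, size: int, name: str) -> str:
    """Format one map entry: hex start, hex size, symbol name."""
    return f"{start:x} {size:x} {name}\n"


def write_perf_map(path: str, symbols) -> None:
    """Write (start, size, name) tuples to a map file, sorted by address."""
    with open(path, "w") as f:
        for start, size, name in sorted(symbols):
            f.write(format_map_line(start, size, name))
```

A JIT would call something like write_perf_map(f"/tmp/perf-{os.getpid()}.map", symbols) after importing os, with symbols describing its generated functions.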

- Arnaldo
--

From: Peter Zijlstra
Date: Tuesday, March 23, 2010 - 8:23 am

Right, we need to move that into a library though (always meant to do
that, never got around to doing it).

That way the app can link against a dso with weak empty stubs and have
perf record LD_PRELOAD a version that has a suitable implementation.

That all has the advantage of not exposing the actual interface like we
do now.
--

From: Zhang, Yanmin
Date: Monday, March 22, 2010 - 12:24 am

Yes, I agree with you and Avi that we need the enhancement to be user-friendly.
One of my starting points is to keep the tool less dependent on
other components. Admin/developers could write script wrappers quickly if
I tried sshfs quickly. sshfs could mount root filesystem of guest os nicely.
I could access the files quickly. However, it doesn't work when I access
/proc/ and /sys/ because sshfs/scp depend on file size while the sizes of most
If sshfs could access /proc/ and /sys correctly, here is a design:
--guestmount points to a directory which consists of a list of sub-directories.
Every sub-directory's name is just the qemu process id of guest os. Admin/developer
mounts every guest os instance's root directory to corresponding sub-directory.
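A minimal sketch of how a tool might walk such a --guestmount directory, assuming each sub-directory is named after the qemu process id and holds that guest's mounted root. The function name enumerate_guests and the choice of proc/kallsyms as the file of interest are illustrative only.

```python
import os


def enumerate_guests(guestmount: str) -> dict:
    """Map qemu pid -> path of that guest's kallsyms under guestmount."""
    guests = {}
    for entry in sorted(os.listdir(guestmount)):
        root = os.path.join(guestmount, entry)
        # Only numeric directory names are treated as qemu pids.
        if entry.isascii() and entry.isdigit() and os.path.isdir(root):
            guests[int(entry)] = os.path.join(root, "proc", "kallsyms")
    return guests
```

The returned paths are not checked for readability, since in this design the admin/developer is responsible for mounting each guest root.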

Then, perf could access all files. It's possible because guest os instance
happens to be multi-threading in a process. One of the defects is the accessing to


--

From: Arnaldo Carvalho de Melo
Date: Monday, March 22, 2010 - 9:44 am

If the MMAP events on the guest included a cookie that could later be
used to query for the symtab of that DSO, we wouldn't need to access the
guest FS at all, right?

With build-ids and debuginfo-install like tools the symbol resolution
could be performed by using the cookies (build-ids) as keys to get to
the *-debuginfo packages with matching symtabs (and DWARF for source
annotation, etc).
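To make the idea concrete, here is a sketch of resolution keyed purely by build-id, with no access to the guest FS. It assumes samples carry a (build-id, offset) pair and that symtabs have already been obtained by some other means (e.g. from debuginfo packages); the data layout and the function name resolve are hypothetical.

```python
import bisect


def resolve(symtabs: dict, build_id: str, offset: int):
    """Look up the symbol covering offset in the DSO with this build-id.

    symtabs maps build-id -> list of (start_offset, name), sorted by start.
    Returns None if the build-id is unknown or the offset precedes all symbols.
    """
    table = symtabs.get(build_id)
    if not table:
        return None
    starts = [start for start, _ in table]
    # Find the last symbol whose start is <= offset.
    i = bisect.bisect_right(starts, offset) - 1
    return table[i][1] if i >= 0 else None
```

Since the key is the build-id rather than a file path, a DSO that is reinstalled during a long session simply shows up under a different key.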

We have that for the kernel as:

[acme@doppio linux-2.6-tip]$ l /sys/kernel/notes 
-r--r--r-- 1 root root 36 2010-03-22 13:14 /sys/kernel/notes
[acme@doppio linux-2.6-tip]$ l /sys/module/ipv6/sections/.note.gnu.build-id 
-r--r--r-- 1 root root 4096 2010-03-22 13:38 /sys/module/ipv6/sections/.note.gnu.build-id
[acme@doppio linux-2.6-tip]$

That way we would cover DSOs being reinstalled in long running 'perf
record' sessions too.

This was discussed some time ago but would require help from the bits
that load DSOs.

build-ids then would be first class citizens.

- Arnaldo
--

From: Zhang, Yanmin
Date: Monday, March 22, 2010 - 8:14 pm

It depends on specific sub commands. As for 'perf kvm top', developers want to see
the profiling immediately. Even with 'perf kvm record', developers also want to
see results quickly. At least I'm eager for the results when investigating
We can't make sure guest os uses the same os images, or don't know where we
could find the original DVD images being used to install guest os.

Current perf does save build id, including both kernel's and other application


--

From: Arnaldo Carvalho de Melo
Date: Tuesday, March 23, 2010 - 6:15 am

That is not a problem, if you have the relevant buildids in your cache
(Look in your machine at ~/.debug/), it will be as fast as ever.

If you use a distro that has its userspace with build-ids, you probably


You don't have to have guest and host sharing the same OS image, you
just have to somehow populate your buildid cache with what you need, be
it using sshfs or what Ingo is suggesting once, or using what your
vendor provides (debuginfo packages). And you just have to do it once,


But it doesn't fully support that right now; as I explained, build-ids are
collected at the end of the record session, because we have to open the
DSOs that had hits to get the 20-byte cookie we need, the build-id.

If we had it in the PERF_RECORD_MMAP record, we would close this race,
and the added cost at load time should be minimal, to get the ELF
section with it and put it somewhere in task struct.

If only we could coalesce it a bit to reclaim this:

[acme@doppio linux-2.6-tip]$ pahole -C task_struct ../build/v2.6.34-rc1-tip+/kernel/sched.o  | tail -5
	/* size: 5968, cachelines: 94, members: 150 */
	/* sum members: 5943, holes: 7, sum holes: 25 */
	/* bit holes: 1, sum bit holes: 28 bits */
	/* last cacheline: 16 bytes */
};
[acme@doppio linux-2.6-tip]$ 

8-)

Or at least get just one of those 4 bytes holes then we could stick it
at the end to get our build-id there, accessing it would be done only
at PERF_RECORD_MMAP injection time, i.e. close to the time when we
actually are loading the executable mmap, i.e. close to the time when
the loader is injecting the build-id, I guess the extra memory and

- Arnaldo
--

From: Zhang, Yanmin
Date: Tuesday, March 23, 2010 - 6:39 pm
