From: Zhang, Yanmin <yanmin_zhang@linux.intel.com>
Based on the discussion in the KVM community, I worked out this patch to let
perf collect guest OS statistics from the host side. The patch was implemented
with kind help from Ingo, Peter and others. Yang Sheng pointed out a
critical bug and, along with others, provided good suggestions. I really
appreciate their help.
The patch adds a new subcommand, kvm, to perf:
perf kvm top
perf kvm record
perf kvm report
perf kvm diff
The new perf can profile the guest OS kernel but not guest OS user space; it
can, however, summarize guest OS user-space utilization per guest OS.
Below are some examples.
1) perf kvm top
[root@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms
--guestmodules=/home/ymzhang/guest/modules top
--------------------------------------------------------------------------------------------------------------------------
PerfTop: 16010 irqs/sec kernel:59.1% us: 1.5% guest kernel:31.9% guest us: 7.5% exact: 0.0% [1000Hz cycles], (all, 16 CPUs)
--------------------------------------------------------------------------------------------------------------------------
samples pcnt function DSO
_______ _____ _________________________ _______________________
38770.00 20.4% __ticket_spin_lock [guest.kernel.kallsyms]
22560.00 11.9% ftrace_likely_update [kernel.kallsyms]
9208.00 4.8% __lock_acquire [kernel.kallsyms]
5473.00 2.9% trace_hardirqs_off_caller [kernel.kallsyms]
5222.00 2.7% copy_user_generic_string [guest.kernel.kallsyms]
4450.00 2.3% validate_chain [kernel.kallsyms]
4262.00 2.2% trace_hardirqs_on_caller [kernel.kallsyms]
4239.00 2.2% do_raw_spin_lock [kernel.kallsyms]
3548.00 1.9% do_raw_spin_unlock [kernel.kallsyms]
2487.00 1.3% ...
--
Excellent, support for guest kernel != host kernel is critical (I can't remember the last time I ran same kernels).
How would we support multiple guests with different kernels? Perhaps a symbol server that perf can connect to (and that would connect to guests
Should be in common code, not vmx specific.
-- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
The highest quality solution would be if KVM offered a 'guest extension' to the guest kernel's /proc/kallsyms that made it easy for user-space to get this information from an authoritative source. That's the main reason why the host side /proc/kallsyms is so popular and so useful: while in theory it's mostly redundant information which can be gleaned from the System.map and other sources of symbol information, it's easily available and is _always_ trustable to come from the host kernel. Separate System.map's have a tendency to go out of sync (or go missing when a devel kernel gets rebuilt, or if a devel package is not installed), and server ports (be that a TCP port space server or a UDP port space mount-point) are both a configuration hassle and are not guest-transparent. So for instrumentation infrastructure (such as perf) we have a large and well founded preference for intrinsic, built-in, kernel-provided information: i.e. a largely 'built-in' and transparent mechanism to get to guest symbols. Thanks, Ingo --
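To illustrate why a kallsyms-style listing is convenient for tools, here is a minimal Python sketch that parses text in the /proc/kallsyms line format (hex address, type, name, optional [module]) and resolves a sampled address to the nearest preceding symbol. The function names and the sample addresses are made up for illustration; this is not a description of how perf itself does symbol resolution.

```python
import bisect

def parse_kallsyms(text):
    """Parse kallsyms-style lines: '<hex addr> <type> <name> [module]'."""
    symbols = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        addr = int(fields[0], 16)
        name = fields[2]
        module = fields[3].strip("[]") if len(fields) > 3 else None
        symbols.append((addr, name, module))
    symbols.sort(key=lambda s: s[0])
    return symbols

def resolve(symbols, addr):
    """Return (name, offset) for the closest symbol at or below addr, or None."""
    addrs = [s[0] for s in symbols]
    i = bisect.bisect_right(addrs, addr) - 1
    if i < 0:
        return None
    base, name, _module = symbols[i]
    return name, addr - base

# Hypothetical sample data; the addresses are invented for this example.
SAMPLE = """\
ffffffff81000000 T _text
ffffffff8102b4c0 T __ticket_spin_lock
ffffffff8102b500 T copy_user_generic_string
ffffffffa0001000 t e1000_probe\t[e1000]
"""

symbols = parse_kallsyms(SAMPLE)
print(resolve(symbols, 0xffffffff8102b4d0))  # ('__ticket_spin_lock', 16)
```

The same lookup works unchanged on a file copied out of a guest, which is what the --guestkallsyms option in the example at the top of the thread points at.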
The symbol server's client can certainly access the bits through vmchannel. -- error compiling committee.c: too many arguments to function --
Ok, that would work i suspect. Would be nice to have the symbol server in tools/perf/ and also make it easy to add it to the initrd via a .config switch or so. That would have basically all of the advantages of being built into the kernel (availability, configurability, transparency, hackability), while having all the advantages of a user-space approach as well (flexibility, extensibility, robustness, ease of maintenance, etc.). If only we had tools/xorg/ integrated via the initrd that way ;-) Thanks, Ingo --
Note, I am not advocating building the vmchannel client into the host kernel. While that makes everything simpler for the user, it increases the kernel footprint with all the disadvantages that come with that (any bug is converted into a host DoS or worse). So, perf would connect to qemu via (say) a well-known unix domain socket, which would then talk to the guest kernel. I know you won't like it, we'll continue to disagree on this unfortunately. -- error compiling committee.c: too many arguments to function --
Neither am i. What i suggested was a user-space binary/executable built in tools/perf and put into the initrd. That approach has the advantages i listed above, without having the disadvantages of in-kernel code you listed. Thanks, Ingo --
I'm confused - initrd seems to be guest-side. I was talking about the host side. For the guest, placing the symbol server in tools/ is reasonable. -- error compiling committee.c: too many arguments to function --
host side doesnt need much support - just some client capability in perf itself. I suspect vmchannels are sufficiently flexible and configuration-free for such purposes? (i.e. like a filesystem in essence) Ingo --
I haven't followed vmchannel closely, but I think it is. vmchannel is terminated in qemu on the host side, not in the host kernel. So perf would need to connect to qemu. -- error compiling committee.c: too many arguments to function --
Hm, that sounds rather messy if we want to use it to basically expose kernel functionality in a guest/host unified way. Is the qemu process discoverable in some secure way? Can we trust it? Is there some proper tooling available to do it, or do we have to push it through 2-3 packages to get such a useful feature done? ( That is the general thought process how many cross-discipline useful desktop/server features hit the bit bucket before having had any chance of being vetted by users, and why Linux sucks so much when it comes to feature integration and application usability. ) Ingo --
libvirt manages qemu processes, but I don't think this should go through libvirt. qemu can do this directly by opening a unix domain socket in a You can't solve everything in the kernel, even with a well populated tools/. -- error compiling committee.c: too many arguments to function --
How do i get a list of all 'guest instance PIDs', and what is the way to talk I mean, i can trust a kernel service and i can trust /proc/kallsyms. Can perf trust a random process claiming to be Qemu? What's the trust So Qemu has never run into such problems before? ( Sounds weird - i think Qemu configuration itself should be done via a Certainly not, but this is a technical problem in the kernel's domain, so it's a fair (and natural) expectation to be able to solve this within the kernel project. Ingo --
Libvirt manages all qemus, but this should be implemented independently In general qemu exposes communication channels (such as the monitor) as Obviously you can't trust anything you get from a guest, no matter how you get it. How do you trust a userspace program's symbols? you don't. How do you That's exactly what happens. You invoke qemu with -monitor unix:blah,server (or -qmp for a machine-readable format) and have your management application connect to that. You can redirect guest serial ports, console, parallel port, etc. to unix-domain or tcp sockets. Someone writing perf-gui outside the kernel would have the same problems, no? -- error compiling committee.c: too many arguments to function --
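As a rough sketch of what "connect to that" could look like for a tool talking to a -qmp unix socket: QMP exchanges one JSON object per line, the server greets first, and the client must send qmp_capabilities before other commands. The socket path is whatever was passed to qemu and is an assumption here, as are the helper names. query-name is used on the assumption that the running qemu supports it.

```python
import json
import socket

def qmp_command(stream, command):
    """Send one QMP command and return the first reply that is not an event."""
    stream.write(json.dumps({"execute": command}) + "\n")
    stream.flush()
    while True:
        line = stream.readline()
        if not line:
            raise ConnectionError("QMP peer closed the connection")
        reply = json.loads(line)
        if "event" not in reply:  # asynchronous events may be interleaved
            return reply

def guest_name_from_stream(stream):
    """Run the QMP handshake on a text stream and ask for the guest name."""
    greeting = json.loads(stream.readline())
    if "QMP" not in greeting:
        raise ValueError("peer did not send a QMP greeting")
    qmp_command(stream, "qmp_capabilities")  # leave capability negotiation mode
    reply = qmp_command(stream, "query-name")
    return reply.get("return", {}).get("name")

def query_guest_name(path):
    """Connect to the unix socket at `path` (hypothetical location) and query the name."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(path)
        with sock.makefile("rw", encoding="utf-8") as stream:
            return guest_name_from_stream(stream)
```

Note that nothing here answers the trust question raised earlier in the thread: the code believes whatever process is listening on the socket.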
I'm not talking about the symbol strings and addresses, and the object
contents for allocation (or debuginfo). I'm talking about the basic protocol
of establishing which guest is which.
I.e. we really want users to be able to:
1) have it all working with a single guest, without having to specify 'which'
guest (qemu PID) to work with. That is the dominant usecase both for
developers and for a fair portion of testers.
2) Have some reasonable symbolic identification for guests. For example a
usable approach would be to have 'perf kvm list', which would list all
currently active guests:
$ perf kvm list
[1] Fedora
[2] OpenSuse
[3] Windows-XP
[4] Windows-7
And from that point on 'perf kvm -g OpenSuse record' would do the obvious
thing. Users will be able to just use the 'OpenSuse' symbolic name for
that guest, even if the guest got restarted and switched its main PID.
Any such facility needs trusted enumeration and a protocol where i can trust
that the information i got is authoritative. (I.e. 'OpenSuse' truly matches to
the OpenSuse session - not to some local user starting up a Qemu instance that
claims to be 'OpenSuse'.)
Is such a scheme possible/available? I suspect all the KVM configuration tools
(i havent used them in some time - gui and command-line tools alike) use
similar methods to ease guest management?
Ingo
--
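A small Python sketch of the name resolution described in the message above: a symbolic name is mapped to the current qemu PID, entries owned by untrusted users are ignored, and an ambiguous name is an error rather than a guess. The Guest record, the uid-based trust rule and the sample PIDs are all assumptions made for illustration; the thread does not settle where such a list would come from.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Guest:
    name: str  # symbolic name, e.g. as reported by a management tool
    pid: int   # current qemu process ID (changes when the guest restarts)
    uid: int   # user that owns the qemu process

def resolve_guest(guests, name, trusted_uids):
    """Map a symbolic guest name to a PID, ignoring untrusted owners."""
    matches = [g for g in guests if g.name == name and g.uid in trusted_uids]
    if not matches:
        raise LookupError(f"no trusted guest named {name!r}")
    if len(matches) > 1:
        raise LookupError(f"guest name {name!r} is ambiguous")
    return matches[0].pid

# Hypothetical enumeration result.
guests = [
    Guest("Fedora", 4101, 0),
    Guest("OpenSuse", 4207, 0),
    Guest("OpenSuse", 5555, 1001),  # another user's instance claiming the same name
]
print(resolve_guest(guests, "OpenSuse", trusted_uids={0}))  # 4207
```

With trusted_uids={0, 1001} the same lookup raises LookupError, which is the two-users-same-name case that comes up later in the thread.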
There is none. So far, qemu only dealt with managing just its own guest, and left all multiple guest management to higher levels up the You can do that through libvirt, but that only works for guests started through libvirt. libvirt provides command-line tools to list and manage guests (for example autostarting them on startup), and tools built on top of libvirt can manage guests graphically. Looks like we have a layer inversion here. Maybe we need a plugin system - libvirt drops a .so into perf that teaches it how to list guests and get their symbols. -- error compiling committee.c: too many arguments to function --
IMO such ease of use is reasonable and required, full stop. If it cannot be gotten simply then that's a bug: either in the code, or in the design, or in the development process that led to the design. Bugs need Is libvirt used to start up all KVM guests? If not, if it's only used on some distros while other distros have other solutions then there's apparently no good way to get to such information, and the kernel bits of KVM do not provide it. To the user (and to me) this looks like a KVM bug / missing feature. (and the user doesnt care where the blame is) If that is true then apparently the current KVM design has no technically actionable solution for certain categories of features! Ingo --
Developers tend to start qemu from the command line, but the majority of users and all distros I know of use libvirt. Some users cobble up their A plugin system allows anyone who is interested to provide the information; they just need to write a plugin for their management tool. Since we can't prevent people from writing management tools, I don't see what else we can do. -- error compiling committee.c: too many arguments to function --
Perhaps the fact that kvm happens to deal with an interesting application area (virtualization) is misleading here. As far as the host kernel or other host userspace is concerned, qemu is just some random unprivileged userspace program (with some *optional* /dev/kvm services that might happen to require temporary root). As such, perf trying to instrument qemu is no different than perf trying to instrument any other userspace widget. Therefore, expecting 'trusted enumeration' of instances is just as sensible as using 'trusted ps' and 'trusted /var/run/FOO.pid files'. - FChE --
You are quite mistaken: KVM isnt really a 'random unprivileged application' in this context, it is clearly an extension of system/kernel services. ( Which can be seen from the simple fact that what started the discussion was 'how do we get /proc/kallsyms from the guest'. I.e. an extension of the existing host-space /proc/kallsyms was desired. ) In that sense the most natural 'extension' would be the solution i mentioned a week or two ago: to have a (read only) mount of all guest filesystems, plus a channel for profiling/tracing data. That would make symbol parsing easier and it's what extends the existing 'host space' abstraction in the most natural way. ( It doesnt even have to be done via the kernel - Qemu could implement that via FUSE for example. ) As a second best option a 'symbol server' might be used too. Thanks, Ingo --
Hi - I don't know what "extension of system/kernel services" means in this context, beyond something running on the system/kernel, like every other process. To clarify, to what extent do you consider your classification similarly clear for a host that is running
* multiple kvm instances run as unprivileged users
* non-kvm OS simulators such as vmware or xen or gdb
(Sorry, that smacks of circular reasoning.) It may be a charming convenience function for perf users to give them shortcuts for certain favoured configurations (kvm running freshest linux), but that says more about perf than kvm. - FChE --
It means something like my example of 'extended to guest space' To me it sounds like an example supporting my point. /proc/kallsyms is a service by the kernel, and 'perf kvm' desires this to be extended to guest space as well. Thanks, Ingo --
Random tools (like perf) should not be able to do what you describe. It's a security nightmare. If it's desirable to have /proc/kallsyms available, we can expose an interface in QEMU to provide that. That can then be plumbed through libvirt and QMP. Then a management tool can use libvirt or QMP to obtain that information No way. The guest has sensitive data and exposing it widely on the host is a bad thing to do. It's a bad interface. We can expose specific information about guests but only through our existing channels which are validated through a security infrastructure. Ultimately, your goal is to keep perf a simple tool with little dependencies. But practically speaking, if you want to add features to it, it's going to have to interact with other subsystems in the appropriate way. That means, it's going to need to interact with libvirt or QMP. If you want all applications to expose their data via synthetic file systems, then there's always plan9 :-) Regards, Anthony Liguori --
A security nightmare exactly how? Mind to go into details as i dont understand
Firstly, you are putting words into my mouth, as i said nothing about 'exposing it widely'. I suggest exposing it under the privileges of whoever has access to the guest image. Secondly, regarding confidentiality, and this is guest security 101: whoever can access the image on the host _already_ has access to all the guest data! A Linux image can generally be loopback mounted straight away:
losetup -o 32256 /dev/loop0 ./guest-image.img
mount -o ro /dev/loop0 /mnt-guest
(Or, if you are an unprivileged user who cannot mount, it can be read via ext2 tools.) There's nothing the guest can do about that. The host is in total control of guest image data for heaven's sake! All i'm suggesting is to make what is already possible more convenient. Ingo --
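The -o 32256 in the losetup command above is a byte offset. 32256 equals 63 sectors of 512 bytes, the traditional start of the first partition in a DOS/MBR layout, so the command presumably assumes such an image. The Python sketch below computes that offset from the first MBR partition entry instead of hard-coding it. The function name is made up, and the example builds a synthetic MBR rather than reading a real image.

```python
import struct

def first_partition_offset(mbr, sector_size=512):
    """Byte offset of the first MBR partition, read from the partition table."""
    if len(mbr) < 512 or mbr[510:512] != b"\x55\xaa":
        raise ValueError("no MBR boot signature found")
    # The partition table starts at byte 446; the starting LBA is a
    # little-endian 32-bit value at offset 8 inside each 16-byte entry.
    (start_lba,) = struct.unpack_from("<I", mbr, 446 + 8)
    return start_lba * sector_size

# Synthetic 512-byte MBR whose first partition starts at sector 63.
mbr = bytearray(512)
struct.pack_into("<I", mbr, 446 + 8, 63)
mbr[510:512] = b"\x55\xaa"

print(first_partition_offset(bytes(mbr)))  # 32256
```

For a real image the first 512 bytes of the file would be passed in; images with other partitioning schemes or sector sizes need different handling.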
Assume you're using SELinux to implement mandatory access control. How do you label this file system? Generally speaking, we don't know the difference between /proc/kallsyms vs. /dev/mem if we do generic passthrough. While it might be safe to have a relaxed label of kallsyms (since it's read only), it's clearly not safe to do that for /dev/mem, /etc/shadow, or any file containing sensitive information. Rather, we ought to expose a higher level interface that we have more confidence in with respect to understanding the ramifications of That doesn't work as nicely with SELinux. It's completely reasonable to have a user that can interact in a read only mode with a VM via libvirt but cannot read the guest's disk images It's not that simple in a MAC environment. Regards, --
What's your _point_? Please outline a threat model, a vector of attack, Exactly, we want something that has a flexible namespace and works well with Linux tools in general. Preferably that namespace should be human readable, and it should be hierarchic, and it should have a well-known permission model. If a user cannot read the image file then the user has no access to its contents via other namespaces either. That is, of course, a basic security aspect. ( That is perfectly true with a non-SELinux Unix permission model as well, and Erm. Please explain to me, what exactly is 'not that simple' in a MAC environment? Also, i'd like to note that the 'restrictive SELinux setups' usecases are pretty secondary. To demonstrate that, i'd like every KVM developer on this list who reads this mail and who has their home development system where they produce their patches set up in a restrictive MAC environment, in that you cannot even read the images you are using, to chime in with a "I'm doing that" reply. If there's just a _single_ KVM developer amongst dozens and dozens of developers on this list who develops in an environment like that i'd be surprised. That result should pretty much tell you where the weight of instrumentation focus should lie - and it isnt on restrictive MAC environments ... Ingo --
You suggested "to have a (read only) mount of all guest filesystems". As I described earlier, not all of the information within the guest filesystem has the same level of sensitivity. If you exposed a generic interface like this, it makes it very difficult to delegate privileges. Delegating privileges is important because in a higher-security environment, you may want to prevent a management tool from accessing the VM's disk directly, but still allow it to do basic operations (in
If you want to use a synthetic filesystem as the management interface for qemu, that's one thing. But you suggested exposing the guest
I don't think that's reasonable at all. The guest may encrypt its disk
My home system doesn't run SELinux but I work daily with systems that are using SELinux. I want to be able to run tools like perf on these systems because ultimately, I need to debug these systems on a daily basis. But that's missing the point. We want to have an interface that works for both cases so that we're not maintaining two separate interfaces. We've rat holed a bit though. You want:
1) to run perf kvm list and be able to enumerate KVM guests
2) for this to Just Work with qemu guests launched from the command line
You could achieve (1) by tying perf to libvirt but that won't work for (2). There are a few practical problems with (2). qemu does not require the user to associate any uniquely identifying information with a VM. We've also optimized the command line use case so that if all you want to do is run a disk image, you just execute "qemu foo.img". To satisfy your use case, we would either have to force a user to always specify unique information, which would be less convenient for our users, or we would have to let the name be an optional parameter. As it turns out, we already support "qemu -name Fedora foo.img". What we don't do today, but I've been suggesting we should, is automatically create a QMP management socket in a well ...
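Since the name is optional, a tool that wants to label guests has to cope with both "qemu -name Fedora foo.img" and plain "qemu foo.img". The Python sketch below extracts the -name value from a NUL-separated argument vector in the format of /proc/<pid>/cmdline. The function name is hypothetical, and the handling of comma-separated sub-options and a "guest=" prefix is an assumption about qemu versions newer than this thread.

```python
def guest_name_from_cmdline(raw):
    """Extract qemu's -name value from NUL-separated argv bytes, or None."""
    argv = [arg.decode("utf-8", "replace") for arg in raw.split(b"\0") if arg]
    for i, arg in enumerate(argv[:-1]):
        if arg == "-name":
            # Keep only the part before any comma-separated sub-options.
            value = argv[i + 1].split(",", 1)[0]
            if value.startswith("guest="):
                value = value[len("guest="):]
            return value
    return None  # guest was started without -name

print(guest_name_from_cmdline(b"qemu\0-name\0Fedora\0foo.img\0"))  # Fedora
print(guest_name_from_cmdline(b"qemu\0foo.img\0"))                 # None
```

A command line is only a claim made by the process that owns it, so this gives a label, not the trusted enumeration asked for earlier in the thread.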
Hi - To what extent could this be solved with less crossing of isolation/abstraction layers, if the perfctr facilities were properly virtualized? That way guests could run perf goo internally. Optionally virt tools on the host side could aggregate data from cooperating self-monitoring guests. - FChE --
That's the more interesting (by far) usage model. In general guest owners don't have access to the host, and host owners can't (and shouldn't) change guests. Monitoring guests from the host is useful for kvm developers, but less so for users. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
Guest space profiling is easy, and 'perf kvm' is not about that. (plain 'perf' will work if a proper paravirt channel is opened to the host) I think you might have misunderstood the purpose and role of the 'perf kvm' patch here? 'perf kvm' is aimed at KVM developers: it is them who improve KVM code, not guest kernel users. Ingo --
Of course I understood it. My point was that 'perf kvm' serves a tiny minority of users. That doesn't mean it isn't useful, just that it doesn't satisfy all needs by itself. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
I hope you wont be disappointed to learn that 100% of Linux, all 13+ million lines of it, was and is being developed by a tiny, tiny, tiny minority of Of course - and it doesnt bring world peace either. One step at a time. Thanks, Ingo --
Hi Avi, Ingo,
I've been following this long thread since the very first email. I'm a performance engineer whose job is to tune workloads run on top of KVM (and Xen previously). As a performance engineer, I desperately want to have a tool that can monitor the host and guests at the same time. Think about >100 guests mixed with Linux/Windows running together on a single system: being able to know what's happening is critical for performance analysis. Actually I am the person who asked Yanmin to add the feature for CPU utilization breakdown (into host_usr, host_krn, guest_usr, guest_krn) so that I can monitor dozens of running guests. I haven't made this patch work on my system yet but I _do_ think this patch is a very good start. And finally, monitoring guests from the host is useful for users too (administrators and performance guys like me). I really appreciate you guys' work and would love to provide feedback from my point of view if needed.
Regards,
HUANG, Zhiteng
Intel SSG/SSD/SPA/PRC Scalability Lab
-----Original Message-----
From: kvm-owner@vger.kernel.org [mailto:kvm-owner@vger.kernel.org] On Behalf Of Avi Kivity
Sent: Wednesday, March 17, 2010 11:55 AM
To: Frank Ch. Eigler
Cc: Anthony Liguori; Ingo Molnar; Zhang, Yanmin; Peter Zijlstra; Sheng Yang; linux-kernel@vger.kernel.org; kvm@vger.kernel.org; Marcelo Tosatti; Joerg Roedel; Jes Sorensen; Gleb Natapov; Zachary Amsden; ziteng.huang@intel.com
Subject: Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
--
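The four-way breakdown mentioned above (host_usr, host_krn, guest_usr, guest_krn) is what the PerfTop header line in the first example of this thread reports. As a minimal sketch, the Python below pulls those four percentages out of that header line. The regular expression and the mapping from header labels to the four names are assumptions based only on that one sample line.

```python
import re

# Header label -> name used in the message above (mapping assumed).
FIELDS = {
    "kernel": "host_krn",
    "us": "host_usr",
    "guest kernel": "guest_krn",
    "guest us": "guest_usr",
}

def parse_perftop_header(line):
    """Return the CPU utilization breakdown from a 'perf kvm top' header line."""
    # The longer 'guest ...' labels come first so they win over 'kernel'/'us'.
    pairs = re.findall(r"(guest kernel|guest us|kernel|us):\s*([\d.]+)%", line)
    return {FIELDS[label]: float(value) for label, value in pairs}

HEADER = ("PerfTop: 16010 irqs/sec kernel:59.1% us: 1.5% guest kernel:31.9% "
          "guest us: 7.5% exact: 0.0% [1000Hz cycles], (all, 16 CPUs)")

print(parse_perftop_header(HEADER))
```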
Note, 'perfctr' is a different out-of-tree Linux kernel project run by someone else: it offers the /dev/perfctr special-purpose device that allows raw, unabstracted, low-level access to the PMU. I suspect the one you wanted to mention here is called 'perf' or 'perf events'. (and used to be called 'performance counters' or 'perfcounters' until it got renamed about a year ago) Thanks, Ingo --
What did you think, that it would be world-readable? Why would we do such a stupid thing? Any mounted content should at minimum match whatever policy covers the image file. The mounting of contents is not a privilege escalation and it is already possible today - just not integrated properly and not practical. (and apparently not implemented for all the wrong 'security'
_In_ the guest you can of course run it just fine. (once paravirt bits are in place) That has no connection to 'perf kvm' though, which this patch submission is about ... If you want unified profiling of both host and guest then you need access to both the guest and the host. This is what the 'perf kvm' patch is about. Please read the patch, i think you might be misunderstanding what it does ... Regarding encrypted contents - that's really a distraction but the host has absolute, 100% control over the guest and there's nothing the guest can do about that - unless you are thinking about the sub-sub-case of Orwellian DRM-locked-down systems - in which case there's nothing for the host to mount and the guest can reject any requests for information on itself and impose additional policy that way. So it's a security non-issue. Note that DRM is pretty much the worst place to look at when it comes to usability: DRM lock-down is the antithesis of usability. Do you really want KVM to match the mind-set of the RIAA and MPAA? Why do you pretend that a developer cannot mount his own disk image? Pretty please, help Linux instead, where development is driven by usability and accessibility ... Thanks, Ingo --
You're making too many assumptions. There is no list of guests any more than there is a list of web browsers. You can have a multi-tenant scenario where you have distinct groups of
Does "perf kvm list" always run as root? What if two unprivileged users both have a VM named "Fedora"? If we look at the use-case, it's going to be something like, a user is creating virtual machines and wants to get performance information about them. Having to run a separate tool like perf is not going to be what they would expect they had to do. Instead, they would either use their existing GUI tool (like virt-manager) or they would use their management interface (either QMP or libvirt). The complexity of interaction is due to the fact that perf shouldn't be a standalone tool. It should be a library or something with a programmatic interface that another tool can make use of. Regards, --
"multi-tenant" and groups is not a valid excuse at all for giving crappy technology in the simplest case: when there's a single VM. Yes, eventually it can be supported and any sane scheme will naturally support it too, but it's by no means what we care about primarily when it comes to these tools. I thought everyone learned the lesson behind SystemTap's failure (and to a certain degree this was behind Oprofile's failure as well): when it comes to tooling/instrumentation we dont want to concentrate on the fancy complex setups and abstract requirements drawn up by CIOs, as development isnt being done there. Concentrate on our developers today, and provide no-compromises usability to those who contribute stuff. If we dont help make the simplest (and most common) use-case convenient then
Again, the single-VM case is the most important case, by far. If you have multiple VMs running and want to develop the kernel on multiple VMs (sounds rather messy if you think it through ...), what would happen is similar to what happens when we have two probes for example:
# perf probe schedule
Added new event:
probe:schedule (on schedule+0)
You can now use it on all perf tools, such as:
perf record -e probe:schedule -a sleep 1
# perf probe -f schedule
Added new event:
probe:schedule_1 (on schedule+0)
You can now use it on all perf tools, such as:
perf record -e probe:schedule_1 -a sleep 1
# perf probe -f schedule
Added new event:
probe:schedule_2 (on schedule+0)
You can now use it on all perf tools, such as:
perf record -e probe:schedule_2 -a sleep 1
Something similar could be used for KVM/Qemu: whichever got created first is
But ... a GUI interface/integration is of course possible too, and it's being worked on. perf is mainly a kernel developer tool, and kernel developers generally dont use GUIs to do their stuff: which is the (sole) reason why ...
It's about who owns the user interface. If qemu owns the user interface, then we can satisfy this in a very simple way by adding a perf monitor command. If we have to support third party tools, then it significantly complicates things. Regards, --
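The probe:schedule, probe:schedule_1, probe:schedule_2 naming from the perf probe example earlier in the thread was offered as a model for telling apart guests that share a name. A minimal Python sketch of that suffix scheme follows. The function name is made up, and applying the scheme to guest names is the suggestion under discussion, not existing perf behaviour.

```python
def unique_name(base, existing):
    """Return `base`, or `base_N` with the smallest N >= 1 not already taken."""
    if base not in existing:
        return base
    n = 1
    while f"{base}_{n}" in existing:
        n += 1
    return f"{base}_{n}"

# Three guests that all claim the name 'Fedora', in creation order.
names = []
for _ in range(3):
    names.append(unique_name("Fedora", names))
print(names)  # ['Fedora', 'Fedora_1', 'Fedora_2']
```

The suffix depends on creation order, so a name is only stable for as long as the earlier guests stay registered.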
Of course illogical modularization complicates things 'significantly'. I wish both you and Avi looked back 3-4 years and realized what made KVM so successful back then and why the hearts and minds of virtualization developers were captured by KVM almost overnight. KVM's main strength back then was that it was a surprisingly functional piece of code offered by a 10 KLOC patch - right on the very latest upstream kernel. Code was shared with upstream, there was version parity, and it all was in the same single repo which was (and is) a pleasure to develop on. Unlike Xen, which was a 200+ KLOC patch on top of a forked 10 MLOC kernel a few upstream versions back. Xen had constant version friction due to that fork and due to that forced/false separation/modularization: Xen _itself_ was a fork of Linux to begin with. (for example Xen still had my copyrights last i checked, which it got from old Linux code i worked on) That forced separation and version friction in Xen was a development and productization nightmare, and developing on KVM was a truly refreshing experience. (I'll go out on a limb to declare that you wont find a _single_ developer on this list who will tell us otherwise.) Fast forward to 2010. The kernel side of KVM is maximum goodness - by far the worst-quality remaining aspects of KVM are precisely in areas that you mention: 'if we have to support third party tools, then it significantly complicates things'. You kept Qemu as an external 'third party' entity to KVM, and KVM is clearly hurting from that - just see the recent KVM usability thread for examples about suckage. So a similar 'complication' is the crux of the matter behind KVM quality problems: you've not followed through with the original KVM vision and you have not applied that concept to Qemu! And please realize that the user does not care that KVM's kernel bits are top notch, if the rest of the package has sucky aspects: it's always the weakest link of the chain that ...
Any qemu usability problems are because developers (or their employers) are not interested in fixing them, not because of the repository location. Most kvm developer interest is in server-side deployment (even for desktop guests), so there is limited effort in implementing a
I'll ignore the repository location which should be immaterial to a serious developer and concentrate on the 'clean and minimal' aspect, since it has some merit. Qemu development does have a tension between the needs of kvm and tcg. For kvm we need fine-grained threading to improve performance and tons of RAS work. For tcg these are mostly meaningless, and the tcg code has sufficient inertia to reduce the rate at which we can develop. Nevertheless, the majority of developers feel that we'll lose more by a
The majority of patches to qemu don't require changes to kvm, and vice versa. The interface between qemu and kvm is fairly narrow, and most of the changes are related to save/restore and guest debugging, hardly
When a feature is developed that requires both kernel and qemu changes, the same developer makes the changes in both projects. Having them in
Let's make a list of projects who don't need to be in the kernel repository, it will probably be shorter. Seriously, libvirt is a cross-platform cross-hypervisor library, it has
In fact I try hard not to rely too much on that. While both kvm guest and host code are in the same repo, there is an ABI barrier between them because we need to support any guest version on any host version. When designing, writing, or reading guest or host code that interacts across that barrier we need to keep forward and backward compatibility in mind. It's very different from normal kernel APIs that we can adapt
I really don't understand why you believe that. You seem to want a virtualbox-style GUI, and lkml is probably the last place in the world to develop something like that. The developers here are mostly uninterested in ...
If qemu was in tools/kvm/ then we wouldnt have such issues. A single patch (or series of patches) could modify tools/kvm/, arch/x86/kvm/, virt/ and tools/perf/. Numerous times did we have patches to kernel/perf_event.c that fixed some detail, also accompanied by a tools/perf/ patch fixing another detail. Having a single 'culture of contribution' is a powerful way to develop. It turns out kernel developers can be pretty good user-space developers as well and user-space developers can be pretty good kernel developers as well. Some like to do both - as long as it's all within a single project. The moment any change (be it as trivial as fixing a GUI detail or as complex as a new feature) involves two or more packages, development speed slows down to a crawl - while the complexity of the change might be very low! Also, there's the harmful process that people start categorizing themselves into 'I am a kernel developer' and 'I am a user space programmer' stereotypes, The same has been said of oprofile as well: 'it somewhat sucks because we are too server centric', 'nobody is interested in good usability and oprofile is fine for the enterprises'. Ironically, the same has been said of Xen usability as well, up to the point KVM came around. What was the core of the problem was a bad design and a split kernel-side user-side tool landscape. In fact i think saying that 'our developers only care about the server' is borderline dishonest, when at the same time you are making it doubly sure (by inaction) that it stays so: by leaving an artificial package wall between kernel-side KVM and user-side KVM and not integrating the two technologies. You'll never know what heights you could achieve if you leave that wall there ... Furthermore, what should be realized is that bad usability hurts "server features" just as much. Most of the day-to-day testing is done on the desktop by desktop oriented testers/developers. _Not_ by enterprise shops - they tend to see ...
It's not a 1:1 connection. There are more users of the KVM interface. To name a few I'm aware of:
- Mac-on-Linux (PPC)
- Dolphin (PPC)
- Xenner (x86)
- Kuli (s390)
Having a clear userspace interface is the only viable solution there. And if you're interested, look at my MOL enabling patch. It's less than 500 lines of code. The kernel/userspace interface really isn't the difficult part. Getting device emulation working properly, easily and fast is.
Alex
--
There must be a misunderstanding here: tools/perf/ still has a clear userspace interface and ABI. There's external projects making use of it: sysprof and libpfm (and probably more i dont know about). Those projects are also contributing back. Still it's _very_ useful to have a single reference implementation under tools/perf/ where we concentrate the best of the code. That is where we make sure that each new kernel feature is appropriately implemented in user-space as well, that the combination works well together and is releasable to users. That is what keeps us all honest: the latency of features is much lower, and there's no ping-pong of blame going on between the two components in case of bugs or in case of misfeatures.
Same goes for KVM+Qemu: it would be so much nicer to have a single, well-focused reference implementation under tools/kvm/ and have improvements flow into that code base. That way KVM developers cannot just shrug "well, GUI suckage is a user-space problem" - like the answers i got in the KVM usability thread ... The buck will stop here. And if someone thinks he can do better an external project can be started
Why do you suppose that what i propose is an "either or" scenario? It isnt. I just suggested that instead of letting core KVM fragment its limbs into an external entity, put your name behind one good all-around solution and focus the development model into a single project. I.e. do what KVM has done originally in the kernel space to begin with - and where it was so much better than Xen: single focus. Learn from what KVM has done so well in the initial years and use the concept on the user-space components as well. The very same arguments that caused KVM to integrate into the upstream kernel (instead of being a separate project) are a valid basis to integrate the user-space components into tools/kvm/. Dont
The kernel/userspace ABI is not difficult at all. Getting device emulation working properly, easily and ...
That would make sense for a truly minimal userspace for kvm: we once had a tool called kvmctl which was used to run tests (since folded into qemu). It didn't contain a GUI and was unable to run a general purpose guest. It was a few hundred lines of code, and indeed patches to kvmctl had a much closer correspondence to patches with kvm (though still low, Suppose we copy qemu tomorrow into tools/. All the problems will be copied with it. Someone still has to write patches to fix them. Who Moving emulation into the kernel is indeed a problem. Not because it's difficult, but because it indicates that the interfaces exposed to userspace are insufficient to obtain good performance. We had that with That's reasonable in the first iterations of a project. -- error compiling committee.c: too many arguments to function --
If it's functional to the extent of at least allowing say a serial console via the console (like the UML binary allows) i'd expect the minimal user-space to quickly grow out of this minimal state. The rest will be history. Maybe this is a better, simpler (and much cleaner and less controversial) approach than moving a 'full' copy of qemu there. There's certainly no risk: if qemu stays dominant then nothing is lost [tools/kvm/ can be removed after some time], and if this clean base works out fine then the useful qemu technologies will move over to it gradually and without much fuss, and the developers will move with it as well. If it's just a token effort with near zero utility to begin with it certainly wont take off.
Once it's there in tools/kvm/ and bootable i'd certainly hack up some quick xlib based VGA output capability myself - it's not that hard ;-) It would also allow me to test whether latest-KVM still boots fine in a much simpler way. (most of my testboxes dont have qemu installed)
What we saw with tools/perf/ was that pure proximity to actual kernel testers and kernel developers produces a steady influx of new developers. It didnt happen overnight, but it happened. A simple:
  cd tools/perf/
  make -j install
Gets them something to play with. That kind of proximity is very powerful.
The other benefit was that distros can package perf with the kernel package, so it's updated together with the kernel. This means a very efficient distribution of new technologies, together with new kernel releases. Distributions are very eager to update kernels even in stable periods of the distro lifetime - they are much less willing to update user-space packages. You can literally get full KVM+userspace features done _and deployed to users_ within the 3 months development cycle of upstream KVM.
All these create synergies that are very clear once you see the process in motion. It's a powerful positive feedback loop. Give it some thought ...
Alright, you just volunteered. Just give it a go and try to implement the "oh so simple" KVM frontend while maintaining compatibility with at least a few older Linux guests. My guess is that you'll realize it's a dead end before committing anything to the kernel source tree. But really, just try it out. Good Luck Alex --
Sorry, er, what? What distributions eagerly upgrade kernels in stable periods, were it not primarily motivated by security fixes? What users eagerly replace their kernels? - FChE --
Us guys reading and participating on the list. ;) --
I'd like to second that - i'm actually quite happy to update the distro kernel. Also, i have rarely any problems even with bleeding edge kernels in rawhide - they are working pretty smoothly. A large xorg update showing up in yum update gives me the cringe though ;-) Ingo --
Hi - From a parochial point of view, that makes perfect sense: someone else's large software changes are a source of concern. The same thing applies to non-LKML people -- ordinary users -- when *your* large software changes are proposed. Perhaps this change in perspective would help you see the absurdity of proposing kernel-2.6.git as a hosting repository for all kinds of stuff, on the theory that kernel updates get pushed to "eager" users more frequently than other kinds of updates. (Never mind that data shows otherwise.) - FChE --
Please check the popular distro called 'Fedora' for example, and its kernel Those 99% who click on the 'install 193 updates' popup. Ingo --
Of which 1 is the kernel, and 192 are userspace updates (of which one may be qemu). -- error compiling committee.c: too many arguments to function --
I think you didnt understand my (tersely explained) point - which is probably my fault. What i said is:
- distros update the kernel first. Often in stable releases as well if there's a new kernel released. (They must because it provides new hardware enablement and other critical changes they generally cannot skip.)
- Qemu on the other hand is not upgraded with (nearly) that level of urgency. Completely new versions will generally have to wait for the next distro release.
With in-kernel tools the kernel and the tooling that accompanies the kernel are upgraded in the same low-latency pathway. That is a big plus if you are offering things like instrumentation (which perf does), which relates closely to the kernel. Furthermore, many distros package up the latest -git kernel as well. They almost never do that with user-space packages.
Let me give you a specific example: I'm running Fedora Rawhide with 2.6.34-rc1 right now on my main desktop, and that comes with perf-2.6.34-0.10.rc1.git0.fc14.noarch. My rawhide box has qemu-kvm-0.12.3-3.fc14.x86_64 installed. That's more than 1000 Qemu commits older than the latest Qemu development branch.
So by being part of the kernel repo there's lower latency upgrades and earlier and better testing available on most distros. You made it very clear that you dont want that, but please dont try to claim that those advantages do not exist - they are very much real and we are making good use of it.
Thanks,
Ingo
--
This has nothing to do with them being in separate source repos. We could update QEMU to new major feature releases with the same frequency in a Fedora release, but we deliberately choose not to rebase the QEMU userspace because experience has shown the downside from new bugs / regressions outweighs the benefit of any new features. The QEMU updates in stable Fedora trees now just follow the minor bugfix release stream provided by QEMU & those arrive in Fedora with little noticeable delay.
Daniel
--
|: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|
--
That is exactly what i said: Qemu and most user-space packages are on a 'slower' update track than the kernel: generally updated for minor releases. My further point was that the kernel on the other hand gets updated more frequently and as such, any user-space tool bits hosted in the kernel repo get updated more frequently as well. Thanks, Ingo --
Just to play devil's advocate, let's not mix up the development model with the distribution model. There is nothing to stop packagers and distributors from providing separate kernel "proper" packages and perf tools packages. It might even make good sense assuming backwards compatibility for distros that have conservative policies about new kernel versions to provide newer perf tools packages with older kernels. John --
Of course. Some distros are also very conservative about updating the kernel at all. I'm mostly talking about the distros that are at the frontier of kernel development: those with fresh packages, those which provide eager bleeding-edge testers and developers. Ingo --
No, they don't. RHEL 5 is still on 2.6.18, for example. Users don't like their kernels updated unless absolutely necessary, with good reason. F12 recently updated to 2.6.32. This is probably due to 2.6.31.stable dropping away, and no capacity at Fedora to maintain it on their own. So they are caught in a bind - stay on 2.6.31 and expose users to security vulnerabilities or move to 2.6.32 and cause regressions. Not a I'm sure if we ask the Fedora qemu maintainer to package qemu-kvm.git I don't mind at all if rawhide users run on the latest and greatest, but release users deserve a little more stability. -- error compiling committee.c: too many arguments to function --
I just replied to Frank Ch. Eigler with a specific example that shows how this
If you check the update frequency of RHEL 5 kernels you'll see that it's
Happy choice or not, this is what i said is the distro practice these days. (i
Rawhide is generally for latest released versions, to ready them for the next distro release - with special exception for the kernel, which has a special position due to being a hardware-enabler and because it has an extremely predictable release schedule of every 90 days (+- 10 days). Very rarely do distro people jump versions for things like GCC or Xorg or Gnome/KDE, but they've been burned enough times by unexpected delays in those projects to be really loath to do it. Qemu might get an exception - dunno, you could ask.
My point still holds: by hosting KVM user-space bits in the kernel together with the rest of KVM you get version parity - which has clear advantages.
What are you suggesting, that released versions of KVM are not reliable? Of course any tools/ bits are release engineered just as much as the rest of KVM ...
Ingo
--
I'm sorry to say that's pretty bad. Users don't want to update their
So in addition to all the normal kernel regressions, you want to force
No, I am suggesting qemu-kvm.git is not as stable as released versions (and won't get fixes backported). Keep in mind that unlike many userspace applications, qemu exposes an ABI to guests which we must keep compatible.
-- error compiling committee.c: too many arguments to function --
So instead you force a NxN compatibility matrix [all versions of qemu combined with all versions of the kernel] instead of a linear N versions matrix with a clear focus on the last version. Brilliant engineering i have to say ;-) Also, by your argument the kernel should be split up into a micro-kernel, with different packages for KVM, scheduler, drivers, upgradeable separately. That would be a nightmare. (i can detail many facets of that nightmare if you insist but i'll spare the electrons for now) Fortunately few kernel developers I think you still dont understand it: if a tool moves to the kernel repo, then it is _released stable_ together with the next stable kernel. I.e. you'd get a stable qemu-2.6.34 in essence, when v2.6.34 is released. You get minor updates with 2.6.34.1, 2.6.34.2, 2.6.34.3, etc - while development continues. I.e. you get _more_ stability, because a matching kernel is released with a matching Qemu. Qemu might have a different release schedule. Which, i argue, is not a good thing for exactly that reason :-) If it moved to tools/kvm/ it would get the same 90 days release frequency, merge window and stabilization window treatment as the upstream kernel. Furthermore, users can also run experimental versions of qemu together with experimental versions of the kernel, by running something like 2.6.34-rc1 on Rawhide. Even if they dont download the latest qemu git and build it. I.e. clearly _more_ is possible in such a scheme. Ingo --
Thanks. In fact we have a QxKxGxT compatibility matrix since we need to keep compatibility with guests and with tools. Since the easiest interface to keep compatible is the qemu/kernel interface, allowing the kernel and qemu to change independently allows reducing the compatibility matrix while still providing some improvements. Regardless of that I'd keep binary compatibility anyway. Not everyone is on the update treadmill with everything updating every three months
Some kernels do provide some of that facility (without being microkernels), for example the Windows and RHEL kernels.
So it seems I was confused by the talk about 2.6.34-rc1, which isn't stable.
-- error compiling committee.c: too many arguments to function --
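As a rough illustration of the matrix sizes argued over in the two messages above, the sketch below counts combinations under each model. The version labels, guest names and tool names are hypothetical placeholders, not taken from the thread; only the shapes (NxN, linear N, QxKxGxT) come from the discussion.

```python
from itertools import product


def split_matrix(qemu_versions, kernel_versions):
    """Every qemu release paired with every kernel release (the 'NxN' case)."""
    return list(product(qemu_versions, kernel_versions))


def lockstep_matrix(versions):
    """One qemu per kernel release, shipped together (the 'linear N' case)."""
    return [(v, v) for v in versions]


# Hypothetical labels, chosen only for illustration.
versions = ["2.6.32", "2.6.33", "2.6.34"]
guests = ["guest-a", "guest-b"]   # hypothetical guest OS versions (G)
tools = ["tool-x", "tool-y"]      # hypothetical management tools (T)

split = split_matrix(versions, versions)
lockstep = lockstep_matrix(versions)
# Q x K x G x T: qemu, kernel, guest and tool versions all vary independently.
full = list(product(versions, versions, guests, tools))

print(len(split), len(lockstep), len(full))
```

With three releases this counts 9 qemu/kernel pairs for the split model and 3 for the lockstep model; once two guests and two tools are multiplied in, the fully independent case reaches 36 combinations.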
I do believe I've heard of it. According to fedora bodhi, there have been 18 kernel updates issued for fedora 11 since its release, of which 12 were for purely security updates, and most of the other six also contain security fixes. None are described as 'enhancement' updates. Oh, what about fedora 12? 8 updates total, of which 5 are security only, one for drm showstoppers, others including security fixes, again 0 tagged as 'enhancement'. So where is that "eagerness" again?? My sense is that most users are happy to leave a stable kernel running as long as possible, and distributions know this. You surely must understand that the lkml
That's not "eager". That's "I'm exasperated from guessing what's really important; let's not have so many updates; meh".
- FChE
--
You are quite wrong, despite the sarcastic tone you are attempting to use, and this is distro kernel policy 101. For distros such as Fedora it's simpler to support the same kernel version across many older versions of the distro than having to support different kernel versions. Check Fedora 12 for example. Four months ago it was released with kernel v2.6.31: http://download.fedora.redhat.com/pub/fedora/linux/releases/12/Fedora/x86_64/os/Packag... But if you update a Fedora 12 installation today you'll get kernel v2.6.32: http://download.fedora.redhat.com/pub/fedora/linux/updates/12/SRPMS/kernel-2.6.32.9-70... As a result you'll get a new 2.6.32 kernel on Fedora 12. The end result is what i said in the previous mail: that you'll get a newer kernel even on a stable distro - while user-space packages will only be updated if there's a security issue (and even then there's no version jump Erm, fact is, 99% [WAG] of the users click on the update button and accept whatever kernel version the distro update offers them. Ingo --
We would have exactly the same issues, only they would be in a single repository. The only difference is that we could ignore potential alternatives to qemu, libvirt, and RHEV-M. But that's not how kernel ABIs are developed, we try to make them general, not suited to just one
In fact kvm started out in a single repo, and it certainly made it easy to bring it up in baby steps. But we've long outgrown that. Maybe the difference is that perf is still new and thus needs tight cooperation. If/when perf gains a real GUI, I doubt more than 1% of the patches will
Very childish of them. If someone wants to contribute to a userspace project, they can swallow their pride and send patches to a non-kernel
Why is that? If the maintainers of all packages are cooperative and responsive, then the patches will get accepted quickly. If they aren't, development will be slow. It isn't any different from contributing to two unrelated kernel subsystems (which are in fact in different repositories until the
You're encouraging this with your proposal. You're basically using the
I can accept the bad design (not knowing any of the details), but how
The wall is maybe four nanometers high. Please be serious. If someone wants to work on qemu usability all they have to do is to clone the repository and start sending patches to qemu-devel@. What's gained by putting it in the kernel repository? You're saving a minute's worth of
I'm not saying that improved usability isn't a good thing, but time spent on improving the GUI is time not spent on the features that we really want. Desktop oriented users also rarely test 16 vcpu guests with tons of RAM exercising 10Gb NICs and a SAN. Instead they care about graphics
It's hard to contribute a patch that goes against the architecture of the system, where kvm deals with cpu virtualization, qemu (or theoretically another tool) manages a guest, and libvirt (or another tool) manages the host. You want a list of ...
Not at all - as i replied to in a previous mail, tools/perf/ still has a clear userspace interface and ABI, and external projects are making use of it. So there's no problem with the ABI at all. In fact our experience has been the opposite: the perf ABI is markedly better _because_ there's an immediate consumer of it in the form of tools/perf/. It gets tested better and external projects can get their ABI tweaks in as well and can provide a reference implementation for tools/perf. This has happened a couple of times. It's a win-win scenario. So the exact opposite of what you suggest is happening in practice. Thanks, Ingo --
It's very simple: because the contribution latencies and overhead compound, almost inevitably. If you ever tried to implement a combo GCC+glibc+kernel feature you'll know ... Even with the best-run projects in existence it takes forever and is very
I'm afraid practice is different from the rosy ideal you paint there. Even with assumed 'perfect projects' there's always random differences between projects, causing doubled (tripled) overhead and compounded up overhead:
- random differences in release schedules
- random differences in contribution guidelines
You mention a perfect example: contributing to multiple kernel subsystems. Even _that_ is very noticeably harder than contributing to a single subsystem - due to the inevitable bureaucratic overhead, due to different development trees, due to different merge criteria. So you are underlining my point (perhaps without intending to): treating closely related bits of technology as a single project is much better.
Obviously arch/x86/kvm/, virt/ and tools/kvm/ should live in a single development repository (perhaps micro-differentiated by a few topical branches), for exactly those reasons you mention. Just like tools/perf/ and kernel/perf_event.c and arch/*/kernel/perf*.c are treated as a single project.
[ Note: we actually started from a 'split' design [almost everyone picks that, because of this false 'kernel space bits must be separate from user space bits' myth] where the user-space component was a separate code base and unified it later on as the project progressed. Trust me, the practical benefits of the unified approach are enormous to developers and to users alike, and there was no looking back once we made the switch. ]
Also, i dont really try to 'convince' you here - you made your position very clear early on and despite many unopposed technical arguments i made, the positions seem to have hardened and i expect it wont change, no matter what arguments i bring. ...
It's not inevitable, if the projects are badly run, you'll have high How is a patch for the qemu GUI eject button and the kvm shadow mmu related? Should a single maintainer deal with both? -- error compiling committee.c: too many arguments to function --
We have co-maintainers for perf that have a different focus. It works pretty
well.
Look at git log tools/perf/ and how user-space and kernel-space components
interact in practice. You'll see patches that only impact one side, but you'll see
very big overlap both in contributor identity and in patches as well.
Also, let me put similar questions in a bit different way:
- ' how is an in-kernel PIT emulation connected to Qemu's PIT emulation? '
- ' how is the in-kernel dynticks implementation related to Qemu's
implementation of hardware timers? '
- ' how is an in-kernel event for a CD-ROM eject connected to an in-Qemu
eject event? '
- ' how is a new hardware virtualization feature related to being able to
configure and use it via Qemu? '
- ' how is the in-kernel x86 decoder/emulator related to the Qemu x86
emulator? '
- ' how is the performance of the qemu GUI related to the way VGA buffers are
mapped and accelerated by KVM? '
They are obviously deeply related. The quality of a development process is not
defined by the easy cases where no project unification is needed. The quality
of a development process is defined by the _difficult_ cases.
Ingo
--
Where people sent patches, it doesn't suck (or sucks less). Where they don't, it still sucks. And it cost way more than $64K. And it works well when I have patches that change x86 core and kvm. But Both implement the same spec. One is be a code derivative of the other The quality of host kernel timers directly determines the quality of Both implement the same spec. The kernel of course needs to handle all Most features (example: npt) are transparent to userspace, some are not. When they are not, we introduce an ioctl() to kvm for controlling Both implement the same spec. Note qemu is not an emulator but a binary kvm needs to support direct mapping when possible and efficient data transfer when not. The latter will obviously be much slower. When direct mapping is possible, kvm needs to track pages touched by the guest to avoid full screen redraws. The rest (interfacing to X or vnc, implementing emulated hardware acceleration, full-screen mode, etc.) are Not at all. kvm in fact knows nothing about vga, to take your last example. To suggest that qemu needs to be close to the kernel to benefit from the kernel's timer implementation means we don't care about providing quality timing except to ourselves, which luckily isn't the case. Some time ago the various desktops needed directory change notification, and people implemented inotify (or whatever it's called today). No one That's true, but we don't have issues at the qemu/kvm boundary. Note we do have issues at the qemu/aio interfaces and qemu/net interfaces (out of which vhost-net was born) but these wouldn't be solved by tools/qemu/. -- error compiling committee.c: too many arguments to function --
So is your point that the development process and basic code structure does not matter at all, it's just a matter of people sending patches? I beg to Those bits of Fedora which deeply relate to the kernel - yes. Actually, it works much better if, contrary to your proposal it ends up in a single repo. Last i checked both of us really worked on such a project, run by You are obviously arguing for something like UML. Fortunately KVM is not that. Look at the VGA dirty bitmap optimization a'ka the KVM_GET_DIRTY_LOG ioctl. See qemu/kvm-all.c's kvm_physical_sync_dirty_bitmap(). It started out as a VGA optimization (also used by live migration) and even today it's mostly used by the VGA drivers - albeit a weak one. I wish there were stronger VGA optimizations implemented, copying the dirty bitmap is not a particularly performant solution. (although it's certainly better than full emulation) Graphics performance is one of the more painful That is not what i said. I said they are closely related, and where technologies are closely related, project proximity turns into project You are misconstruing and misrepresenting my argument - i'd expect better. Gnome and KDE runs on other kernels as well and is generally not considered close to the kernel. That was not what i suggested. They would be solved by what i proposed: tools/kvm/, right? Thanks, Ingo --
The development process of course matters, and we have worked hard to fix qemu's. Basic code structure also matters, but you don't fix that Well, when last I sent x86 patches, they went to you and hpa, applied to tip, from which I had to merge them back. Two repositories. After several weeks they did end up in a third repository, Linus'. The The VGA dirty bitmap is 256 bytes in length. Copying it doesn't take any time at all. People are in fact working on a copy-less dirty bitmap solution, for live migration of very large memory guests. Expect set_bit_user() If you have suggestions for further optimizations (or even patches) I'd love to hear them. One solution we are working on is QXL, a framebuffer-less graphics card designed for spice. The use case is again server based (hosted I really don't see how. So what if both qemu and kvm implement an i8254? They can't share any code since the internal APIs are so different. Even worse for the x86 emulator as qemu and kvm are fundamentally different. Even more with the qemu timers and kernel The vast majority of qemu has nothing to do with kvm, all the kvm interface bits are in two files. Things like the GUI, the VNC server, IDE emulation, the management interface (the monitor), live migration, qcow2 and ~15 other file format drivers, chipset emulation, USB controller emulation, snapshot support, slirp, serial port emulation, If they were, it would be worth it. -- error compiling committee.c: too many arguments to function --
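To put the 256-byte figure quoted above in perspective, here is a minimal sketch of scanning such a bitmap. It assumes one bit per 4 KiB page and least-significant-bit-first ordering within each byte; both are assumptions made for the illustration, and the helper name dirty_pages is hypothetical, not a KVM or qemu function.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, one bit per page


def dirty_pages(bitmap):
    """Return indices of pages whose bit is set (assumed LSB-first per byte)."""
    return [
        byte_idx * 8 + bit
        for byte_idx, byte in enumerate(bitmap)
        for bit in range(8)
        if byte & (1 << bit)
    ]


bitmap = bytearray(256)        # 256-byte bitmap, the size quoted in the message
bitmap[0] |= 0b00000101        # mark pages 0 and 2 dirty
bitmap[255] |= 0b10000000      # mark the last page dirty

# 256 bytes * 8 bits * 4 KiB per page
covered_mib = len(bitmap) * 8 * PAGE_SIZE // (1024 * 1024)
pages = dirty_pages(bytes(bitmap))

print(covered_mib, pages)
```

Under these assumptions the bitmap tracks 2048 pages, i.e. 8 MiB of video memory, which is why copying it is cheap for VGA while the same scheme scales poorly to very large guest memory during live migration.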
I wouldnt jump to assumptions there. perf shares some facilities with the kernel on the source code level - they can be built both in the kernel and in user-space. But my main thought wasnt even to actually share the implementation - but to actually synchronize when a piece of device emulation moves into the kernel. It is arguably bad for performance in most cases when Qemu handles a given device - so all the common devices should be kernel accelerated. The version and testing matrix would be simplified significantly as well: as So is it your argument that the difference and the duplication in x86 instruction emulation is a good thing? You said it some time ago that the kvm x86 emulator was very messy and you wish it was cleaner. While qemu's is indeed rather different (it's partly a translator/JIT), i'm sure the decoder logic could be shared - and qemu has a slow-path full-emulation fallback in any case, which is similar to what in-kernel emulator does (IIRC ...). That might have changed meanwhile. Ingo --
So, you propose to allow running tools/kvm/ only on the kernel it was shipped with? Of course it isn't a good thing, but it is unavoidable. Qemu compiles code just-in-time to avoid interpretation overhead, while kvm emulates one instruction at a time. No caching is possible, especially with ept/npt, since the guest is free to manipulate memory with no notification to the host. Qemu also supports the full instruction set while kvm only implements what is necessary. Qemu is a IIUC it only ever translates. -- error compiling committee.c: too many arguments to function --
It is, because testing is more focused and more people are testing the combination that developers tested as well. (and not some random version combination picked by the distributor or the user) Ingo --
We have to maintain a dirty bitmap because we don't have a paravirtual graphics driver. IOW, someone needs to write an Xorg driver. Ideally, we could just implement a Linux framebuffer device, right? Well, we took that approach in Xen and that sucks even worse because the Xorg framebuffer driver doesn't implement any of the optimizations that the Linux framebuffer supports and the Xorg driver does not use the kernel's interfaces for providing update regions. Of course, we need to pull in X into the kernel to fix this, right?
Any sufficiently complicated piece of software is going to interact with a lot of other projects. The solution is not to pull it all into one massive repository. It's to build relationships and to find ways to efficiently work with the various communities. And we're working on this with X. We'll have a paravirtual graphics driver very soon. There are no magic solutions. We need more developers working on the hard problems.
Regards,
Anthony Liguori
--
No, you'd want to interact with DRM. ( Especially as you want to write guest accelerators passing guest-space OpenGL requests straight to the kernel DRM level. ) Especially if you want to do things like graphics card virtualization, with aspects of the graphics driver passed through to the guest OS. These are all kernel space projects, going through Xorg would be a horrible waste of performance for full-screen virtualization. It's fine for the windowed or networked case (and good as a compatibility fallback), but very
FYI, this part of X has already been pulled into the kernel, it's called DRM.
That's my whole point with this thread: the kernel side of KVM and qemu, for all practical purposes, should not be two 'separate communities'. They should be one and the same thing. Separation makes sense where the relationship is light or strictly hierarchical - here it's neither. KVM and Qemu are interconnected, quite
The thing is, writing up a DRM connector to a guest Linux OS could be done in no time. It could be deployed to users in no time as well, with the proper development model. That after years and years of waiting proper GX support is _still_ not implemented in KVM is really telling of the efficiency of development based on such disjoint 'communities'. Maybe put up a committee as well to increase efficiency? ;-)
Ingo
--
I don't think I've ever used full-screen mode with my VMs and I use virtualization on a daily basis. I don't see any actual KVM developer complaining about this so I'm not We lose a huge amount of users and contributors if we put QEMU in the Linux kernel. As I said earlier, a huge number of our contributions We've tried to create a "clean" version of QEMU specifically for KVM. Moving it into tools/kvm would be the second step. We've all failed on If the problem is combining the two, I've sent you a patch that you can put into tip.git if you're so inclined. Regards, --
Sorry for getting slightly off-topic but I find the above statement interesting.
I don't use virtualization on a daily basis but a working, fully
integrated full-screen model with VirtualBox was the only reason I
bothered to give VMs a second chance. From my point of view, the user
experience of earlier versions (e.g. Parallels) was just too painful
to live with.
/me crawls back to his hole now...
Pekka
--
That's the same i do, and that's what i'm hearing from other desktop users as well. The moment you work seriously in a guest OS you often want to switch to it full-screen, to maximize screen real-estate and to reduce host GUI element distractions. If it's just casual use of a single app then windowed mode suffices (but in that case performance doesnt matter much to begin with). I find the 'KVM mostly cares about the server, not about the desktop' attitude /me should do that too - this discussion is not resulting in any positive result so it has become rather pointless. Ingo --
It's not kvm, just its developers (and their employers, where applicable). If you post desktop oriented patches I'm sure they'll be welcome. -- error compiling committee.c: too many arguments to function --
Just such a patch-set was posted in this very thread: 'perf kvm'. There were two negative reactions immediately, both showed a fundamental server versus desktop bias: - you did not accept that the most important usecase is when there is a single guest running. - the reaction to the 'how do we get symbols out of the guest' sub-question was, paraphrased: 'we dont want that due to <unspecified> security threat to XYZ selinux usecase with lots of guests'. Anyone being aware of how Linux and KVM is being used on the desktop will know how detached that attitude is from the typical desktop usecase ... Usability _never_ sucks because of lack of patches or lack of suggestions. I bet if you made the next server feature contingent on essential usability fixes they'd happen overnight - for God's sake there's been 1000 commits in the last 3 months in the Qemu repository so there's plenty of manpower... Usability suckage - and i'm not going to be popular for saying this out loud - almost always shows a basic maintainer disconnect with the real world. See your very first reactions to my 'KVM usability' observations. Read back your and Anthony's replies: total 'sure, patches welcome' kind of indifference. It is _your project_, not some other project down the road ... So that is my first-hand experience about how you are welcoming these desktop issues, in this very thread. I suspect people try a few times with suggestions, then get shot down like our suggestions were shot down and then give up. Ingo --
When I review a patch, I try to think of the difficult cases, not just First of all I am not a qemu maintainer. Second, from my point of view all contributors are volunteers (perhaps their employer volunteered them, but there's no difference from my perspective). Asking them to repaint my apartment as a condition to get a patch applied is abuse. If I could drop everything and write a gtk GUI for qemu. Is that what you want? If someone is truly interested in a qemu usability, it's up to them to write the patches. Personally I've never missed the eject button. As to disconnect from the real world, most products based on kvm and qemu (and Linux) are server based. Perhaps that's the reason people emphasise that? Maybe if Linux had 10-20% desktop market penetration, I don't recall anyone trying this much less being shot down. Perhaps people are concentrating on virt-manager and the like and leaving qemu alone. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
Erm, my usability points are _doubly_ true when there are multiple guests ... The inconvenience of having to type: perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms \ --guestmodules=/home/ymzhang/guest/modules top is very obvious even with a single guest. Now multiply that by more guests ... The crux is: we are working on improving KVM instrumentation. There are working patches posted to this thread and we would like to have/implement an automatism to allow the discovery of all this information. The information should be available to the developer who wants it, and easily/transparently so You havent articulated an actionable reason and you have suggested no solution either, you just passive-aggressively backed the claim that giving developers access to the symbol space is some sort of vague 'security threat'. That is the crux of the matter. My experience in these threads was that no-one really seems to feel in charge of the whole thing. Should we really wonder why This is one of the weirdest arguments i've seen in this thread. Almost all the time do we make contributions conditional on the general shape of the project. Developers dont get to do just the fun stuff. This is a basic quid pro quo: new features introduce risks and create additional workload not just for the originating developer but for the rest of the community as well. You should check how Linus has pulled new features in the past 15 years: he very much requires the existing code to first be top-notch before he accepts new features for a given area of functionality. Doing that and insisting on developers to see those imbalances as well is absolutely essential to code quality: otherwise everyone would be running around implementing just the features they are interested in, without regard for the general health of the project. Of course, if you keep the project in two halves (KVM and Qemu), and pretend that they are separate and have little relation, ...
If you want to improve this, you need to do the following:
1) Add a userspace daemon that uses vmchannel that runs in the guest and can fetch kallsyms and arbitrary modules. If that daemon lives in tools/perf, that's fine.
2) Add a QMP interface in qemu to interact with such a daemon.
3) Add a default QMP port in a well known location[1].
4) Modify the perf tool to look for a default QMP port. In the case of a single guest, there's one port. If there are multiple guests, then you will have to connect to each port, find the name or any other identifying information, and let the user choose.
Patches are certainly welcome.
[1] I've written up this patch and will send it out some time today.
Regards,
Anthony Liguori
--
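Step 4 above (connect to each guest's QMP port and find its name) could look roughly like the following Python sketch. It is only an illustration, not the patch mentioned in [1]: the socket directory /var/run/qemu and the *.sock naming are assumptions made up for this example, since the thread does not say where the default port would live. The QMP exchange itself (greeting, qmp_capabilities, then query-name) follows the documented protocol.

```python
import glob
import json
import socket


def qmp_command(name):
    """Encode a QMP command as one JSON line."""
    return (json.dumps({"execute": name}) + "\r\n").encode()


def read_reply(stream):
    """Return the next command reply, skipping asynchronous events."""
    while True:
        line = stream.readline()
        if not line:
            raise ConnectionError("QMP peer closed the connection")
        msg = json.loads(line)
        if "return" in msg:
            return msg["return"]
        if "error" in msg:
            raise RuntimeError(msg["error"])
        # Anything else (e.g. {"event": ...}) is not a reply; keep reading.


def qmp_query_name(sock):
    """Do the QMP handshake on a connected socket and ask for the guest name."""
    stream = sock.makefile("rwb")
    greeting = json.loads(stream.readline())
    if "QMP" not in greeting:
        raise ValueError("peer did not send a QMP greeting")
    reply = {}
    for cmd in ("qmp_capabilities", "query-name"):
        stream.write(qmp_command(cmd))
        stream.flush()
        reply = read_reply(stream)
    # "name" is absent when the guest was started without a name.
    return reply.get("name")


def list_guests(sock_dir="/var/run/qemu"):
    """Map each QMP socket path to its guest name.

    sock_dir and the "*.sock" pattern are hypothetical; no well known
    location has been agreed on in this thread.
    """
    guests = {}
    for path in sorted(glob.glob(sock_dir + "/*.sock")):
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
            s.connect(path)
            guests[path] = qmp_query_name(s)
    return guests
```

With a single guest, list_guests() would return one entry and the tool could use it directly; with several, the names give the user something to choose from.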
Adding any new daemon to an existing guest is a deployment and usability nightmare. The basic rule of good instrumentation is to be transparent. The moment we have to modify the user-space of a guest just to monitor it, the purpose of transparent instrumentation is defeated. That was one of the fundamental usability mistakes of Oprofile. There is no 'perf' daemon - all the perf functionality is _built in_, and for very good reasons. It is one of the main reasons for perf's success as well. Now Qemu is trying to repeat that stupid mistake ... So please either suggest a different transparent solution that is technically better than the one i suggested, or you should concede the point really. Please try to think with the heads of our users and developers and dont suggest some weird ivory-tower design that is totally impractical ... And no, you have to code none of this, we'll do all the coding. The only thing we are asking is for you to not stand in the way of good usability ... Thanks, Ingo --
Absolutely. In most cases it is not desirable, and you'll find that in a lot of cases it is not even possible - for non-technical reasons. One of the main benefits of virtualization is the ability to manage and Not to mention Heisenbugs and interference. Cheers --
Correct. Frankly, i was surprised (and taken slightly off base) by both Avi and Anthony suggesting such a clearly inferior "add a demon to the guest space" solution. It's a usability and deployment non-starter. Furthermore, allowing a guest to integrate/mount its files into the host VFS space (which was my suggestion) has many other uses and advantages as well, beyond the instrumentation/symbol-lookup purpose. So can we please have some resolution here and move on: the KVM maintainers should either suggest a different transparent approach, or should retract the NAK for the solution we suggested. We very much want to make progress and want to write code, but obviously we cannot code against a maintainer NAK, nor can we code up an inferior solution either. Thanks, Ingo --
It's only clearly inferior if you ignore every consideration against it. It's definitely not a deployment non-starter, see the tons of daemons that come with any Linux system. The basic ones are installed So long as you define 'transparent' as in 'only the guest kernel is involved' or even 'only the guest and host kernels are involved' we aren't going to make a lot of progress. I oppose shoving random bits of functionality into the kernel, especially things that are in daily use. While us developers do and will use profiling extensively, it doesn't You haven't heard any NAKs, only objections. If we discuss things perhaps we can achieve something that works for everyone. If we keep turning the flames higher that's unlikely. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
Avi, please dont put arguments into my mouth that i never made.
My (clearly expressed) argument was that:
_a new guest-side demon is a transparent instrumentation non-starter_
What is so hard to understand about that simple concept? Instrumentation is
good if it's as transparent as possible.
Of course lots of other features can be done via a new user-space package ...
Thanks,
Ingo
--
Sorry, that was not the intent. I meant that putting things into the I believe you can deploy this daemon via a (default) package, without any hassle to users. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
FWIW, there's no reason you couldn't consume a vmchannel port from within the kernel. I don't think the code needs to be in the kernel and from a security PoV, that suggests that it should be in userspace IMHO. But if you want to make a kernel thread, knock yourself out. I have no objection to that from a qemu perspective. I can't see why Avi would mind either. I think it's papering around another problem (the kernel should control initrds IMHO) but that's a different topic. Regards, Anthony Liguori --
The logical conclusion of that is that everything should be built into the kernel. Where a failure brings the system down or worse. Where you have to bear the memory footprint whether you ever use the functionality or not. Where to update the functionality you need to deploy a new kernel (possibly introducing unrelated bugs) and reboot. If userspace daemons are such a deployment and usability nightmare, inetd.d style 'drop a listener config here and it will be executed on connection' should work. The listener could come with the kernel package, though I don't think it's a good idea. module-init-tools Thanks. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
Which userspace? Deploying *anything* in the guest can be a nightmare, including paravirt drivers if you don't have a natively supported in the OS virtual hardware backoff. Deploying things in the host OTOH is business as usual. And you're smart enough to know that. OG. --
That includes the guest kernel. If you can deploy a new kernel in the Thanks. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
That's not always true. The host admin can control the guest kernel via "kvm -kernel" easily enough, but he may or may not have access to the disk that is used in the guest. (think encrypted disks, service agreements, etc) --
There is a matching -initrd argument that you can use to launch a daemon. I believe that -kernel use will be rare, though. It's a lot easier to keep everything in one filesystem. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
I thought this discussion was about making it easy to deploy... and generating a custom initrd isn't easy by any means, and it requires Well, for what it's worth, I rarely ever use anything else. My virtual disks are raw so I can loop mount them easily, and I can also switch my guest kernels from outside... without ever needing to mount those disks. --
That's true. You need to run mkinitrd anyway, though, unless your guest Curious, what do you use them for? btw, if you build your kernel outside the guest, then you already have access to all its symbols, without needing anything further. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
There are two errors in your argument:
1) you are assuming that it's only about kernel symbols
Look at this 'perf report' output:
# Samples: 7127509216
#
# Overhead Command Shared Object Symbol
# ........ .......... ............................. ......
#
19.14% git git [.] lookup_object
15.16% perf git [.] lookup_object
4.74% perf libz.so.1.2.3 [.] inflate
4.52% git libz.so.1.2.3 [.] inflate
4.21% perf libz.so.1.2.3 [.] inflate_table
3.94% git libz.so.1.2.3 [.] inflate_table
3.29% git git [.] find_pack_entry_one
3.24% git libz.so.1.2.3 [.] inflate_fast
2.96% perf libz.so.1.2.3 [.] inflate_fast
2.96% git git [.] decode_tree_entry
2.80% perf libc-2.11.90.so [.] __strlen_sse42
2.56% git libc-2.11.90.so [.] __strlen_sse42
1.98% perf libc-2.11.90.so [.] __GI_memcpy
1.71% perf git [.] decode_tree_entry
1.53% git libc-2.11.90.so [.] __GI_memcpy
1.48% git git [.] lookup_blob
1.30% git git [.] process_tree
1.30% perf git [.] process_tree
0.90% perf git [.] tree_entry
0.82% perf git [.] lookup_blob
0.78% git [kernel.kallsyms] [k] kstat_irqs_cpu
kernel symbols are only a small portion of the symbols. (a single line in this
case)
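As a rough illustration of that split, the following Python sketch sums the Overhead column of 'perf report' text output by symbol type. It assumes the column layout shown above (overhead, command, shared object, a [k] or [.] tag, symbol) and command names without spaces; the function name is made up for this example.

```python
def overhead_by_space(report_text):
    """Sum 'perf report' overhead percentages for kernel ([k]) and other symbols."""
    totals = {"kernel": 0.0, "user": 0.0}
    for line in report_text.splitlines():
        fields = line.split()
        # Skip '#' header lines, blank lines and anything without a percentage.
        if len(fields) < 5 or not fields[0].endswith("%"):
            continue
        kind = "kernel" if fields[3] == "[k]" else "user"
        totals[kind] += float(fields[0].rstrip("%"))
    return totals


sample = """\
# Overhead Command Shared Object Symbol
19.14% git git [.] lookup_object
4.74% perf libz.so.1.2.3 [.] inflate
0.78% git [kernel.kallsyms] [k] kstat_irqs_cpu
"""
print(overhead_by_space(sample))
```

Everything counted under "user" here needs symbols from binaries and libraries in the guest filesystem, not from kallsyms.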
To get to those other symbols we have to read the ELF symbols of those
binaries in the guest filesystem, in ...Okay. So a symbol server is necessary. Still, I don't think -kernel is a good reason for including the symbol server in the kernel itself. If someone uses it extensively together with perf, _and_ they can't put the symbol server in the guest for some reason, let them patch mkinitrd to What about line number information? And the source? Into the kernel I've read every one of your emails. If I misunderstood or overlooked something, I apologize. The thread is very long and at times antagonistic so it's hard to keep all the details straight. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
Sigh. Please read the _very first_ suggestion i made, which solves all that. I
rarely go into discussions without suggesting technical solutions - i'm not
interested in flaming, i'm interested in real solutions.
Here it is, repeated for the Nth time:
Allow a guest to (optionally) integrate its VFS namespace with the host side
as well. An example scheme would be:
/guests/Fedora-G1/
/guests/Fedora-G1/proc/
/guests/Fedora-G1/usr/
/guests/Fedora-G1/.../
/guests/OpenSuse-G2/
/guests/OpenSuse-G2/proc/
/guests/OpenSuse-G2/usr/
/guests/OpenSuse-G2/.../
( This feature would be configurable and would be default-off, to maintain
the current status quo. )
Line number information and the source (dwarf info) and ELF symbols are all
provided and accessible via such an interface - no need to run any 'symbol
demon' on the guest side.
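A minimal sketch of how a host-side tool could use such a layout, assuming the example /guests/<name>/ scheme above (the helper name and the default root are illustrative only, nothing like this exists yet): it maps a path as seen inside the guest to the place where the host would find it.

```python
import posixpath


def guest_path_on_host(guest, guest_path, root="/guests"):
    """Map an absolute path inside a guest to its location in the host-side tree.

    'root' follows the example scheme (/guests/Fedora-G1/...); it is an
    assumption for illustration, not an existing interface.
    """
    if not guest or "/" in guest or guest in (".", ".."):
        raise ValueError("invalid guest name")
    if not guest_path.startswith("/"):
        raise ValueError("guest path must be absolute")
    # normpath collapses "." and ".." components, so the textual result
    # stays inside the guest's own subtree.
    clean = posixpath.normpath(guest_path)
    return posixpath.join(root, guest, clean.lstrip("/"))


print(guest_path_on_host("Fedora-G1", "/usr/lib/libz.so.1.2.3"))
```

A tool like perf would then open the returned path on the host to read ELF symbols or dwarf info for a DSO that a guest sample points at. Symlinks inside the guest tree are not handled by this sketch.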
And, obviously, having the guest VFS namespace (optionally) available on the
host side also has far more uses than perf's symbol needs.
I was surprised no-one ever came up with such a suggestion - it is so obvious
to allow the integration of the VFS namespaces. But given your explicit
declaration of your KVM desktop usability indifference i'm kind of not
surprised about that anymore.
Thanks,
Ingo
--
Heh, funny. That would also solve my number one gripe with
virtualization these days: how to get files in and out of guests
without having to install extra packages on the guest side and
fiddling with mount points on every single guest image I want to play
with.
Pekka
--
FYI, for offline guests, you can use libguestfs[1] to access & change files inside the guest, and read-only access to running guests files. It provides access via an interactive shell, APIs in all major languages, and also has a FUSE module to expose it directly in the host VFS. It could probably be made to work read-write for running guests too if its agent were installed inside the guest & leverage the new Virtio-Serial channel for comms (avoiding any network setup requirements). Regards, Daniel [1] http://libguestfs.org/ -- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :| --
Hi Daniel, (I'm getting slightly off-topic, sorry about that.) Right. Thanks for the pointer. The use case I am thinking of is working on a userspace project and wanting to test a piece of code on multiple distributions before pushing it out. That pretty much means being able to pull from the host git repository (or push to the guest repo) while the guest is running, maybe changing the code a bit and then getting the changes back to the host for the final push. What I do now is I push the changes on the host side to a (private) remote branch and do the work through that. But that's a pretty lame workaround in my opinion. Pekka --
Yes, this is the kind of functionality i'm suggesting. I'd suggest a different implementation for live guests: to drive this from within the live guest side of KVM, i.e. basically a paravirt driver for guestfs. You'd pass file API requests to the guest directly, via the KVM ioctl or so - and get responses from the guest. That will give true read-write access and completely coherent (and still transparent) VFS integration, with no host-side knowledge needed for the guest's low level (raw) filesystem structure. That's a big advantage. Yes, it needs an 'aware' guest kernel - but that is a one-off transition overhead whose cost is zero in the long run. (i.e. all KVM kernels beyond a given version would have this ability - otherwise it's guest side distribution transparent) Even 'offline' read-only access could be implemented by booting a minimal kernel via qemu -kernel and using a 'ro' boot option. That way you could eliminate all lowlevel filesystem knowledge from libguestfs. You could run ext4 or btrfs guest filesystems and FAT ones as well - with no restriction. This would allow 'offline' access to Windows images as well: a FAT or ntfs enabled mini-kernel could be booted in read-only mode. Thanks, Ingo --
This is close to the way libguestfs already works. It boots QEMU/KVM pointing to a minimal stripped down appliance linux OS image, containing a small agent it talks to over some form of vmchannel/serial/virtio-serial device. Thus the kernel in the appliance it runs is the only thing that needs to know about the filesystem/lvm/dm on-disk formats - libguestfs definitely does not want to be duplicating this detailed knowledge of on disk format itself. It is doing full read-write access to the guest filesystem in offline mode - one of the major use cases is disaster recovery from an unbootable guest OS image. Regards, Daniel -- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :| --
As Dan said, the 'daemon' part is separate and could be run as a standard part of a guest install, talking over vmchannel to the host. The only real issue I can see is adding access control to the daemon (currently it doesn't need it and doesn't do any). Doing it this way you'd be leveraging the ~250,000 lines of existing libguestfs code, bindings in multiple languages, tools etc. Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones New in Fedora 11: Fedora Windows cross-compiler. Compile Windows programs, test, and build Windows installers. Over 70 libraries supprt'd http://fedoraproject.org/wiki/MinGW http://www.annexia.org/fedora_mingw --
I think it would be a nice option to allow such guest-side "daemons" to be executed in the guest context without _any_ guest-side support. This would be possible by building such minimal daemons that use vmchannel, and which are built for generic x86 (maybe even built for 32-bit x86 so that they can run on any x86 distro). They could execute as the init task of any guest kernel - Qemu could 'blend in / replace' the binary as the init task of the guest temporarily - and some simple bootstrap code could then start the daemon and start the real init binary (and turn off the 'blending' of the init task). That way any guest could be extended via such Qemu functionality - even without any kernel changes. Has anyone thought about (or coded) such a solution perhaps? Ingo --
I think we don't need per-guest-file access control. Probably we could apply the image-file permissions to all guestfs files. This would cover the usecases: * perf for reading symbol information (needs ro-access only anyway) * Desktop like host<->guest file copy I have not looked into libguestfs yet but I guess this approach is easier to achieve. Joerg --
[ Oops, you are right - sorry for not looking more closely! I was confused by Just curious: any plans to extend this to include live read/write access as well? I.e. to have the 'agent' (guestfsd) running universally, so that tools such as perf and by users could rely on the VFS integration as well, not just disaster recovery tools? Without universal access to this feature it's not adequate for instrumentation purposes. One option to achieve that would be to extend Qemu to allow 'qemu daemons' to run on the (Linux) guest side. These would be statically linked binaries that can run on any Linux system, and which could provide various built-in Qemu functionality from the guest side to the host side. Thanks, Ingo --
By default i'd suggest to put it into a maximally restricted mount point. I.e. restrict access to only the security context running libguestfs or so. ( Which in practice will be the user starting the guest, so there will be proper protection from other users while still allowing easy access to the user that has access already. ) Ingo --
Totally. That's not to say there is a definite plan, but we're very open to doing this. We already wrote the daemon in such a way that it doesn't require the appliance part, but could run inside any existing guest (we've even ported bits of it to Windoze ...). The only remaining issue is how access control would be handled. You obviously wouldn't want anything in the host that can get access to the vmchannel socket to start sending destructive write commands into guests. Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones virt-df lists disk usage of guests without needing to install any software inside the virtual machine. Supports Linux and Windows. http://et.redhat.com/~rjones/virt-df/ --
[...] You're missing something. This sub-thread is about someone launching a kernel with 'qemu -kernel', the kernel lives outside the guest disk image, they don't want a custom initrd because it's hard to make. -- error compiling committee.c: too many arguments to function --
Well, you know, I am missing your point here about initrd. Surely the
guest kernels need to use sys_mount() at some point at which time they
could just tell the host kernel where they can find the mount points?
But maybe we're not talking about that kind of scenario here?
Pekka
--
Above example shows perf could summarize both kernel and application hot functions. If we collect guest os statistics from host side, we can't summarize detailed guest os application info because we couldn't get guest os's application process id from host side. So we could only get detailed kernel info and the total utilization percent of --
Various things, here is one use case which I think is under-used: read-only virtual disks with just one network application on them (no runlevels, sshd, user accounts, etc), a hell of a lot easier to maintain and secure than a full blown distro. Want a new kernel? boot a new VM and swap it for the old one with zero downtime (if your network app supports this sort of hot-swap - which a lot of cluster apps do) Another reason for wanting to keep the kernel outside is to limit the potential points of failure: remove the partition table, remove the bootloader, remove even the ramdisk. Also makes it easier to switch to another solution (say UML) or another disk driver (as someone mentioned previously). In virtualized environments I often prefer to remove the ability to load kernel modules too, for obvious reasons. Hope this helps. Antoine --
Note that with perf we can instrument the guest with zero guest-kernel modifications as well. We try to reduce the guest impact to a bare minimum, as the difficulties in deployment are a function of the cross section surface to the guest. Also, note that the kernel is special with regards to instrumentation: since this is the kernel project, we are doing kernel space changes, as we are doing them _anyway_. So adding symbol resolution capabilities would be a minimal addition to that - while adding a whole new guest package for the demon would significantly increase the cross section surface. Ingo --
It's true that for us, changing the kernel is easier than changing the rest of the guest. IMO we should still resist the temptation to go the easy path and do the right thing (I understand we disagree about what the right thing is). -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
It is not about the 'temptation to go the easy path'. It is about finding the most pragmatic approach and realizing the cost of inaction: sucky Linux, sucky KVM. Let me give you an example: Linus's commit in v2.6.30 that changed the user-space policy of the EXT3 filesystem to make it more desktop capable: bbae8bc: ext3: make default data ordering mode configurable That change was opposed vehemently with your kind of arguments: "such changes should be done by the distributions", "it should be done correctly", "the kernel should not implement policy", etc.. I can also tell you that this commit improved my desktop experience incredibly. Still, distros didnt do it for almost a decade of ext3 existence. Why? Truth is that those kinds of "do it right" arguments are mistaken because they assume that we live in an ideal, 'perfect market' where all inefficiencies will get eliminated in the long run. In reality the "market" for OSS software is imperfect: - there's marginal costs of action - a too small change has difficulty getting over that - there's costs of modularization (which are both technical and social) - there's the power of the status quo acting against marginally good changes - there's the power of entropy ripping Linux distributions apart making all-distro changes harder So the solution to the "why dont the distributions do this" question you pose is exactly what i propose: _give a default, reference implementation of KVM tooling that has to be eclipsed_. There's the unique position of the kernel that it can impose sanity in a more central way which acts as a reference implementation. I.e. the kernel can very much improve quality all across the board by providing a sane default (in the ext3 case) - or, as in the case of perf, by providing a sane 'baseline' tooling. It should do the same for KVM as well. If we dont do that, Linux will eventually stop mattering on the desktop - and some time after that, it will ...
Yet Linux is gaining ground in the server and embedded space while struggling on the desktop. Apple is gaining ground on the desktop but is invisible on the server side (despite having a nice product - Xserve). It's true Windows achieved server dominance through its desktop power, but I don't think that's what's keeping them there now. In any case, I'm not going to write a kvm GUI. It doesn't match my skills, interest, or my employer's interest. If you wish to see a kvm GUI you have to write one yourself or convince someone to write it (perhaps convince Red Hat to fund such an effort beyond virt-manager). -- error compiling committee.c: too many arguments to function --
It is planned to add support for SPICE remote desktop to virt-manager once that matures & is accepted into upstream KVM/QEMU. That will improve the guest/desktop interaction in many ways compared to VNC or SDL, with improved display resolution changing, copy+paste between host & guest, much better graphics performance, etc. Development efforts aren't totally ignoring the desktop, more that they are focusing on remoting guest desktops, rather than interaction with the host desktop, since that's where a lot of demand is. This benefits single host desktop scenarios too, since there's a lot of overlap in the problems faced there. Regards, Daniel -- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :| --
Frankly, Linux is mainly growing in the server space due to:
1) the server space is technically much simpler than the desktop space. It
is far easier to code up a server performance feature than to
struggle through stupid (server-motivated) package boundaries and get
something done on the desktop. It is far easier to code up a server app
as that space is well standardized and servers tend to be compartmented.
Integration between server apps is much less common than integration
between desktop apps, hence our modularization idiocies cause less
harm there.
2) Linux's growth is still feeding on the remains of the destruction of Unix.
Linux is struggling on the desktop due to the desktop's inherent complexity,
due to the lack of the Unix inertia and due to incompetence, insensitivity,
intellectual arrogance and shortsightedness of server-centric thinking, like
But the thing is, Apple doesnt really care about the server space, yet. It is
lucrative but it is a side-show: it will fall automatically to the 'winner' of
the desktop (or gadget) of tomorrow.
Has the quick fall of Banyan Vines or Netware (both excellent all-around
server products) taught you nothing?
We need a lot more desktop focus in the kernel community. The best method to
achieve this, that i know of currently, is to simply have kernel developers
think outside the kernel box and to have them do bits of user-space coding as
well - and in particular desktop coding. To eat our own dogfood in essence.
Suffer through crap we cause to user-space. To face the _real_ difficulties of
As a maintainer you certainly dont have to write a single line of code, if you
dont want to. You 'just' need to care about the big picture and encourage/help
the flow and balance of the whole project.
Ingo
--
Agreed (minus the 'package boundaries' stuff). Also, Linux is cheaper It's struggling because it isn't competitive technically with other desktops, because there is no application base, because of a chicken-and-egg problem with some drivers, because lack of a stable ABI means you can't get a driver CD with your device so you need a yet-unreleased kernel, because the zillion binary incompatible distributions mean that application developers don't know what to code and test for, because of lack of documentation, to name a few. At least it's improving all the time. The incompetence, insensitivity, intellectual arrogance and shortsightedness of server-centric thinking of my arguments/position are It won't automatically fall to Apple, there's tons of middleware and server apps that need porting (the "ecosystem"), plus they need to work hard on improving their kernel which is desktop oriented. Looks like Not familiar with Banyan, but wasn't Netware a cooperative multitasking command line only thing? It couldn't compete with preemptive modern Try it yourself and report the experience. Note: perf is not desktop Not at all. They have excellent development tools and lots of middleware and other third party products that make it easy to pick Windows. For example, Exchange is more or less standard for groupware, and they made C# and the technology around it easy to develop for, I haven't written that line of code, and no one else has either. Don't tell me they're all scared of me. -- error compiling committee.c: too many arguments to function --
Only if you apply it as a totalitarian rule. Furthermore, the logical conclusion of _your_ line of argument (applied in a totalitarian manner) is that 'nothing should be built into the kernel'. I.e. you are arguing for microkernel Linux, while you see me as arguing for a monolithic kernel. Reality is that we are somewhere inbetween, we are neither black nor white: it's shades of grey. If we want to do a good job with all this then we observe subsystems, we see how they relate to the physical world and decide about how to shape them. We identify long-term changes and re-design modularization boundaries in hindsight - when we got them wrong initially. We dont try to rationalize the status-quo. Lets see one example of that thought process in action: Oprofile. We saw that the modularization of oprofile was a total nightmare: a separate kernel-space and a separate user-space component, which was in constant version friction. The ABI between them was stifling: it was hard to change it (you needed to trickle that through the tool as well which was on a different release schedule, etc. etc.) The result was sucky usability that never went beyond some basic 'you can do profiling' threshold. The subsystem worked well within that design box, and it was worked on by highly competent people - but it was still far, far away from the potential it could have achieved. So we observed those problems and decided to do something about it: - We unified the two parts into a single maintenance domain. There's the kernel-side in kernel/perf_event.c and arch/*/*/perf_event.c, plus the user-side in tools/perf/. The two are connected by a very flexible, forwards and backwards compatible ABI. - We moved much more code into the kernel, realizing that transparent and robust instrumentation should be offered instead of punting abstractions into user-space (which is in a disadvantaged position to implement system-wide abstractions). - We created a ...
I'm certainly a minimalist, but that doesn't follow. Things that require privileged access, or access to the page cache, or that can't be made to perform otherwise should certainly be in the kernel. That's why I submitted kvm for inclusion in the first place. If it's something that can work just as well in userspace but we can't be bothered to fix any 'deployment nightmares', then they shouldn't be in the kernel. Examples include lvm2 and mdadm (which truly are 'deployment nightmares' - you need to start them before you have access No. I'm arguing for reducing bloat wherever possible. Kernel code is I'm not for the status quo either - I'm for reducing the kernel code That's useful because perf is still small. If it were a full fledged 350KLOC GUI, then most of the development would concentrate on the GUI and very little (relatively) would have to do with the kernel. Qemu is in that state today. Please, please look at the recent commits and check how many have actually anything to do with kvm, and how many No argument. I have a similar experience with kvm. The user/kernel break is at the cpu virtualization level - that is kvm is solely responsible for emulating a cpu and userspace is responsible for emulating devices. An exception was made for the PIC/IOAPIC/PIT due to performance considerations - they are emulated in the kernel as well. A common FAQ is why do we not emulate real-mode instructions in qemu. The answer is that it the interface to kvm would be insane - it would emulate a partial cpu. All other users of that interface would have to implement an emulator (there is also a practical argument - the qemu Excellent. However qemu is written by developers for their users, and their users are not worried about an eject button in the qemu SDL interface, or about running the qemu command line by hand. They have complicated management interfaces that do everything, so we concentrate, for example, on a robust RPC interface ...
1) One of the primary design arguments of the micro-kernel design as well was to push as much into user-space as possible without impacting performance too much so you very much seem to be arguing for a micro-kernel design for the kernel. I think history has given us the answer for that fight between microkernels and monolithic kernels. Furthermore, to not engage in hypotheticals about microkernels: by your argument the Oprofile design was perfect (it was minimalistic kernel-space, with all the complexity in user-space), while perf was over-complex (which does many things in the kernel that could have been done in user-space). Practical results suggest the exact opposite happened - Oprofile is being replaced by perf. How do you explain that? 2) In your analysis you again ignore the package boundary costs and artifacts as if they didnt exist. That was my main argument, and that is what we saw with oprofile and perf: while maintaining more kernel-code may be more expensive, it sure pays off for getting us a much better solution in the end. And getting a 'much better solution' to users is the goal of all this, isnt it? I dont mind what you call 'bloat' per se if it's for a purpose that users find like a good deal. I have quite a bit of RAM in most of my systems, having 50K more or less included in the kernel image is far less important than having a healthy and vibrant development model and having satisfied users ... Ingo --
I am not arguing for a microkernel. Again: reduce bloat where possible, I did not say that the amount of kernel and userspace code is the only factor deciding the quality of software. If that were so, microkernels would have won out long ago. It may be that that perf has too much kernel code, and won against oprofile despite that because it was better in other areas. Or it may be that perf has exactly the right user/kernel division. Or maybe perf needs some of the code moved from userspace to the kernel. I don't know, I haven't examined the code. The user/kernel boundary is only one metric for code quality. Nor is it always in favour of pushing things to userspace. Narrowing or simplifying an interface is often an argument in favour of pushing things into the kernel. IMO the reason perf is more usable than oprofile has less to do with the kernel/userspace boundary and more do to with effort and attention spent Package costs are real. We need to bear them. I don't think that because maintaining another package (and the interface between two I'm not worried about 50K or so, I'm worried about a bug in those 50K taking down the guest. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
If you are interested in the first-hand experience of the people who are doing the perf work then here it is: by far the biggest reason for perf success and perf usability is the integration of the user-space tooling with the kernel-space bits, into a single repository and project. The very move you are opposing so vehemently for KVM. Oprofile went the way you proposed, and it was a failure. It failed not because it was bad technology (it was pretty decent and people used it), it was not a failure because the wrong people worked on it (to the contrary, very capable people worked on it), it was a failure in hindsight because it was simply incorrectly split into two projects which stifled the progress of each other. Obviously 3 years ago you'd have seen a similar, big "Oprofile is NOT broken!" flamewar, had i posted the same observations about Oprofile that i expressed about KVM here. (In fact there was a similar, big flamewar about all this when perf was posted a year ago.) And yes, (as you are aware of) i see very similar patterns of inefficiency in the KVM/Qemu tooling relationship as well, hence did i express my views about it. Thanks, Ingo --
Please take a look at the kvm integration code in qemu as a fraction of Every project that has some kernel footprint, except perf, is split like that. Are they all failures? Seems like perf is also split, with sysprof being developed outside the kernel. Will you bring sysprof into the kernel? Will every feature be duplicated in perf and sysprof? -- error compiling committee.c: too many arguments to function --
Hi Avi,
I am glad you brought it up! Sysprof was historically outside of the
kernel (with its own kernel module, actually). While the GUI was
nice, it was much harder to set up compared to oprofile so it wasn't
all that popular. Things improved slightly when Ingo merged the custom
kernel module but the _userspace_ part of sysprof was lagging behind a
bit. I don't know what's the situation now that they've switched over
to perf syscalls but you probably get my point.
It would be nice if the two projects merged but I honestly don't see
any fundamental problem with two (or more) co-existing projects.
Friendly competition will ultimately benefit the users (think KDE and
Gnome here).
Pekka
--
See my previous mail - what i see as the most healthy project model is to have a full solution reference implementation, connected to a flexible halo of plugins or sub-apps. Firefox does that, KDE does that, and Gnome as well to a certain degree. The 'halo' provides a constant feedback of new features, and it also provides competition and pressure on the 'main' code to be top-notch. The problem i see with KVM is that there's no reference implementation! There is _only_ the KVM kernel part which is not functional in itself. Surrounded by a 'halo' - where none of the entities is really 'the' reference implementation we call 'KVM'. This causes constant quality problems as the developers of the main project dont have constant pressure towards good quality (it is not their responsibility to care about user-space bits after all), plus it causes a lack of focus as well: integration between (friendly) competing user-space components is a lot harder than integration within a single framework such as Firefox. I hope this explains my points about modularization a bit better! I suggested KVM to grow a user-space tool component in the kernel repo in tools/kvm/, which would become the reference implementation for tooling. User-space projects can still provide alternative tooling or can plug into this tooling, just like they are doing it now. So the main effect isnt even on those projects but on the kernel developers. The ABI remains and all the user-space packages and projects remain. Yes, i thought Qemu would be a prime candidate to be the baseline for tools/kvm/, but i guess that has become socially impossible now after this flamewar. It's not a big problem in the big scheme of things: tools/kvm/ is best grown up from a small towards larger size anyway ... Thanks, Ingo --
The reference implementation is qemu-kvm.git, in the future qemu.git. Like the reference implementation of device-mapper is The developers of the main project are very much aware that users don't Seems like wanton duplication of effort. Can we throw so many developer-years away on duplicate projects? Assuming not all are true Qemu is open source, you can cp it into tools/kvm. Rewriting it from scratch is a mammoth effort, there's a reason kvm, Xen, and virtualbox all use qemu. Qemu itself copied code from bochs. Writing this stuff is hard, especially if there is something already working. You'll probably get much better threading (the qemu device model is still single threaded), but it will take years to reach where qemu is already at. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
I'm curious, where would you put the limit? Let's imagine a tools/kvm appears, be it qemu or not, that's outside the scope of my question. Would you put the legacy PC bios in there (seabios I guess)? The EFI bios? The windows-compiled paravirtual drivers? The Xorg paravirtual DDX ? Mesa (which includes the pv gallium drivers)? The libvirt-equivalent? The GUI? That's not a rhetorical question btw, I really wonder where the limit should be. OG. --
You have to admit that much of Qemu's past 2-3 years of development was motivated by Linux/KVM (i'd say more than 50% of the code). As such it's one and the same code base - you just continue to define Qemu to be different from KVM. I very much remember how Qemu looked like _before_ KVM: it was a struggling, Would you accept (or at least not NAK) a new tools/kvm/ tool that builds tooling from grounds up, while leaving Qemu untouched? [assuming it's all clean code, etc.] Although i have doubts about how well that would work 'against' your opinion: such a tool would need lots of KVM-side features and a positive attitude from No. Did i ever claim KVM was a failure? I said it's hindered by this design aspect. I'd prefer if sysprof merged into perf as 'perf view' - but its maintainer does not want that - which is perfectly OK. So we are building equivalent functionality into perf instead. Think about it like Firefox plugins: the main Firefox project picks up the functionality of the most popular Firefox plugins all the time. Session Saver, Tab Mix Plus, etc. were all in essence 'merged' (in functionality, not in code) into the 'reference' Firefox project. I think that's a fundamentally healthy model: it allows extensions and thus give others an honest chance to show that you are potentially coding an inferior piece of code - but also express a clear opinion about what you consider a full, usable, high-quality reference implementation and constantly improve this reference implementation. I dont think that can be argued to be a bad model. Yes, it takes a bit of thinking outside the box to do tools/kvm/ but of all people i'd expect some of that from you. Ingo --
It's not the same code base. kvm provides a cpu virtualization service, qemu uses it. There could be other users. qemu could go away one day I couldn't NAK tools/kvm any more than I could NAK a new project outside the kernel repository. IMO it would be duplicated effort, but like I mentioned before, I can't tell volunteers what to do, only recommend Functionality that can be implemented in userspace will not be accepted into kvm unless there are very good reasons why it should be. Things There's a difference between absorbing a small plugin and duplicating a project. -- error compiling committee.c: too many arguments to function --
Since you are talking so much about oProfile in this thread I think it is important to mention that the problem with oProfile was not the repository separation. The problem was (and is) that the kernel and the user-space parts are maintained by different people who dont talk to each other or have a direction where they want to go with the project. Basically the reason of the oProfile failure is a dysfunctional community. I told the kernel-maintainer several times to also maintain user-space but he didn't want that. The situation with KVM is entirely different. Avi commits to kvm.git and qemu-kvm.git so he maintains both. Anthony is working to integrate the qemu-kvm changes into upstream qemu. Further these people work very closely together and the community around KVM works well too. The problems that oProfile has are not even in sight for KVM. Joerg --
Caused by: repository separation and the inevitable code and social fork a Caused by: repository separation and the inevitable code and social fork a Caused by: repository separation and the inevitable code and social fork a What you fail to realise (or what you fail to know, you werent around when Oprofile was written, i was) is that Oprofile _did_ have a functional single community when it was written. The tooling and the kernel bits were written by the same people. But a decade is a long time and the drift happened due to the inevitability of the repository separation, and due to the _inability_ to reach a sane, usable solution within that framework of separation. So i dont see much of a difference to the Oprofile situation really and i see many parallels. I also see similar kinds of desktop usability problems. The difference is that we dont have KVM with a decade of history and we dont have a 'told you so' KVM reimplementation to show that proves the point. I guess it's a matter of time before that happens, because Qemu usability is so abysmal today - so i guess we should suspend any discussions until that happens, no need to waste time on arguing hypotheticals. I think you are rationalizing the status quo. It's as if you argued in 1990 that the unification of East and West Germany wouldnt make much sense because despite clear problems and incompatibilities and different styles westerners were still allowed to visit eastern relatives and they both spoke the same language after all ;-) Thanks, Ingo --
No, the split-repository situation was the smallest problem after all. It was a community thing. If the community doesn't work a single-repo project will also fail. Look at the state of the alpha arch in Linux today, it is maintained in one repository but nobody really cares about it. Thus it is miles behind most other archs Linux supports today in Yes, this was probably the time when everybody was enthusiastic about the feature and they could attract lots of developers. But situation The difference is that KVM has a working community with good developers We actually have lguest which is small. But it lacks functionality and I see that there are issues with KVM today in some areas. You pointed out the desktop usability already. I personally have trouble with the qemu-kvm.git because it is unbisectable. But repository unification doesn't solve the problem here. The point for a single repository is that it simplifies the development process. I agree with you here. But the current process of KVM is not too difficult after all. I don't have to touch qemu sources for most of Um, hmm. I don't think these situations have enough in common to compare them ;-) Joerg --
I dont know how you can find the situation of Alpha comparable, which is a legacy architecture for which no new CPU was manufactured in the past ~10 years. The negative effects of physical obsolescence cannot be overcome even by the very best of development models ... So, what do you think creates code communities and keeps them alive? Developers and code. And the wellbeing of developers is primarily influenced by the repository structure and by the development/maintenance process - i.e. by the 'fun' aspect. (i'm simplifying things there but that's the crux of it.) So yes, i do claim that what stifled and eventually killed off the Oprofile community was the split repository. None of the other Oprofile shortcomings were really unfixable, but this one was. It gave no way for the community to grow in a healthy way, after the initial phase. Features were more difficult and less fun to develop. And yes, there were times when there was still active Oprofile development but the development process warning signs should have been noticed, and the community could have been kept alive by unification and similar measures. Instead what happened was a complete rewrite and a competitive replacement by perf. (Which isnt particularly nice to users btw. - they prefer more gradual transitions - but there was no other option, so many problems accumulated in Oprofile.) I simply do not want to see KVM face the same fate, and yes i do see similar Oprofile certainly had good developers and maintainers as well. In the end it wasnt enough ... Also, a project can easily still be 'alive' but not reach its full potential. Why do you assume that my argument means that KVM isnt viable today? It can very well still be viable and even healthy - just not _as healthy_ as it could I suggested long ago to merge lguest into KVM to cover non-VMX/non-SVM Why doesnt it solve the bisectability problem? The kernel repo is supposed to In my judgement you'd have to do ...
In your very previous paragraphs, you enumerate two separate causes: "repository structure" and "development/maintenance process" as being sources of "fun". Please simply accept that the former is considered by many as absolutely trivial compared to the latter, and additional verbose repetition of your thesis will not change this. - FChE --
Hi Frank,
I can accept that many people consider it trivial but the problem is
that we have _real data_ on kmemtrace and now perf that the number of
contributors is significantly smaller when your code is outside the
kernel repository. Now admittedly both of them are pretty intimate
with the kernel but Ingo's suggestion of putting kvm-qemu in tools/ is
an interesting idea nevertheless.
It's kinda funny to see people argue that having an external
repository is not a problem and that it's not a big deal if building
something from the repository is slightly painful as long as it
doesn't require a PhD when we have _real world_ experience that it
_does_ limit developer base in some cases. Whether or not that applies
to kvm remains to be seen but I've yet to see a convincing argument
why it doesn't.
Pekka
--
qemu has non-Linux developers. Not all of their contributions are relevant to kvm but some are. If we pull qemu into tools/kvm, we lose them. -- error compiling committee.c: too many arguments to function --
Qemu had very few developers before KVM made use of it - i know it because i followed the project prior to KVM. So whatever development activity Qemu has today, it's 99% [WAG] attributable to KVM. It might have non-Linux contributors, but they wouldnt be there if it wasnt for all the Linux contributors ... Furthermore, those contributors wouldnt have to leave - they could simply use a different Git URI ... Ingo --
tools/kvm would drop support for non-Linux hosts, for tcg, and for architectures which kvm doesn't support ("clean and minimal"). That would be the real win, not sharing the repository. But those other contributors would just stay with the original qemu. -- error compiling committee.c: too many arguments to function --
Hi Avi,
Yeah, you probably would but the hypothesis is that you'd end up with
a bigger net developer base for the _Linux_ version. Now you might not
think that's important but I certainly do and I think Ingo does as
well. ;-)
That said, pulling 400 KLOC of code into the kernel sounds really
excessive. Would we need all that if we just do native virtualization
and no actual emulation?
Pekka
--
You're probably correct, but the point is that non-Linux developers also What is native virtualization and no actual emulation? -- error compiling committee.c: too many arguments to function --
What I meant with "actual emulation" was running architecture A code on architecture B, which was qemu's traditional use case. So the question was how much of the 400 KLOC do we need for just KVM on all the architectures that it supports? --
qemu is 620 KLOC. Without cpu emulation that drops to ~480 KLOC. Much of that is device emulation that is not supported by kvm now (like ARM) but some might be needed again in the future (like ARM). x86-only is perhaps 300 KLOC, but kvm is not x86 only. And that is with a rudimentary GUI. GUIs are heavy. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
Yeah. Also, if in fact the claim that the 'repository does not matter' is true then it doesnt matter that it's hosted in tools/kvm/ either, right? I.e. it's a win-win situation. Worst-case nothing happens beyond a Git URI change. Best-case the project is propelled to never seen heights due to contribution advantages not contemplated and not experienced by the KVM guys before ... Ingo --
Again, the second it's moved to tools/kvm/ we strip it off anything that You're exaggerating. There were 773 commits into qemu.git (excluding qemu-kvm.git) in the past three months. 162 for the same period for tools/perf. The pool is not that deep. -- error compiling committee.c: too many arguments to function --
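For anyone wanting to reproduce this kind of comparison, the sketch below counts commits over a time window with git rev-list. The message does not say how the 773 and 162 figures were obtained, so the function name, the checkout paths and the choice to count everything reachable from HEAD (merges included) are assumptions for illustration only.

```shell
#!/bin/sh
# Sketch only: count commits reachable from HEAD in a checkout over a time
# window, optionally limited to some paths. The function name and the example
# checkout paths below are illustrative assumptions.

count_commits() {
    if [ "$#" -lt 2 ]; then
        echo "usage: count_commits REPO_DIR SINCE [PATH...]" >&2
        return 2
    fi
    repo=$1
    since=$2
    shift 2
    # --count prints the number of commits instead of listing them;
    # anything after "--" restricts the count to commits touching those paths.
    git -C "$repo" rev-list --count --since="$since" HEAD -- "$@"
}

# Example (assumes local clones at these hypothetical locations):
#   count_commits ~/src/qemu  "3 months ago"
#   count_commits ~/src/linux "3 months ago" tools/perf
```

Numbers obtained this way depend on when the command is run and on which branch is checked out, so they will not match the figures quoted in the thread.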
There is nothing fun about having one repository or two. Who cares about this anyway? tools/kvm/ probably will draw developers, simply because of the glory associated with kernel work. That's a bug, not a feature. It means that effort is not distributed according to how it's needed, but because The number of kvm and qemu developers keeps increasing. We're having a kvm forum in August where we all meet. Come and see for Rusty posted some initial patches for pv-only kvm but he lost interest before they were completed. No one followed up. btw, lguest has a single repository, userspace and kernel in the same These days qemu-kvm.git is bisectable (though not always trivially). Something I've wanted for a long time is to port kvm_stat to use tracepoints instead of the home-grown instrumentation. But that is unrelated to this new tracepoint. Other than that we're satisfied with There are plenty of un-fun tasks (like fixing bugs and providing RAS features) that we're doing. We don't do this for fun but to satisfy our users. -- error compiling committee.c: too many arguments to function --
And yet your solution to that is to ... do all your work in the kernel space Despite it being another in-kernel subsystem that by your earlier arguments So which one is it, KVM developers are volunteers that do fun stuff and cannot be told about project priorities, or KVM developers are pros who do unfun stuff because they can be told about priorities? I posit that it's both: and that priorities can be communicated - if only you try as a maintainer. All i'm suggesting is to add 'usable, unified user-space' to the list of unfun priorities, because it's possible and because it matters. Ingo --
I've spent the past few months dealing with customers using the libvirt/qemu/kvm stack. Usability is a major problem and is a top priority for me. That is definitely a shift but that occurred before you started your thread. But I disagree with your analysis of what the root of the problem is. It's a very kernel centric view and doesn't consider the interactions between userspace. Regards, --
I have done plenty of userspace work in qemu. I don't have a lack of interest in qemu, just in a desktop GUI. I'm not a GUI person and my employer doesn't have a desktop-on-desktop virtualization product that I I'm satisfied with it as a user. Architecturally, I'd have preferred it to be a userspace tool. It might have improved usability as well to have something with --help instead of a set of debugfs files. But I'm a From my point of view as maintainer, all contributors are volunteers, I can't tell any of them what to do. From the point of view of many of these volunteer's employers, they are wage slaves who do as they're told or else. So: when someone sends me a patch I gratefully accept if it is good or point out the issues if not. At the secret Red Hat headquarters and the kvm weekly conference call I participate in deciding priorities and task So: I require a volunteer to write some GUI code before I accept a patch. Back at the Red Hat lair, we think of what features we drop from the product because the kvm maintainer has gone nuts. The 'unified' part of your suggestion is not a requirement, but an implementation detail. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
IMHO blaming anybody for it but qemu maintainership is very unfair. They intentionally reinvented a less self contained, inferior, underperforming, underfeatured wheel instead of doing the right thing and just making sure that it is as self contained as possible to avoid risking destabilizing their existing codebase. What can anybody (without qemu git commit access) do about it unless the qemu git maintainer changes attitude, dumps the qemu/kvm-all.c nonsense for good, and does the right thing so we can unify for real? We need to move forward, including multithreading the qemu core and be ready to include desktop virtualization protocol when they're ready for submission without being suggested to extend vnc instead to gain a similar speedup (i.e. yet another inferior wheel). Unification means that _all_ qemu users, pure research, theoretical interest, Xen, virtualbox, weird pure software architecture, will be able to push their stuff in for the common good, but that also shall apply to KVM! It has to become clear that reinventing inferior wheels instead of merging the real thing, is absolutely time wasteful, unnecessary, and it won't make any difference as far as KVM is concerned, proof is that 0% of userbase runs qemu git to run KVM (except the kvm-all.c developers to test it perhaps or somebody by mistake not adding -kvm prefix to command line maybe). I don't pretend to rate KVM as more important than all the rest of niche usages for qemu but it shall be _as_ important as the rest and it'd be nice one day to be able to install only qemu on a system and get something actually usable in production. I very much like that qemu gets contributions from everywhere, it's also nice it can run without KVM (that is purely useful as a debugging tool to me but still...). I think it can all happen and unification should be the object for the gain of everyone in both qemu/kvm and even xen and all the rest. --
The maintainers of that architecture could at least continue to maintain it. But that is not the case. Most newer syscalls are not available and overall stability on alpha sucks (kernel crashed when I tried to start Xorg for example) but nobody cares about it. Hardware is still around Right. A living community needs developers that write new code. And the repository structure is one important thing. But in my opinion it is not the most important one. With my 3-4 years experience in the kernel community I made the experience that the maintainers are the most important factor. I find a maintainer not commiting or caring about patches or not releasing new versions much worse than the wrong repository structure. oProfile has this problem with its userspace part. I partly made this bad experience with x86-64 before the architecture merge. KVM does not The biggest problem oProfile has is that it does not support per-process measuring. This is indeed not unfixable but it also doesn't fit well in In fact, the development process in KVM has improved over time. In the early beginnings everything was kept in svn. Avi switched to git some day but at the time when we had these kvm-XX releases both kernel- and user-space together were unbisectable. This has improved to a point where the kernel-part could be bisected. The KVM maintainers and community have shown in the past that they can address problems with the That would have been the best. Rusty already started this work and Because Marcelo and Avi try to keep as close to upstream qemu as possible. So the qemu repo is regularly merged in qemu-kvm and if you want to bisect you may end up somewhere in the middle of the qemu repository which has only very minimal kvm-support. The problem here is that two qemu repositorys exist. But the current effort of Anthony is directed to create a single qemu repository. But thats not done overnight. Merging qemu into the kernel would make Linus in fact a qemu maintainer. True. Tools for ...
It's in fact possible to bisect qemu-kvm.git. If you end up in qemu.git, do a 'git bisect skip'. If you end up in a merge, call the merge point A, bisect A^1..A^2, each time merging A^1 before compiling (the merge is always trivial due to the way we do it). Not fun, but it works. When we complete merging kvm integration into qemu.git, this problem will disappear. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
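The merge case described above can be sketched as a helper for git bisect run. This is only an illustration of the procedure, not the maintainers' own script: the function name bisect_step is made up, and the build-and-test command is whatever the caller supplies. The helper merges A^1 into the commit under test without committing, runs the test, then discards the temporary merge so bisect can move on.

```shell
#!/bin/sh
# Sketch only: one bisect step for a commit on the qemu.git side of a
# qemu-kvm.git merge A. FIRST_PARENT is A^1; the remaining arguments are a
# caller-supplied (hypothetical) command that exits 0 when the tree is good.

bisect_step() {
    if [ "$#" -lt 2 ]; then
        echo "usage: bisect_step FIRST_PARENT BUILD_AND_TEST..." >&2
        return 2
    fi
    first_parent=$1
    shift

    # Remember the commit git bisect asked us to test.
    candidate=$(git rev-parse HEAD) || return 125

    # Merge A^1 into the candidate without committing, so HEAD stays put.
    if ! git merge --no-ff --no-commit "$first_parent"; then
        git reset --hard "$candidate"
        return 125    # 125 tells `git bisect run` to skip this commit
    fi

    "$@"              # build and test the merged tree
    status=$?

    # Drop the temporary merge before bisect checks out the next commit.
    git reset --hard "$candidate"
    return "$status"
}

# Intended use, with A set to the merge commit (shown as comments because it
# needs a real qemu-kvm.git checkout):
#   git bisect start "${A}^2" "${A}^1"
#   then run bisect_step "${A}^1" <build-and-test command> at each step and
#   mark the result with `git bisect good`, `git bisect bad` or `git bisect skip`.
```

Resetting to the saved commit rather than committing the merge keeps the bisect state pointing at the original qemu.git commit, which is the one that should be reported as good or bad.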
You are arguing why maintainers do not act as you suggest, against the huge negative effects of physical obsolescence? Please use common sense: they dont act because ... there are huge negative effects due to physical obsolescence? No amount of development model engineering can offset that negative. Thanks, Ingo --
The solution should be a long lived piece of code that runs without kernel privileges. How the code is delivered to the user is a separate problem. If you want to argue that the kernel should build an initramfs that contains some things that always should be shipped with the kernel but don't need to be within the kernel, I think that's something that's long over due. We could make it a kernel thread, but what's the point? It's much safer for it to be a userspace thread and it doesn't need to interact with the kernel in an intimate way. Regards, Anthony Liguori --
I did suggest a symbol server, and using a well-known location, though I'm unhappy with it. Multiple guest management should be done by the I am comfortable with having someone I trust maintain qemu. While sometimes Anthony overrides me on issues where I know I'm right and he's wrong, still I prefer that to having to do everything myself, I would surely do a worse job due to overload. I you actually look at qemu patches, the vast majority have little to do directly with kvm; and I (along with Marcelo) maintain the kvm That wouldn't change at all if I were to maintain it, since I wouldn't start writing a GUI for it and wouldn't force other contributors to do So, do you think a reply to a patch along the lines of NAK. Improving scalability is pointless while we don't have a decent GUI. I'll review you RCU patches _after_ you've contributed a usable GUI. For a given area, yes. It makes sense to clean up code before changing it, otherwise cruft accumulates rapidly. What you're describing is completely different and amounts to total disregard of contributors' The general health of qemu in terms of code quality was indeed pretty bad and there was (and is) a massive effort to modernise it. If you're interested look at qdev and qmp. Both are efforts to improve the infrastructure rather than add features on rotten code, and very successful IMO. There was no effort to write a GUI since no one appears If there were no capable maintainer I would reluctantly step in. That is not the case. If I were to displace Anthony then qemu quality would suffer, or I would have to drop kvm maintainership, or, if some false Neither do you. At least I have spent enough time among real usability people to know this. I don't have any pretences in this area and am happy to leave it to the experts. As infrastructure projects kvm and qemu should concentrate on providing flexible capabilities to consumers, which then expose it to users. ...
What does this have to do with RCU? I'm talking about KVM, which is a Linux kernel feature that is useless without a proper, KVM-specific app making use of it. RCU is a general kernel performance feature that works across the board. It helps KVM indirectly, and it helps many other kernel subsystems as well. It needs no user-space tool to be useful. KVM on the other hand is useless without a user-space tool. [ Theoretically you might have a fair point if it were a critical feature of RCU for it to have a GUI, and if the main tool that made use of it sucked. But it isnt and you should know that. ] Had you suggested the following 'NAK', applied to a different, relevant subsystem: | NAK. Improving scalability is pointless while we don't have a usable | tool. I'll review your perf patches _after_ you've contributed a usable | tool. you would have a fair point. In fact, we are doing that, we are living by that. It makes absolutely zero sense to improve the scalability of perf if its usability sucks. So where you are trying to point out an inconsistency in my argument there is That is my precise point. KVM is a specific subsystem or "area" that makes no sense without the user-space tooling it relates to. You seem to argue that you have no 'right' to insist on good quality of that tooling - and IMO you are fundamentally wrong with that. Thanks, Ingo --
The example was rcuifying kvm which took place a bit ago. Sorry, it Correct. So should I tell someone that has sent a patch that rcu-ified kvm in order to scale it, that I won't accept the patch unless they do some usability userspace work? say, implementing an eject button. That might hold, but the tool is usable at least for some people - it runs in production. The people running it won't benefit from an eject button or any usability improvement since they run it through a centralized management tool that hides everything. They will benefit from the scalability patches. Should I still make those patches kvm contains many sub-areas. I'm not going to tie unrelated things together like the GUI and scalability, configuration file format and emulator correctness, nested virtualization and qcow2 asynchronity, or other crazy combinations. People either leave en masse or become frustrated if they can't. I do reject patches touching a sub-area that I think needs to be done in userspace, for example. That's not to say kvm development is random. We have a weekly conference call where regular contributors and maintainers of both qemu and kvm participate and where we decide where to focus. Sadly the issue of a qemu GUI is not raised often. Perhaps you can participate and voice your concerns. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
Of course you could say the following:
' Thanks, I'll mark this for v2.6.36 integration. Note that we are not
able to add this to the v2.6.35 kernel queue anymore as the ongoing
usability work already takes up all of the project's maintainer and
testing bandwidth. If you want the feature to be merged sooner than that
then please help us cut down on the TODO and BUGS list that can be found
at XYZ. There are quite a few low-hanging fruits there. '
Although this RCU example is the 'worst' possible example, as it's a pure speedup
change with no functionality effect.
Consider the _other_ examples that are a lot more clear:
' If you expose paravirt spinlocks via KVM please also make sure the KVM
tooling can make use of it, has an option for it to configure it, and
that it has sufficient efficiency statistics displayed in the tool for
admins to monitor.'
' If you create this new paravirt driver then please also make sure it can
be configured in the tooling. '
' Please also add a testcase for this bug to tools/kvm/testcases/ so we dont
repeat this same mistake in the future. '
I'd say most of the high-level feature work in KVM has tooling impact.
And note the important argument that the 'eject button' thing would not occur
naturally in a project that is well designed and has a good quality balance.
It would only occur in the transitional period if a big lump of lower-quality
code is unified with higher-quality code. Then indeed a lot of pressure gets
created on the people working on the high-quality portion to go over and fix
the low-quality portion.
Which, btw., is an unconditionally good thing ...
But even an RCU speedup can be fairly linked/ordered to more pressing needs of
a project.
Really, the unification of two tightly related pieces of code has numerous
clear advantages. Please give it some thought before rejecting it.
Thanks,
Ingo
--
That would be shooting at my own foot as well as the contributor's since I badly want that RCU stuff, and while a GUI would be nice, that itch isn't on my back. You're asking a developer and a maintainer to put off the work they're interested in, in order to work on something someone else is interested All three happen quite commonly in qemu/kvm development. Of course someone who develops a feature also develops a patch that exposes it in Usually, pretty low. Plumbing down a feature is usually trivial. There are exceptions, of course - smp is only supported in qemu-kvm.git, not in upstream qemu.git, for example. In any case of course the work is done in both qemu and kvm - do you think people develop features to see I'm not blind to the advantages. Dropping tcg would be the biggest of them by far (much more than moving the repository, IMO). But there are disadvantages as well. Around two years ago I seriously considered forking qemu, at this time I do not think it is a good idea. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
I think this sums up the root cause of all the problems i see with KVM pretty well. Thanks, Ingo --
A good maintainer has to strike a balance between asking more of people than what they initially volunteer and getting people to implement the less fun things that are nonetheless required. The kernel can take this to an extreme because at the end of the day, it's the only game in town and there is an unending number of potential volunteers. Most other projects are not quite as fortunate. When someone submits a patch set to QEMU implementing a new network backend for raw sockets, we can push back about how it fits into the entire stack wrt security, usability, etc. Ultimately, we can arrive at a different, more user friendly solution (networking helpers) and along with some time investment on my part, we can create a much nicer, more user friendly solution. Still command line based though. Responding to such a patch set with, replace the SDL front end with a GTK one that lets you graphically configure networking, is not reasonable and the result would be one less QEMU contributor in the long run. Over time, we can, and are, pushing people to focus more on usability. But that doesn't get you a first class GTK GUI overnight. The only way you're going to get that is by having a contributor be specifically interested in building such a thing. We simply haven't had that in the past 5 years that I've been involved in the project. If someone stepped up to build this, I'd certainly support it in every way possible and there are probably some steps we could take to even further encourage this. Regards, Anthony Liguori --
Sorry to be blunt, but i dont think there's a different way to say it: i am a user of the software you are maintaining (Qemu) and i dont think you have the basis to educate people about what a good maintainer should do to achieve a quality end result. I think you could/should learn much from Linus and others who very much require quality to permeate the full dimension of a contribution (including usability), beyond the narrow, local scope of the contribution. Thanks, Ingo --
I think we agree at last. Neither I nor my employer are interested in running qemu as a desktop-on-desktop tool, therefore I don't invest any effort in that direction, or require it from volunteers. If you think a good GUI is so badly needed, either write one yourself, or convince someone else to do it. (btw, why are you interested in desktop-on-desktop? one use case is developers, who don't really need fancy GUIs; a second is people who test out distributions, but that doesn't seem to be a huge population; and a third is people running Windows for some application that doesn't run on Linux - hopefully a small category as well. Seems to be quite a small target audience, compared to, say, video editing) -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
Obviously your employer at least in part defers to you when it comes to KVM priorities. So, just to make this really clear, _you_ are not interested in running qemu as a desktop-on-desktop tool, subsequently this kind of disinterest-for-desktop-usability trickled through the whole KVM stack and poisoned your attitude and your contributors' attitude. Too sad really and it's doubly sad that you dont feel anything wrong about To a certain degree we are trying to do a small bit of that (see this very thread) - and you are NAK-ing and objecting the heck out of it via your unreasonable microkernelish and server-centric views. With constant maintainer disinterest there's no wonder a non-desktop-oriented KVM becomes a self-fulfilling prophecy: you think the desktop does not matter, hence it becomes a reality in KVM space which you can constantly refer back to as a 'fact'. I'm interested in desktop-on-desktop because i walk this world with open eyes and i care about Linux, and these days qemu-kvm is the first thing a new Linux user sees about Linux virtualization. I've observed several people i know in person to turn away from Linux and go back to Windows or go over to Apple because they had a much more mature solution. I'd probably turn away from Linux myself if i were a newbie and if i were forced to use KVM on the desktop today. Again, you dont seem to realize that you as a maintainer are at a central point where you have the ability to turn the self-fulfilling prophecy that 'nobody cares about the Linux desktop' into a reality - or where you have the ability to prevent it from happening. It is a hugely harmful process, especially as you seem to delude yourself that you have nothing to do with it. Anyway, it's good you expressed your views about this as this will help the chances of a fresh restart. (which chances are still not too good though) Thanks, Ingo --
Please, don't jump to unjust conclusions. The whole point is that there's no money behind desktop-on-desktop virtualization. Thus nobody pays people to work on it. Thus nothing significant happens in that space. If there was someone standing up to create a really decent desktop qemu front-end I'm confident we'd even officially suggest using that. In fact, that whole discussion did come up in the weekly Qemu/KVM community call and everybody agreed heavily that we do need a desktop client. The problem is just that there is nobody standing up. And I hope you don't expect Avi to be the one creating a GUI. Alex --
Besides, Ingo could just go ahead and use libvirt together with virt-manager. It solves a few of the usability issues he came up with somewhere in this thread, is available even in every current distribution, and *actually* works quite well for the desktop use case. It just desperately needs more brainpower and manpower to make it a competitor to VirtualBox & Co, because it's not as polished and feature-complete yet. But I bet virt-manager's maintainers welcome patches to fix and enhance usability. Most of the needed fixes probably wouldn't touch qemu at all, let alone kvm. Sorry to chime in with my opinion, but this whole thread is incredibly boring and full of non-arguments yet really highly amusing. -- Lukas --
I am also disinterested in ppc virtualization, yet it happened. I am disinterested in ia64 virtualization, yet it happened. I am disinterested in s390 virtualization, yet it happened. Linus doesn't care about virtualization, yet it happened. I don't tell my contributors what to be interested in, only whether their patches are good or not. I can tell you that Red Hat contributors don't work on a desktop kvm GUI not because I discourage them, but because the product we are working on does not contain a desktop kvm GUI. Jan Kiszka contributed a lot of debugger features, fixes, and improvements, presumably he and/or his employer need that more than a kvm desktop GUI. I can't see why you see anything wrong with this. People write patches It would be lovely to have a desktop kvm GUI. I don't feel I have to The perf bits have nothing to do with a GUI or usability for general users. Calling them "unreasonable microkernelish server-centric views" It's a fact that virtualization is happening in the data center, not on the desktop. You think a kvm GUI can become a killer application? fine, write one. You don't need any consent from me as kvm maintainer (if patches are needed to kvm that improve the desktop experience, I'll accept them, though they'll have to pass my unreasonable microkernelish filters). If you're right then the desktop kvm GUI will be a huge hit with zillions of developers and people will drop Windows and switch to Linux just to use it. But my opinion is that it will end up like virtualbox, a nice app that If you're going to use words like 'dishonest' then please don't send me Which distribution are they using? Most people would see virt-manager as the first thing, not open gnome-terminal and start typing in the qemu command line. While it's not perfect, it does have a shiny GUI with It doesn't have to be me. Better to pick someone who has a clue about usability to design and guide this effort. That someone can work ...
You should know the answer yourself: the difference is that usability is a core quality of any project. I as a maintainer can be neutral towards a number of features and patch attributes that i dont consider key aspects. (although they can grow out to become key features in the future. SMP was a fringe thing 15 years ago.) Usability is not an attribute you can ignore and i for sure am never neutral towards usability deficiencies in patches - i consider usability a key Whether a feature is usable or not is sure a metric of 'goodness'. You have restricted your metric of goodness artificially to not include usability. You do that by claiming that the user-space tooling of KVM, while being functionally absolutely essential for any user to even try out KVM, is 'separate' and has no quality connection with the kernel bits of KVM. It is a convenient argument that allows you to do the kernel bits only. It is absolutely catastrophic to the user who'd like to see a usable solution and a single project who stands behind their tech. Thus, _today_, after years of neglect, you can claim that none of the dozens of usability problems of KVM has anything to do with the features you are working on today. It's in a separate project (the so-called 'Qemu' package) after all - none of KVM's business. In reality if you consider it a single project then those bugs were all usability problems introduced earlier on, years ago, when a piece of functionality was exposed via KVM. It adds up and now you claim they have nothing to do with current work. This is why i consider that line of argument rather dishonest ... Ingo --
I am not going to reply to any more email from you on this thread. -- error compiling committee.c: too many arguments to function --
Because i pointed out that i consider a line of argument intellectually dishonest? I did not say _you_ as a person are dishonest - doing that would be an ad hominem attack against your person. (In fact i dont think you are, to the contrary) An argument can certainly be labeled dishonest in a fair discussion and it is not a personal attack against you to express my opinion about that. Thanks, Ingo --
You're being excessively rude in this thread. That might be acceptable on LKML but it's not how the QEMU and KVM communities behave. This thread is a good example of why LKML has the reputation it has. Avi and I argue all of the time on qemu-devel and kvm-devel and it's never degraded into a series of personal attacks like this. I've been trying very hard to turn this into a productive thread attempting to capture your feedback and give clear suggestions about how you can achieve your desired functionality. What are you looking to achieve? Do you just want to piss and moan about how terrible you think Avi and I are? Or do you want to try to actually help make things better? If you want to help make things better, please focus on making constructive suggestions and clarifying what you see as requirements. Regards, Anthony Liguori --
I'm glad that we are at this more productive stage. I'm still trying to achieve the very same technological capabilities that i expressed in the first few mails when i reviewed the 'perf kvm' patch that was submitted by Yanmin. The crux of the problem is very simple. To quote my earlier mail:
|
| - The inconvenience of having to type:
|     perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms \
|     --guestmodules=/home/ymzhang/guest/modules top
|
|   is very obvious even with a single guest. Now multiply that by more guests ...
|
For example we want 'perf kvm top' to do something useful by default: it should find the first guest running and it should report its profile. The tool shouldnt have to guess about where the guests are, what their namespaces are and how to talk to them. We also want easy symbolic access to guests, for example:
  perf kvm -g OpenSuse-2 record sleep 1
I.e.:
- Easy default reference to guest instances, and a way for tools to reference them symbolically as well in the multi-guest case. Preferably something trustable and kernel-provided - not some indirect information like a PID file created by libvirt-manager or so.
- Guest-transparent VFS integration into the host, to recover symbols and debug info in binaries, etc.
There were a few responses to that but none really addressed those problems - they mostly tried to re-define the problem and suggested that i was wrong to want such capabilities and suggested various inferior approaches instead. See the thread for the details - i think i covered every technical suggestion that was made. So we are still at an impasse as far as i can see. If i overlooked some suggestion that addresses these problems then please let me know ... Thanks, Ingo --
Two things are needed. The first thing needed is to be able to
enumerate running guests and identify a symbolic name. I have a patch
for this and it'll be posted this week or so. perf will need to have a
QMP client and it will need to look in ${HOME}/.qemu/qmp/ for sockets to
connect to.
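
[ Illustration, not part of the original mail: a minimal Python sketch of the discovery step described above, assuming only what the mail states, namely that each running guest leaves a UNIX socket under ${HOME}/.qemu/qmp/. The function name list_qmp_sockets is hypothetical, and the patch that would create these sockets had not been posted when this was written. ]

```python
import stat
from pathlib import Path


def list_qmp_sockets(qmp_dir=None):
    """Return the UNIX sockets found in qmp_dir, sorted by path.

    Defaults to ~/.qemu/qmp, the per-user location named in the thread.
    A missing directory simply yields an empty list.
    """
    if qmp_dir is None:
        qmp_dir = Path.home() / ".qemu" / "qmp"
    qmp_dir = Path(qmp_dir)
    if not qmp_dir.is_dir():
        return []
    found = []
    for entry in sorted(qmp_dir.iterdir()):
        try:
            mode = entry.lstat().st_mode
        except OSError:
            continue  # entry vanished between listing and stat
        if stat.S_ISSOCK(mode):
            found.append(entry)
    return found


if __name__ == "__main__":
    for path in list_qmp_sockets():
        print(path)
```

[ A tool would still have to connect to each socket and ask for the guest's name; the sketch only covers finding the sockets for the current user. ]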
This is too much to expect from a client and we've got a GSoC idea
posted to make a nice library for tools to use to simplify this.
The sockets are named based on UUID and you'll have to connect to a
guest and ask it for its name. Some guests don't have names so we'll
A guest is not a KVM concept. It's a qemu concept so it needs to be
something provided by qemu. The other caveat is that you won't see
guests created by libvirt because we're implementing this in terms of a
default QMP device and libvirt will disable defaults. This is desired
behaviour. libvirt wants to be in complete control and doesn't want a
The way I'd like to see this implemented is a guest userspace daemon. I
think having the guest userspace daemon be something that can be updated
by the host is reasonable.
In terms of exposing that on the host, my preferred approach is QMP.
I'd be happy with a QMP command that is essentially,
guest_fs_read(filename) and guest_fd_readdir(path).
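
[ Illustration, not part of the original mail: QMP commands travel as JSON objects with an "execute" member and an optional "arguments" member. The Python sketch below only shows how a request for the proposed guest_fs_read command could be framed. The command did not exist when this was written, the argument name "filename" is taken from the sentence above, and the helper name qmp_command is hypothetical. ]

```python
import json


def qmp_command(name, **arguments):
    """Encode one QMP request as a newline-terminated JSON message."""
    message = {"execute": name}
    if arguments:
        message["arguments"] = arguments
    return json.dumps(message).encode("utf-8") + b"\n"


if __name__ == "__main__":
    # A QMP session starts with capability negotiation.
    print(qmp_command("qmp_capabilities"))
    # Hypothetical request for a guest file, as proposed in the mail.
    print(qmp_command("guest_fs_read", filename="/proc/kallsyms"))
```

[ The bytes returned here would be written to a guest's QMP socket after the server greeting; reading and decoding the reply is left out. ]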
If desired, one could implement a fuse filesystem that interacted with
all local qemu instances to expose this on the host. There's a lot of
ugly things about fuse though so I think sticking to QMP is best
(particularly with respect to root access of a fuse filesystem).
With just those couple things in place, perf should be able to do
exactly what you want it to do.
Regards,
--
Ok, that sounds interesting! I'd rather see some raw mechanism that 'perf kvm' could use instead of having to require yet another library (which generally dampens adoption of a tool). So i think we can work from there. Btw., have you considered using Qemu's command name (task->comm[]) as the symbolic name? That way we could see the guest name in 'top' on the host - a I think just exposing the UUID in that lazy case would be adequate? It creates Hm, this sucks for multiple reasons. Firstly, perf isnt a tool that 'interacts', it's an observation tool: just like 'top' is an observation tool. We want to enable developers to see all activities on the system - regardless of who started the VM or who started the process. Imagine if we had a way to hide tasks from 'top'. It would be rather awful. Secondly, it tells us that the concept is fragile if it doesnt automatically enumerate all guests, regardless of how they were created. Full system enumeration is generally best left to the kernel, as it can offer coherent access. Ingo --
qemu-system-x86_64 -name Fedora,process=qemu-Fedora Does exactly that. We don't make this default based on the principle of least surprise. Many users expect to be able to do killall Perf does interact with a guest though because it queries a guest to read its file system. I understand the point you're making though. If, instead of doing a pull interface where the host queries the guest for files, the guest pushed a small set of files at startup which the host cached, then you could potentially unconditionally expose a "read-only" socket that only I don't see why qemu can't offer coherent access. The limitation today is intentional and if it's overly restrictive, we can figure out a means to change it. Regards, --
Well, in a sense a guest is a KVM concept too: it's in essence represented via the 'vcpu state attached to a struct mm' abstraction that is attached to the /dev/kvm file descriptor attached to a Linux process. Multiple vcpus can be started by the same process to represent SMP, but the whole guest notion is present: a Linux MM that carries KVM state. In that sense when we type 'perf kvm list' we'd like to get a list of all currently present guests that the developer has permission to profile: i.e. we'd like a list of all [debuggable] Linux tasks that have a KVM instance attached to them. A convenient way to do that would be to use the Qemu process's ->comm[] name, and to have a KVM ioctl that gets us a list of all vcpus that the querying task has ptrace permission to. [the standard permission check we do for instrumentation] No need for communication with Qemu for that - just an ioctl, and an always-guaranteed result that works fine on a whole-system and on a per user basis as well. Thanks, Ingo --
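
[ Illustration, not part of the original mail: the KVM ioctl proposed above does not exist, so the Python sketch below is only a user-space approximation of the same listing. It assumes that a process owning a KVM guest holds a file descriptor whose /proc/<pid>/fd link reads "anon_inode:kvm-vm", and it reports the task's comm name next to the PID. The function name list_kvm_tasks and the proc_root parameter are hypothetical. Tasks whose fd directory cannot be read are skipped, which loosely mirrors the permission check mentioned in the mail. ]

```python
import os

KVM_VM_LINK = "anon_inode:kvm-vm"  # assumed link target of a KVM VM descriptor


def list_kvm_tasks(proc_root="/proc"):
    """Return sorted (pid, comm) pairs for tasks holding a KVM VM descriptor."""
    tasks = []
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue  # not a process directory
        fd_dir = os.path.join(proc_root, name, "fd")
        try:
            targets = [os.readlink(os.path.join(fd_dir, fd))
                       for fd in os.listdir(fd_dir)]
        except OSError:
            continue  # no permission to inspect, or the task exited
        if KVM_VM_LINK not in targets:
            continue
        try:
            with open(os.path.join(proc_root, name, "comm")) as f:
                comm = f.read().strip()
        except OSError:
            comm = "?"
        tasks.append((int(name), comm))
    return sorted(tasks)


if __name__ == "__main__":
    for pid, comm in list_kvm_tasks():
        print(pid, comm)
```

[ Unlike an ioctl answered by the kernel in one call, this walk over /proc is not atomic and can miss guests that start or stop while it runs. ]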
You need a way to interact with the guest which means you need some type of device. All of the interesting devices are implemented in qemu so you're going to have to interact with qemu if you want meaningful interaction with a guest. Regards, --
No, you're not. You're trying to fracture the qemu community with your tools/kvm proposal, you're explaining to me how I'm working on the wrong thing by concentrating on things that my employer needs rather than what you think kvm needs, and attaching various unsavoury labels to Anthony and myself. Any wonder we aren't getting anything done? If you can commit to a reasonable conversation we might be able to make Usually 'layering violation' is trotted out at such suggestions. I don't like using the term, because sometimes the layers are incorrect and need to be violated. But it should be done explicitly, not as a shortcut for a minor feature (and profiling is a minor feature, most users will never use it, especially guest-from-host). The fact is we have well defined layers today, kvm virtualizes the cpu and memory, qemu emulates devices for a single guest, libvirt manages guests. We break this sometimes but there has to be a good reason. So perf needs to talk to libvirt if it wants names. Could be done via You simply kept ignoring me when I said that if something can be kept out of the kernel without impacting performance, it should be. I don't want emergency patches closing some security hole or oops in a kernel symbol server. The usability argument is a red herring. True, it takes time for things to trickle down to distributions and users. Those who can't wait can The impasse is mostly due to you insisting on doing everything your way, in the kernel, and disregarding how libvirt/qemu/kvm does things. Learn the kvm ecosystem, you'll find it is quite easy to contribute code. -- error compiling committee.c: too many arguments to function --
Or rather, explained how I am a wicked microkernelist. The herring were out in force today. -- error compiling committee.c: too many arguments to function --
Well, if it's not being a "wicked microkernelist" then what is it?
Performance is hardly the only motivation to put things into the
kernel. Think kernel mode-setting and devtmpfs (with the ironic twist
of original devfs being removed from the kernel) here, for example.
Pekka
--
Motivations include privileged device access, needing to access physical memory, security, and keeping the userspace interface sane. There are others. I don't think any of them hold here. -- error compiling committee.c: too many arguments to function --
[ Sidenote: i still received no adequate suggestions about how to provide this That's weird, how can a feature request be a 'layering violation'? If something that users find straightforward and usable is a layering violation to you (such as easily being able to access their own files on the host as well ...) then i think you need to revisit the definition of that term I never suggested an "in kernel space symbol server" which could oops, why would i have suggested that? Please point me to an email where i suggested It's not just "download and compile", it's also "configure correctly for several separate major distributions" and "configure to per guest instance local rules". It's far more fragile in practice than you make it appear to be, and since you yourself expressed that you are not interested much in the tooling side, how can you have adequate experience to judge such matters? In fact for instrumentation it's beyond a critical threshold of fragility - instrumentation above all needs to be accessible, transparent and robust. If you cannot see the advantages of a properly integrated solution then i suspect there's not much i can do to convince you. And you ignored not just me but you ignored several people in this thread who thought the current status quo was inadequate and expressed interest in both the VFS integration and in the guest enumeration features. Thanks, Ingo --
You need to integrate with libvirt to convert guest names something that The 'something trustable and kernel-provided'. The kernel knows nothing You insisted that it be in the kernel. Later you relaxed that and said a daemon is fine. I'm not going to reread this thread, once is more That's life in Linux-land. Either you let distributions feed you cooked packages and relax, or you do the work yourself. If we had People on kvm-devel manage to build and run release tarballs and even directly from git. I build packages from source occasionally. It isn't Integration in Linux happens at the desktop or distribution level. You want to move it to the kernel level. It works for perf, great, but that doesn't mean it will work for everything else. Once perf grows a GUI, I expect it will stop working for perf as well (for example, if gtk breaks I'm sorry. I don't reply to every email. If you want my opinion on something, you can ask me again. -- error compiling committee.c: too many arguments to function --
The kernel certainly knows about other resources such as task names or network
This is really just the much-discredited microkernel approach for keeping
global enumeration data that should be kept by the kernel ...
Let's look at the ${HOME}/.qemu/qmp/ enumeration method suggested by Anthony.
There are numerous ways that this can break:
- Those special files can get corrupted, mis-setup, get out of sync, or can
be hard to discover.
- The ${HOME}/.qemu/qmp/ solution suggested by Anthony has a very obvious
design flaw: it is per user. When i'm root i'd like to query _all_ current
guest images, not just the ones started by root. A system might not even
have a notion of '${HOME}'.
- Apps might start KVM vcpu instances without adhering to the
${HOME}/.qemu/qmp/ access method.
- There is no guarantee for the Qemu process to reply to a request - while
the kernel can always guarantee an enumeration result. I dont want 'perf
kvm' to hang or misbehave just because Qemu has hung.
Really, for such reasons user-space is pretty poor at doing system-wide
enumeration and resource management. Microkernels lost for a reason.
You are committing several grave design mistakes here.
Thanks,
Ingo
--
But it doesn't know about guest names. You can't trust task names since any user can create a task with any name. Network interfaces are root only so you can trust their names. There are dozens or even hundreds of object classes the kernel does not know about and cannot enumerate. User names, for instance. X sessions. Windows (the screen artifact, not the OS). CIFS shares exported by this machine. Currently running applications (not processes). btw, network interfaces would have been much better off using I disagree it should be kept in the kernel. Why introduce a new namespace, with APIs to query it, manage it, rules regarding conflicts, Take a look at your desktop, userspace is doing all of that everywhere, from enumerating users and groups, to deciding how your disks are I am committing on the shoulders of giants. -- error compiling committee.c: too many arguments to function --
We're stuck in a rut with libvirt and I think a lot of the dissatisfaction with qemu is rooted in that. It's not libvirt that's the problem, but the relationship between qemu and libvirt. We add a feature to qemu and maybe after six months it gets exposed by libvirt. Release time lines of the two projects complicate the situation further. People that write GUIs are limited by libvirt because that's what they're told to use and when they need something simple, they're presented with first getting that feature implemented in qemu, then plumbed through libvirt. It wouldn't be so bad if libvirt was basically a passthrough interface to qemu but it tries to model everything in a generic way which is more or less doomed to fail when you're adding lots of new features (as we are). The list of things that libvirt doesn't support and won't any time soon is staggering. libvirt serves an important purpose, but we need to do a better job in qemu with respect to usability. We can't just punt to libvirt. Regards, Anthony Liguori --
That is somewhat unfair as a blanket statement! While some features have had a long time delay & others are not supported at all, in many cases we have had zero delay. We have been supporting QMP, qdev, vhost-net since before the patches for those features were even merged in QEMU GIT! It varies depending on how closely QEMU & libvirt people have been working together on a feature, and on how strongly end users are demanding As previously discussed, we want to improve both the set of features supported, and make it much easier to support new features promptly. The QMP & qdev stuff has been a very good step forward in making it easier to support QEMU management. There have been proposals from several people, yourself included, on how to improve libvirt's support for the full range of QEMU features. We're committed to looking at this and figuring out which proposals are practical to support, so we can improve QEMU & libvirt interaction for everyone. Regards, Daniel -- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :| --
Sorry, you're certainly correct. Some features appear quickly, but Regards, --
Yes. I think the point was that every layer in between brings potential slowdown and loss of features. Hopefully this will go away with QMP. By then people can decide if they want to be hypervisor agnostic (libvirt) or tightly coupled with qemu (QMP). The best of both worlds would of course be a QMP pass-through in libvirt. No idea if that's easily possible. Either way, things are improving. What people see at the end is virt-manager though. And if you compare it feature-wise as well as looks-wise, vbox is simply superior. Several features are lacking in lower layers too (pv graphics, always working absolute pointers, clipboard sharing, ...). That said it doesn't mean we should resign. It means we know which areas to work on :-). And we know that our problem is not the kernel/userspace interface, but the qemu and above interfaces. Alex--
Exactly. The more 'fragmented' a project is into sub-projects, without a single, unified, functional reference implementation in the center of it, the longer it takes to fix 'unsexy' problems like trivial usability bugs. Furthermore, another negative effect is that many times features are implemented not in their technically best way, but in a way to keep them local to the project that originates them. This is done to keep deployment latencies and general contribution overhead down to a minimum. The moment you have to work with yet another project, the overhead adds up. So developers rather go for the quicker (yet inferior) hack within the sub-project they have best access to. Tell me this isnt happening in this space ;-) Thanks, Ingo --
I disagree there. Keeping things local and self-contained has been the UNIX secret. It works really well as long as the boundaries are well defined. Well - not necessarily hacks. It's more about project boundaries. Nothing is bad about that. You wouldn't want "ls" implemented in the Linux kernel either, right? :-) Alex--
The 'UNIX secret' works for text driven pipelined commands where we are essentially programming via narrow ASCII input of mathematical logic. It doesnt work for a GUI that is a 2D/3D environment of millions of pixels, Have you thought about why that might be so? I think it's because of what i outlined above - that you are trying to apply the "UNIX secret" to GUIs - and that is a mistake. A good GUI is almost at the _exact opposite end of the spectrum_ from a good command-line tool: tightly integrated, with 'layering violations' designed into it all over the place: look i can paste the text from an editor straight into a firefox form. I didnt go through any hierarchy of layers, i just took the shortest path between the apps! In other words: in a GUI the output controls the design, for command-line tools the design controls the output. It is no wonder Unix always had its problems with creating good GUIs that are efficient to humans. A good GUI works like the human brain, and the human brain does not mind 'layering violations' when that gets it a more efficient I guess you are talking to the wrong person as i actually have implemented ls functionality in the kernel, using async IO concepts and extreme threading ;-) It was a bit crazy, but was also the fastest FTP server ever running on this planet. Ingo --
Modularization is needed when a project exceeds the average developer's capacity. For kvm, it is logical to separate privileged cpu virtualization, from guest virtualization, from host management, from Nope. You copied text from one application into the clipboard (or selection, or PRIMARY, or whatever ) and pasted text from the clipboard to another application. If firefox and your editor had to interact directly, all would be lost. See - there was a global (for the session) third party, and it wasn't The problem is that only developers are involved, not people who understand human-computer interaction (in many cases, not human-human interaction either). Another problem is that a good GUI takes a lot of work so you need a lot of committed resources. A third problem is that it isn't a lot of fun, at least not the 20% of the work that take 800% of the time. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
On Mon, 2010-03-22 at 21:21 +0100, Ingo Molnar wrote: Yes. Foreword: I assume with "GUI" you mean "a user interface for the classical desktop user with next to no interest in learning details or basics". That doesn't mean the classical desktop user is silly, stupid or No, it's the very same mechanism. But you just have to start at the correct point. In the kernel/device driver world, you start at the device. And in the GUI world, you better start at the GUI (and not some kernel ACK, because you have to make the GUI understandable to the intended users. If that means "hiding 90% of all possibilities and features", you just hide them. Of course, the user of such a UI is quite limited and doesn't use much of the functionality - because s/he can't access it through the GUI - (but presenting 100% - or even 40% - doesn't help either as s/he won't ACK, because the user in this case (which is most of the time a developer, sys-admin, or similar techie) *wants* a 1:1 picture of the underlying model because s/he already *knows* the underlying model (and ACK. The cliché-Unix-person doesn't come from the "GUI world". So most If this is the case, the layering/structure/design of the GUI is (very) badly defined/chosen (for whatever reason). [ Most probably because some seasoned software developer designed the GUI-app *without* designing (and testing!) the GUI (or more to the point: the look - how does it look like - and feel - how does it behave, what are the possible workflows, ... - of it) first. ] Bernd -- Bernd Petrovitsch Email : bernd@petrovitsch.priv.at LUGA : http://www.luga.at --
Integration is hard, requires a wider set of technical skills and getting good test coverage becomes more difficult. But I agree that it is worth the effort, kvm could reap large rewards from putting a greater emphasis on integration (ala vbox) - no matter how it is achieved (cowardly not taking sides on implementation decisions like repository locations). --
+1 The obvious reason why so many people still use shell scripts rather than libvirt is that it just doesn't provide what they need. Every time I've looked at it (and I've been looking for a better solution for many years), it seems that it would have provided most of the things I needed, but the remaining bits were unsolvable. Shell scripts can be ugly, but you get total control. --
If you happen to remember what missing features prevented you choosing libvirt, that would be invaluable information for us, to see if there are quick wins that will help out. We got very useful feedback when recently asking people this same question http://rwmj.wordpress.com/2010/01/07/quick-quiz-what-stops-you-from-using-libvirt/ Allowing arbitrary passthrough of QEMU commands/args will solve some of these issues, but certainly far from solving all of them. eg guest cut+ paste, host side control of guest screen resolution, easier x509/TLS configuration for remote management, soft reboot, Windows desktop support for virt-manager, host network interface management/setup, etc Regards, Daniel -- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :| --
Which has pretty much the same problems to the ${HOME}/.qemu/qmp/ solution,
Erm, but i'm talking about a dead tool here. There's a world of a difference
between 'kvm top' not showing new entries (because the guest is dead), and
'perf kvm top' hanging due to Qemu hanging.
So it's essentially 4 out of 4. Yet your reply isnt "Ingo you are right" but
We dont do that for robust system instrumentation, for heaven's sake!
By your argument it would be perfectly fine to implement /proc purely via
Really, this is getting outright ridiculous. You agree with me that Anthony
suggested a technically inferior solution, yet you even seem to be proud of it
and are joking about it?
And _you_ are complaining about lkml-style hard-talk discussions?
Thanks,
Ingo
--
It doesn't follow. The libvirt daemon could/should own guests from all users. I don't know if it does so now, but nothing is preventing it My reply is "you are right" (phrased earlier as "I don't like it either" meaning I agree with your dislike). One of your criticisms was invalid, If qemu fails, you lose your guest. If libvirt forgets about a guest, you can't do anything with it any more. These are more serious problems than 'perf kvm' not working. Qemu and libvirt have to be robust anyway, we can rely on them. Like we have to rely on init, X, sshd, and a I would have preferred /proc to be implemented via syscalls called directly from tools, and good tools written to expose the information in it. When computers were slower 'top' would spend tons of time opening and closing all those tiny files and parsing them. Of course the kernel In every Linux system userspace is doing or proxying much of the enumeration and resource management. So if enumerating guests in There is a difference between joking and insulting people. I enjoy jokes but I dislike being insulted. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
It's hard for me to argue against a hypothetical implementation, but all user-space driven solutions for resource enumeration i've seen so far had I think you didnt understand my point. I am talking about 'perf kvm top' hanging if Qemu hangs. With a proper in-kernel enumeration the kernel would always guarantee the functionality, even if the vcpu does not make progress (i.e. it's "hung"). With this implemented in Qemu we lose that kind of robustness guarantee. And especially during development (when developers use instrumentation the most) it is important to have robust instrumentation that does not hang along How on earth can you justify a bug ("perf kvm top" hanging) by saying that there are other bugs as well? Basically you are arguing the equivalent that a gdb session would be fine to become unresponsive if the debugged task hangs. Fortunately ptrace is kernel-based and it never 'hangs' if the user-space process hangs somewhere. This is an essential property of good instrumentation. So the enumeration method you suggested is a poor, sub-par solution, simple We can still profile any of those tools without the profiler breaking if the (Then you'll be pleased to hear that perf has enabled exactly that, and that we are working towards that precise usecase.) Ingo --
Use non-blocking I/O, report that guest as dead. No point in profiling If qemu has a bug in the resource enumeration code, you can't profile one guest. If the kernel has a bug in the resource enumeration code, It's nice not to have kernel oopses either. So when code can be in There's no reason for 'perf kvm top' to hang if some process is not Are you exporting /proc/pid data via the perf syscall? If so, I think that's a good move. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
Erm, at what point do i decide that a guest is 'dead' versus 'just lagged due to lots of IO' ? Also, do you realize that you increase complexity (the use of non-blocking IO), just to protect against something that wouldnt happen if the right This is really simple code, not rocket science. If there's a bug in it we'll fix it. On the other hand a 500KLOC+ piece of Qemu code has lots of places to hang, so that is a large cross section. Ingo --
qemu shouldn't block due to I/O (it does now, but there is work to fix it). Of course it could be swapping or other things. Pick a timeout, everything we do has timeouts these days. It's the price we pay for protection: if you put something where a failure can't hurt you, you have to be prepared for failure, and you might have false alarms. Is it so horrible for 'perf kvm top'? No user data loss will happen, surely? On the other hand, if it's in the kernel and it fails, you will lose It's a tradeoff. Increasing the kernel code size vs. increasing The kernel has tons of very simple code (and some very complex code as well), and tons of -stable updates as well. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
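A minimal sketch of the "pick a timeout, report that guest as dead" approach discussed above, in Python. The function name `probe_guest`, the `ping` request and the 0.5 second default are assumptions made for illustration; they are not an interface of qemu or perf.

```python
import socket


def probe_guest(sock, timeout=0.5):
    """Send a request and wait at most `timeout` seconds for a reply.

    Returns the reply bytes, or None if the peer stays silent, has closed
    the socket, or fails in some other way. A caller such as a monitoring
    tool would then list that guest as dead instead of blocking on it.
    """
    sock.settimeout(timeout)
    try:
        sock.sendall(b"ping\n")   # hypothetical request format
        reply = sock.recv(64)
    except OSError:               # socket.timeout is a subclass of OSError
        return None
    return reply or None          # b"" means the peer closed the connection


if __name__ == "__main__":
    tool, guest = socket.socketpair()
    guest.sendall(b"pong\n")      # a responsive peer
    print(probe_guest(tool))
    print(probe_guest(tool, 0.1)) # nothing answers this time
```

The cost raised in the thread is visible here: a guest that is merely slow for longer than `timeout` is indistinguishable from a hung one, so the timeout value is a policy decision with false alarms either way.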
Not all KVM vcpus are running operating systems. Transitive had a product that was using a KVM context to run their binary translator which allowed them full access to the host process's virtual address space range. In this case, there is no kernel and there are no devices. That's what I mean by a guest being a userspace context. KVM simply provides a new CPU mode to userspace in the same way that vm8086 mode does. Regards, --
And your point is that such vcpus should be excluded from profiling just because they fall outside the Qemu/libvirt umbrella? That is a ridiculous position. Ingo --
You don't instrument it the way you'd instrument an operating system so no, you don't want it to show up in perf kvm top. Regards, --
Erm, why not? It's executing a virtualized CPU, so sure it makes sense to allow the profiling of it! It might even not be the weird case you mentioned by some competing virtualization project to Qemu ... So your argument is wrong on several technical levels, sorry. Thanks, Ingo --
It may not make sense to have symbol tables for it, for example it isn't generated from source code but from binary code for another architecture. Of course, just showing addresses is fine, but you don't need qemu for that. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
Non-guest vcpus will not be able to provide Linux-style symbols. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
And why do you say that it makes no sense to profile them? Also, why do you define 'guest vcpus' to be 'Qemu started guest vcpus'? If some other KVM using project (which you encouraged just a few mails ago) starts a vcpu we still want to be able to profile them. Ingo --
It makes sense to profile them, but you don't need to contact their Maybe it should provide a mechanism for libvirt to list it. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
If your position basically boils down to, we can't trust userspace and we can always trust the kernel, I want to eliminate any userspace path, then I can't really help you out. I believe we can come up with an infrastructure that satisfies your actual requirements within qemu but if you're also insisting upon the above implementation detail then there's nothing I can do. Regards, Anthony Liguori --
Why would you want to 'help me out'? I can tell a good solution from a bad one just fine. You should instead read the long list of disadvantages above, invert them and list them as advantages for the kernel-based vcpu enumeration solution, apply common sense and go admit to yourself that indeed in this situation a kernel provided enumeration of vcpu contexts is the most robust solution. It's really as simple as that :-) Thanks, Ingo --
You are basically making a kernel implementation a requirement, instead Having qemu enumerate guests one way or another is not a good idea IMO since it is focused on one guest and doesn't have a system-wide entity. A userspace system-wide entity will work just as well as kernel implementation, without its disadvantages. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
A system-wide user-space entity only solves one problem out of the 4 i listed, still leaving the other 3: - Those special files can get corrupted, mis-setup, get out of sync, or can be hard to discover. - Apps might start KVM vcpu instances without adhering to the system-wide access method. - There is no guarantee for the system-wide process to reply to a request - while the kernel can always guarantee an enumeration result. I dont want 'perf kvm' to hang or misbehave just because the system-wide entity has hung. Really, i think i have to give up and not try to convince you guys about this anymore - i dont think you are arguing constructively anymore and i dont want yet another pointless flamewar about this. Please consider 'perf kvm' scrapped indefinitely, due to lack of robust KVM instrumentation features: due to lack of robust+universal vcpu/guest enumeration and due to lack of robust+universal symbol access on the KVM side. It was a really promising feature IMO and i invested two days of arguments into it trying to find a workable solution, but it was not to be. Thanks, Ingo --
That's a hard requirement anyway. If it happens, we get massive data loss. Way more troubling than 'perf kvm top' doesn't work. So consider Then you don't get their symbol tables. That happens anyway if the symbol server is not installed, not running, handing out fake data. So When you press a key there is no guarantee no component along the way I am not going to push libvirt or a subset thereof into the kernel for 'perf kvm'. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
There always needs to be a system wide entity. There are two ways to enumerate instances from that system wide entity. You can centralize the creation of instances and thereby maintain a list of current instances. You can also allow instances to be created in a decentralized manner and provide a standard mechanism for instances to register themselves with the system wide entity. IOW, it's the difference between asking libvirtd to exec(qemu) vs allowing a user to exec(qemu) and having qemu connect to a well known unix domain socket for libvirt to tell libvirtd that it exists. The latter approach has a number of advantages. libvirt already supports both models. The former is the '/system' uri and the latter is the '/session' uri. What I'm proposing, is to use the host file system as the system wide entity instead of libvirtd. libvirtd can monitor the host file system to participate in these activities but ultimately, moving this functionality out of libvirtd means that it becomes the standard mechanism for all qemu instances regardless of how they're launched. Regards, --
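To make the decentralized model concrete, here is a rough Python sketch of instances registering themselves in a well-known directory, with a reader that skips stale entries. Everything in it is hypothetical: the directory layout, the one-file-per-pid convention and the function names are assumptions for illustration, and plain files stand in for the unix domain sockets mentioned above.

```python
import os
import tempfile


def register(registry_dir, pid):
    """Create an entry named after the instance's pid (plain file here)."""
    path = os.path.join(registry_dir, str(pid))
    with open(path, "w"):
        pass
    return path


def pid_alive(pid):
    """POSIX liveness check: signal 0 tests for existence, sends nothing."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True   # the process exists but belongs to another user
    return True


def list_instances(registry_dir, alive=pid_alive):
    """Return the pids of registered instances that are still running.

    Entries left behind by an abnormal exit are ignored rather than
    trusted, since nothing cleans them up automatically.
    """
    pids = []
    for name in os.listdir(registry_dir):
        if name.isdigit() and alive(int(name)):
            pids.append(int(name))
    return sorted(pids)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as registry:
        register(registry, os.getpid())
        print(list_instances(registry))
```

The reader has to do the liveness filtering itself, which is one of the robustness costs of a filesystem-based registry compared with a daemon or kernel that tracks instance lifetime directly.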
I don't like dropping sockets into the host filesystem, especially as they won't be cleaned up on abnormal exit. I also think this breaks our 'mechanism, not policy' policy. Someone may want to do something weird with qemu that doesn't work well with this. We could allow starting monitors from the global configuration file, so a distribution can do this if it wants, but I don't think we should do this ourselves by default. -- error compiling committee.c: too many arguments to function --
The approach I've taken (which I accidentally committed and reverted) was to set this up as the default qmp device much like we have a default monitor device. A user is capable of overriding this by manually I've looked at making default devices globally configurable. We'll get there but I think that's orthogonal to setting up a useful default qmp device. Regards, Anthony Liguori --
I think the latter is exactly what I would want for myself. I do see the advantages of having a central instance, but I really don't want to bother with libvirt configuration files or even GUIs just to get an ad-hoc VM up when I can simply run "qemu -hda hd.img -m 1024". Let alone that I usually want to have full control over qemu, including monitor access and small details available as command line options. I know that I'm not the average user with these requirements, but still I am one user and do have these requirements. If I could just install libvirt, continue using qemu as I always did and libvirt picked my VMs up for things like global enumeration, that would be more or less the optimal thing for me. Kevin --
+1 And it would also make it more likely that users like us would convert to libvirt in the long run, by providing an easy and integrated transition path. I've had another look at libvirt, and one of the things that is holding me back is the cost of moving existing scripts to libvirt. If it could just pick up what I have (at least in part), then I don't have to. --
And this system wide entity is the kvm module. It creates instances of 'struct kvm' and destroys them. I see no problem if we just attach a name to every instance with a good default value like kvm0, kvm1 ... or guest0, guest1 ... User-space can override the name if it wants. The kvm module takes care of the names being unique. This is very much the same as network card numbering is implemented in the kernel. Forcing perf to talk to qemu or even libvirt produces too much overhead imho. Instrumentation only produces useful results with low overhead. Joerg --
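The naming scheme proposed above (default names kvm0, kvm1, ..., an optional override, uniqueness enforced in one place) can be sketched in a few lines of Python. This is only a model of the allocation policy; the class and method names are invented for illustration and nothing here reflects actual kvm module code.

```python
class GuestNames:
    """Hand out unique guest names, lowest free index first."""

    def __init__(self, prefix="kvm"):
        self.prefix = prefix
        self.in_use = {}          # name -> owner (e.g. the creating pid)

    def allocate(self, owner, requested=None):
        """Return a unique name; `requested` overrides the default."""
        if requested is not None:
            if requested in self.in_use:
                raise ValueError("name already taken: " + requested)
            name = requested
        else:
            index = 0
            while self.prefix + str(index) in self.in_use:
                index += 1
            name = self.prefix + str(index)
        self.in_use[name] = owner
        return name

    def release(self, name):
        """Free a name when its guest is destroyed."""
        self.in_use.pop(name, None)


if __name__ == "__main__":
    names = GuestNames()
    print(names.allocate(owner=101))
    print(names.allocate(owner=102))
    print(names.allocate(owner=103, requested="MyGuest"))
```

With a single global table, a second request for "MyGuest" fails regardless of who asks, which is the situation the "two users can't have a guest named MyGuest each?" objection points at.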
So, two users can't have a guest named MyGuest each? What about namespace support? There's a lot of work in virtualizing all kernel namespaces, you're adding to that. What about notifications when guests It's a setup cost only. -- error compiling committee.c: too many arguments to function --
This enumeration is a very small and non-intrusive feature. Making it
Who would be the consumer of such notifications? A 'perf kvm list' can
My statement was not limited to enumeration, I should have been more
clear about that. The guest filesystem access-channel is another
affected part. The 'perf kvm top' command will access the guest
filesystem regularly and going over qemu would be more overhead here.
Providing this in the KVM module directly also has the benefit that it
would work out-of-the-box with different userspaces too. Or do we want
to limit 'perf kvm' to the libvirt-qemu-kvm software stack?
Sidenote: I really think we should come to a conclusion about the
concept. KVM integration into perf is a very useful feature to
analyze virtualization workloads.
Thanks,
Joerg
--
I always start my things with bare kvm, It would be very unwelcome to mandate libvirt, or for that matter running a particular userspace in the guest. --
an outsider's comment: this path leads to a filesystem... which could be a very nice idea. it could have a directory for each VM, with pseudo-files with all the guest's status, and even the memory it's using. perf could simply watch those files. in fact, such a filesystem could be the main userlevel/kernel interface. but i'm sure such a layout was considered (and rejected) very early in the KVM design. i don't think there's anything new to make it more desirable than it was back then. -- Javier --
It's easier (and safer and all the other boring bits) not to do it at System-wide monitoring needs to work equally well for guests started before or after the monitor. Even disregarding that, if you introduce an API, people will start using it and complaining if it's incomplete. Why? Also, the real cost would be accessing the filesystem, not copying Other userspaces can also provide this functionality, like they have to provide disk, network, and display emulation. The kernel is not a huge Agreed. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
For the KVM stack it doesn't matter where it is implemented. It is as
easy in qemu or libvirt as in the kernel. I also don't see big risks. On
the perf side and for its users it is a lot easier to have this in the
kernel.
I for example always use plain qemu when running kvm guests and never
used libvirt. The only central entity I have here is the kvm kernel
Could be easily done using notifier chains already in the kernel.
There is nothing wrong with that. We only need to define what this API
should be used for to prevent rank growth. It could be an
When measuring cache-misses any additional (and in this case
This has nothing to do with a library. It is about entity and resource
management which is what os kernels are about. The virtual machine is
the entity (similar to a process) and we want to add additional access
channels and names to it.
Joerg
--
You can always provide the kernel and module paths as command line parameters. It just won't be transparently usable, but if you're using If we make an API, I'd like it to be generally useful. It's a total headache. For example, we'd need security module hooks to determine access permissions. So far we managed to avoid that since kvm doesn't allow you to access any information beyond what you provided it Copying the objects is a one time cost. If you run perf for more than a second or two, it would fetch and cache all of the data. It's really kvm.ko has only a small subset of the information that is used to define a guest. -- error compiling committee.c: too many arguments to function --
I don't want the tool for myself only. A typical perf user expects that Not necessarily. The perf event is configured to measure systemwide kvm by userspace. The kernel side of perf takes care that it stays system-wide even with added vm instances. So in this case the consumer for the notifier would be the perf kernel part. No userspace interface Thats hard to do at this point since we don't know what people will use it for. We should keep it simple in the beginning and add new features Depends on how it is designed. A filesystem approach was already mentioned. We could create /sys/kvm/ for example to expose information about virtual machines to userspace. This would not require any new I don't think we can cache filesystem data of a running guest on the If two userspaces run in parallel what is the single instance where perf The subset is not small. It contains all guest vcpus, the complete interrupt routing hardware emulation and manages even the guest's memory. Joerg --
Someone needs to know about the new guest to fetch its symbols. Or do IMO this use case is too rare to warrant its own API, especially as there Who would set the security context on those files? Plus, we need cgroup I don't see any choice. The guest can change its symbols at any time It doesn't contain most of the mmio and pio address space. Integration with qemu would allow perf to tell us that the guest is hitting the interrupt status register of a virtio-blk device in pci slot 5 (the information is already available through the kvm_mmio trace event, but only qemu can decode it). -- error compiling committee.c: too many arguments to function --
Someone who uses libvirt and virt-manager by default is probably not interested in this feature at the same level a kvm developer is. And developers tend not to use libvirt for low-level kvm development. A number of developers have stated in this thread already that they would appreciate a solution for guest enumeration that would not involve The samples will be tagged with the guest-name (and some additional information perf needs). Perf userspace can access the symbols then An approach like: "The files are owned and only readable by the same user that started the vm." might be a good start. So a user can measure cgroup support is an issue but we can solve that too. Its in general Yeah that would be interesting information. But it is more related to tracing than to pmu measurements. The information which you mentioned above are probably better captured by an extension of trace-events to userspace. Joerg --
So would I. But when I weigh the benefit of truly transparent system-wide perf integration for users who don't use libvirt but do use perf, versus the cost of transforming kvm from a single-process API to a system-wide API with all the complications that I've listed, it comes out in favour of not adding the API. I take that as a yes? So we need a virtio-serial client in the kernel (which might be exploitable by a malicious guest if buggy) and a That's not how sVirt works. sVirt isolates a user's VMs from each other, so if a guest breaks into qemu it can't break into other guests owned by the same user. The users who need this API (!libvirt and perf) probably don't care It's a tradeoff. IMO, going through qemu is the better way, and also It's all related. You start with perf, see a problem with mmio, call up a histogram of mmio or interrupts or whatever, then zoom in on the misbehaving device. -- error compiling committee.c: too many arguments to function --
Its not a transformation, its an extension. The current per-process /dev/kvm stays mostly untouched. Its all about having something like this:

    $ cd /sys/kvm/guest0
    $ ls -l
    -r-------- 1 root root 0 2009-08-17 12:05 name
    dr-x------ 1 root root 0 2009-08-17 12:05 fs
    $ cat name
    guest0
    $ # ...

What I meant was: perf-kernel puts the guest-name into every sample and perf-userspace accesses /sys/kvm/guest_name/fs/ later to resolve the symbols. I leave the question of how the guest-fs is exposed to the host If a vm breaks into qemu it can access the host file system which is the bigger problem. In this case there is no isolation anymore. From that context it can even kill other VMs of the same user independent of a Yes, but its different from the implementation point-of-view. For the user it surely all plays together. Joerg --
How I see it: perf-kernel puts the guest pid into every sample, and perf-userspace uses that to resolve to a mountpoint served by fuse, or It cannot. sVirt labels the disk image and other files qemu needs with the appropriate label, and everything else is off limits. Even if you We need qemu to cooperate for mmio tracing, and we can cooperate with qemu for symbol resolution. If it prevents adding another kernel API, that's a win from my POV. -- error compiling committee.c: too many arguments to function --
I am not tied to /sys/kvm. We could also use /proc/<pid>/kvm/ for example. This would keep anything in the process space (except for the We need a bit more information than just the qemu-pid, but yes, this Thats true. Probably qemu can inject this information in the kvm-trace-events stream. Joerg --
How about ~/.qemu/guests/$pid? -- error compiling committee.c: too many arguments to function --
That makes it hard for perf to find it and even harder to get a list of all VMs. With /proc/<pid>/kvm/guest we could symlink all guest directories to /proc/kvm/ and perf reads the list from there. Also perf can easily derive the directory for a guest from its pid. Last but not least its kernel-created and thus independent from the userspace part being used. Joerg --
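As an illustration of how a tool could consume the proposed layout, the Python sketch below reads a directory of pid-named symlinks and maps each pid to the guest directory it points at. The layout is the hypothetical one from this message (/proc/kvm/ with links into /proc/<pid>/kvm/guest); no such kernel interface is assumed to exist, so the directory is passed in as a parameter and the demo builds a stand-in tree in a temporary directory.

```python
import os
import tempfile


def list_guests(listing_dir):
    """Map pid -> resolved guest directory for every pid-named symlink."""
    guests = {}
    for name in sorted(os.listdir(listing_dir)):
        link = os.path.join(listing_dir, name)
        if name.isdigit() and os.path.islink(link):
            guests[int(name)] = os.path.realpath(link)
    return guests


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        # Stand-ins for /proc/<pid>/kvm/guest and /proc/kvm/.
        guest_dir = os.path.join(root, "guest-of-1234")
        listing = os.path.join(root, "kvm")
        os.mkdir(guest_dir)
        os.mkdir(listing)
        os.symlink(guest_dir, os.path.join(listing, "1234"))
        print(list_guests(listing))
```

The point made above is that both directions are cheap: one `listdir` gives the full list, and a known pid gives the directory without any user or home-directory lookup.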
Doesn't perf already have a dependency on naming conventions for finding debug information? -- error compiling committee.c: too many arguments to function --
Not so trivial and even more likely to break. Even if perf has the pid of the process and wants to find the directory it has to do:

  1. Get the uid of the process
  2. Find the username for the uid
  3. Use the username to find the home-directory

Steps 2. and 3. need nsswitch and/or pam access to get this information from whatever source the admin has configured. And depending on what the source is it may be temporarily unavailable causing nasty timeouts. In short, there are many weak parts in that chain making it more likely to break. A kernel-based approach with /proc/<pid>/kvm does not have those issues (and to repeat myself, it is independent from the userspace being used). Joerg --
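The three-step chain above looks roughly like this in Python. It is a sketch under stated assumptions: the `~/.qemu/guests/$pid` location is the one suggested earlier in the thread, the function names are invented, and taking the owner of `/proc/<pid>` as "the uid of the process" is a simplification (for non-dumpable processes that directory is owned by root).

```python
import os
import pwd


def uid_of_pid(pid):
    """Step 1: owner of /proc/<pid> (Linux-specific)."""
    return os.stat("/proc/%d" % pid).st_uid


def guest_dir(pid, uid_of=uid_of_pid, passwd_lookup=pwd.getpwuid):
    """Resolve pid -> uid -> passwd entry -> home -> guest directory.

    Returns None when the uid has no passwd entry, which is the kind of
    failure described above (chroots, unavailable directory services).
    """
    uid = uid_of(pid)
    try:
        entry = passwd_lookup(uid)   # steps 2 and 3 go through NSS
    except KeyError:
        return None
    return os.path.join(entry.pw_dir, ".qemu", "guests", str(pid))


if __name__ == "__main__":
    print(guest_dir(os.getpid()))
```

`pwd.getpwuid` can block for as long as the configured name service takes to answer, so a tool using this chain inherits the timeouts of LDAP, NIS or whatever backs the passwd database.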
It's true. If the kernel provides something, there are fewer things that can break. But if your system is so broken that you can't resolve uids, fix that before running perf. Must we design perf for that case? After all, 'ls -l' will break under the same circumstances. It's hard It has other issues, which are IMO more problematic. -- error compiling committee.c: too many arguments to function --
Also, perf itself will hang if it needs to access a file using autofs or nfs, and those are broken. -- error compiling committee.c: too many arguments to function --
uid to username can fail when using chroots, or worse point to an incorrect location (and yes, I do use this) Sorry if this has been covered / discussion has moved on. Just catching up with the 500+ messages in my inbox.. --
It looks at several places, from most symbol rich (/usr/lib/debug/, aka -debuginfo packages, where we have full symtabs) to poorest (the packaged binary, where we may just have a .dynsym). In an ideal world, it would just get the build-id (a SHA1 cookie that is in an ELF section inserted in every binary (aka DSOs), kernel module, kallsyms or vmlinux file) and use that to look first in a local cache (implemented in perf for a long time already) or in some symbol server. For instance, for a random perf.data file I collected here in my machine I have:
[acme@doppio linux-2.6-tip]$ perf buildid-list | grep libpthread
5c68f7afeb33309c78037e374b0deee84dd441f6 /lib64/libpthread-2.10.2.so
[acme@doppio linux-2.6-tip]$
So I don't have to access /lib64/libpthread-2.10.2.so directly, nor some convention to get a debuginfo in a local file like:
/usr/lib/debug/lib64/libpthread-2.10.2.so.debug
Instead the tools look at:
[acme@doppio linux-2.6-tip]$ l ~/.debug/.build-id/5c/68f7afeb33309c78037e374b0deee84dd441f6
lrwxrwxrwx 1 acme acme 73 2010-01-06 18:53 /home/acme/.debug/.build-id/5c/68f7afeb33309c78037e374b0deee84dd441f6 -> ../../lib64/libpthread-2.10.2.so/5c68f7afeb33309c78037e374b0deee84dd441f6*
To find the file for that specific build-id, not the one installed in my machine (or on the different machine, of a different architecture) that may be completely unrelated, a new one, or one for a different arch. - Arnaldo --
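The cache layout shown in the listing above (first two hex digits of the build-id as a directory, the remainder as the link name) can be sketched like this. The helper name buildid_cache_path is hypothetical and the default cache directory is simply the one visible in the example; this is not perf's own code.

```python
import os

def buildid_cache_path(build_id, cache_dir="~/.debug"):
    """Map a hex build-id to its link under <cache_dir>/.build-id/.

    Mirrors the layout in the listing above: <first 2 hex chars>/<rest>.
    """
    build_id = build_id.lower()
    return os.path.join(
        os.path.expanduser(cache_dir), ".build-id", build_id[:2], build_id[2:]
    )
```

With the libpthread build-id from the example and cache_dir="/home/acme/.debug", this yields the same path that the `l` command lists.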
Thanks. I believe qemu could easily act as a symbol server for this use case. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
Agreed, but it doesn't even have to :-) We just need to get the build-id in the PERF_RECORD_MMAP event somehow and then get this symbol from elsewhere, say the same DVD/RHN channel/Debian Repository/embedded developer toolkit image not stripped/whatever. Or it may already be in the local cache from last week's perf report session :-) - Arnaldo --
I spent a couple of days investigating why sshfs/fuse doesn't work well with procfs and sysfs. Just as my patch against fuse was almost ready, I found that fuse already supports such access through direct I/O. With the parameter -o direct_io, it works well. Here is an example that mounts / from a guest os:
#sshfs -p 5551 -o direct_io localhost:/ guestmount
We can read and write files if the permissions are ok. I will go ahead to support multiple guest os instance statistics parsing. Yanmin --
No it can't. With sVirt every single VM has a custom security label and the policy only allows it access to disks / files with a matching label, and prevents it attacking any other VMs or processes on the host. This confines the scope of any exploit in QEMU to those resources the admin has explicitly assigned to the guest. Regards, Daniel -- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :| --
Even better. So a guest which breaks out can't even access its own /sys/kvm/ directory. Perfect, it doesn't need that access anyway. Joerg --
But what security label does that directory have? How can we make sure that whoever needs access to those files, gets them? Automatically created objects don't work well with that model. They're simply missing information. -- error compiling committee.c: too many arguments to function --
If we go the /proc/<pid>/kvm way then the directory should probably inherit the label from /proc/<pid>/? Same could be applied to /sys/kvm/guest/ if we decide for it. The VM is still bound to a single process with a /proc/<pid> after all. Joerg --
That's a security policy. The security people like their policies outside the kernel. For example, they may want a label that allows a trace context to read Ditto. -- error compiling committee.c: too many arguments to function --
Hm, I am not a security expert. But is this not only one entity more for sVirt to handle? I would leave that decision to the sVirt developers. Does attaching the same label as for the VM resources mean that root could not access it anymore? Joerg --
IIUC processes run under a context, and there's a policy somewhere that tells you which context can access which label (and with what permissions). There was a server on the Internet once that gave you root access and invited you to attack it. No idea if anyone succeeded or not (I got bored after about a minute). So it depends on the policy. If you attach the same label, that means all files with the same label have the same access permissions. I think. -- error compiling committee.c: too many arguments to function --
So if this is true we can introduce a 'trace' label and add all contexts that should be allowed to trace to it. But we probably should leave the details to the security experts ;-) Joerg --
That's just what I want to do. Leave it in userspace and then they can deal with it without telling us about it. -- error compiling committee.c: too many arguments to function --
They can't do that with a directory in /proc? --
I don't know. -- error compiling committee.c: too many arguments to function --
I'd much prefer a pid like suggested later, keeps the samples smaller. But that said, we need guest kernel events like mmap and context switches too, otherwise we simply can't make sense of guest userspace addresses, we need to know the guest address space layout. So aside from a filesystem content, we first need mmap and context switch events to find the files we need to access. And while I appreciate all the security talk, it's basically pointless anyway, the host can access it anyway, everybody agrees on that, but still you're arguing the case.. --
This only works for the guest kernel, we don't know anything about guest root can access anything, but we're not talking about root. The idea is to protect against a guest that has exploited its qemu and is now attacking the host and its fellow guests. uid protection is no good since we want to isolate the guest from host processes belonging to the same uid and from other guests running under the same uid. [1] We can find out guest pids if we teach the kernel what to dereference, i.e. gs:offset1->offset2->offset3. Of course this varies from kernel to kernel, so we need some kind of bytecode that we can run in perf nmi context. Kind of what we need to run an unwinder for -fomit-frame-pointer. -- error compiling committee.c: too many arguments to function --
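The dereference chain in footnote [1] (gs:offset1->offset2->offset3) could be expressed as data and walked by a tiny interpreter. The sketch below only illustrates that idea in user space; follow_chain, read_word and the offsets are hypothetical, and nothing here reflects an existing perf or kernel interface.

```python
def follow_chain(read_word, base, offsets):
    """Walk base+offsets[0] -> +offsets[1] -> ... and return the last word read.

    `read_word(addr)` is a caller-supplied function that returns the
    pointer-sized value stored at guest address `addr` (an assumption of
    this sketch; a real implementation would read guest memory).
    """
    addr = base
    value = None
    for off in offsets:
        value = read_word(addr + off)  # load the next pointer (or final field)
        addr = value                   # and continue from where it points
    return value
```

For example, with a dict standing in for guest memory, follow_chain(memory.__getitem__, gs_base, [off1, off2, off3]) returns the word found at the end of the chain, which in the scenario above would be the guest pid.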
With the filesystem approach all we need is the pid of the guest process. Then we can access proc/<pid>/maps of the guest and read out the address space layout, no? Joerg --
No, what if it maps new things after you read it? But still getting the pid of the guest process seems non trivial without guest kernel support. --
How about we add a virtio "guest file system access" device? The guest would then expose its own file system using that device. On the host side this would simply be a -virtioguestfs unix:/tmp/guest.fs and you'd get a unix socket that gives you full access to the guest file system by using commands. I envision something like: SEND: GET /proc/version RECV: Linux version 2.6.27.37-0.1-default (geeko@buildhost) (gcc version 4.3.2 [gcc-4_3-branch revision 141291] (SUSE Linux) ) #1 SMP 2009-10-15 14:56:58 +0200 Now all we need is integration in perf to enumerate virtual machines based on libvirt. If you want to run qemu-kvm directly, just go with --guestfs=/tmp/guest.fs and perf could fetch all required information automatically. This should solve all issues while staying 100% in user space, right? Alex --
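The SEND/RECV exchange above can be sketched as a one-request-per-connection protocol. Everything here is hypothetical: the proposal does not define a wire format, so the newline-terminated "GET <path>" request, the close-to-end-reply convention, and the names serve_one_request, guest_get and demo are assumptions made for illustration. A socketpair stands in for the unix socket that qemu would expose.

```python
import socket
import threading

def serve_one_request(conn, files):
    """Toy guest-side agent: answer one 'GET <path>' line, then close."""
    with conn, conn.makefile("r") as reader:
        verb, _, path = reader.readline().strip().partition(" ")
        # Unknown verbs or paths get an empty reply in this sketch.
        reply = files.get(path, "") if verb == "GET" else ""
        conn.sendall(reply.encode())

def guest_get(conn, path):
    """Host side: send one GET and read the reply until the peer closes."""
    with conn:
        conn.sendall(f"GET {path}\n".encode())
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode()

def demo():
    # socketpair() stands in for the unix socket qemu would expose.
    host_end, guest_end = socket.socketpair()
    fake_guest = {"/proc/version": "Linux version 2.6.27.37-0.1-default\n"}
    agent = threading.Thread(target=serve_one_request, args=(guest_end, fake_guest))
    agent.start()
    try:
        return guest_get(host_end, "/proc/version")
    finally:
        agent.join()
```

A tool on the host would call guest_get() on a socket connected to the guest's endpoint; demo() wires both ends together in one process so the exchange can be tried without a guest.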
The idea is to use a dedicated channel over virtio-serial. If the Yeah, needs a fuse filesystem to populate the host namespace (kind of sshfs over virtio-serial). -- error compiling committee.c: too many arguments to function --
The file server being a kernel module inside the guest? We want to be able to serve things as early and hassle free as possible, so in this I don't see why we need a fuse filesystem. We can of course create one later on. But for now all you need is a user connecting to that socket. Alex --
No, just a daemon. If it's important enough we can get distributions to package it by default, and then it will be hassle free. If "early enough" is also so important, we can get it to start up on initrd. If If the perf app knows the protocol, no problem. But leave perf with pure filesystem access and hide the details in fuse. -- error compiling committee.c: too many arguments to function --
Agreed. I especially would like to see instruction/branch tracing working this way. This would give a lot of the benefits of a simulator on a real CPU. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
If you're profiling a single guest it makes more sense to do this from inside the guest - you can profile userspace as well as the kernel. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
I'm interested in debugging the guest without guest cooperation. In many cases qemu's new gdb stub works for that, but in some cases I would prefer instruction/branch traces over standard gdb style debugging. I used to use that very successfully with simulators in the past for some hard bugs. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
Isn't gdb supposed to be able to use branch traces? It makes sense to expose them via the gdb stub then. Not to say an external tool doesn't make sense. -- error compiling committee.c: too many arguments to function --
AFAIK not. The ptrace interface is only used by idb I believe. I might be wrong on that. Not sure if there is even a remote protocol command for branch traces either. There's a concept of "tracepoints" in the protocol, but it Ok that would work for me too. As long as I can set start/stop triggers and pipe the log somewhere it's fine for me. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
Sigh, why am I drawn into this. A person who uses dishonest arguments is a dishonest person. When you say I use a dishonest argument you are implying I am dishonest. Why do you argue with me at all if you think I am trying to cheat? If you disagree with me, tell me I am wrong, not dishonest (or that my arguments are dishonest). And this is just one example in this thread. Seriously, tools/kvm would cause a loss of developers, not a gain, simply because of the style of argument of some people on this list. Maybe qemu/kernels is a better idea. Again, if you want to talk to me, use the same language you'd like to hear yourself. Or maybe years of lkml made you so thick skinned you no longer understand how to interact with people. -- error compiling committee.c: too many arguments to function --
That's not how i understood that phrase - and i did not mean to suggest that you are dishonest and i do not think that you are dishonest (to the contrary). Thanks, Ingo --
Word games. -- error compiling committee.c: too many arguments to function --
This third category is pretty well served by virt-manager. It has its quirks and shortcomings, but at least it exists. Paolo --
If that is the theory then it has failed to trickle through in practice. As you know i have reported a long list of usability problems with hardly a look. That list could be created by pretty much anyone spending a few minutes of getting a first impression with qemu-kvm. So something is seriously wrong in KVM land, to pretty much anyone trying it for the first time. I have explained how i see the root cause of that, while you seem to suggest that there's nothing wrong to begin with. I guess we'll have to agree to disagree on that. Thanks, Ingo --
I think the point you're missing is that your list was from the perspective of someone looking at a desktop virtualization solution that was graphically oriented. As Avi has repeatedly mentioned, so far, that has not been the target audience of QEMU. The target audience tends to be: 1) people looking to do server virtualization and 2) people looking to do command line based development. Usually, both (1) and (2) are working on machines that are remotely located. What's important to these users is that VMs be easily launchable from the command line, that there is a lot of flexibility in defining machine types, and that there is a programmatic way to interact with a given instance of QEMU. Those are the things that we've been focusing on recently. The reason we don't have better desktop virtualization support is simple. No one is volunteering to do it and no company is funding development for it. When you look at something like VirtualBox, what you're looking at is a long ago forked version of QEMU with a GUI added focusing on desktop virtualization. There is no magic behind adding a better, more usable GUI to QEMU. It just takes resources. I understand that you're trying to make the point that without catering to the desktop virtualization use case, we won't get as many developers as we could. Personally, I don't think that argument is accurate. If you look at VirtualBox, its performance is terrible. Having a nice GUI hasn't gotten them the type of developers that can improve their performance. No one is arguing that we wouldn't like to have a nicer UI. I'd love to merge any patch like that. Regards, Anthony Liguori --
Can you transfer your list to the following wiki page: http://wiki.qemu.org/Features/Usability This thread is so large that I can't find your note that contained the initial list. I want to make sure this input doesn't die once this thread settles down. Regards, Anthony Liguori --
It does happen in practice, just not in the GUI areas, since no one is working on them. I am not going to condition a qcow2 reliability fix to Not anyone trying it for the first time. RHEV-M users will see a polished GUI that can be used to manage thousands of guests and hosts. I presume IBM and Siemens (and all other contributors) users will also enjoy a good user experience with their respective products. Qemu is not the only GUI for kvm. So far only one company was interested in a qemu GUI - the makers of virtualbox. Unfortunately they chose not to contribute that back to qemu, and no one was sufficiently motivated to pick out the bits and try to merge them. Again, if you are interested in a qemu GUI, you either have to write it yourself or convince someone else to do it. My own plate is full and my priorities are clear. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
Using DRM doesn't help very much. You still need an X driver and most of the operations you care about (video rendering, window movement, etc) are not operations that need to go through DRM. 3D graphics virtualization is extremely difficult in the non-passthrough case. It really requires hardware support that isn't widely available It doesn't provide the things we need to a good user experience. You need things like an absolute input device, host driven display resize, RGBA hardware cursors. None of these go through DRI and it's those I don't know why you keep saying this. The people who are in these "separate communities" keep claiming that they don't feel this way. I'm not just saying this to be argumentative. Many of the people in the community have thought this same thing, and tried it themselves, and we've all come to the same conclusion. It's certainly possible that we just missed the obvious thing to do but If this is true, please demonstrate it. Prove your point with patches Nah, instead we can just have a few hundred mail thread on the list. Otherwise we'd have to write patches and do other kinds of productive work. Regards, --
For the full-screen case (which is a very common mode of using a guest OS on the desktop) there's not much of window management needed. You need to With KMS the display resize is in the kernel. Cursor management is not. Yet: i think it would be a nice feature as the cursor could move even if Xorg is If you are not two separate communities but one community, then why do you go through the (somewhat masochistic) self-punishing exercise of keeping the project in two different pieces? In a distant past Qemu was a separate project and KVM was just a newcomer who used it for fancy stuff. Today as you say(?) the two communities are one and I'm not aware of anyone in the past having attempted to move qemu to tools/kvm/ in the upstream kernel repo, and having reported on the experiences with such a contribution setup. (obviously it's not possible at all without heavy cooperation and acceptance from you and Avi, so this will probably remain a thought experiment forever) If then you must refer to previous attempts to 'strip down' Qemu, right? Those attempts didnt really solve the fundamental problem of project code base separation. Ingo --
Implementing a virtualized DRM/KMS driver would at least get you the framebuffer interface more or less for free, while allowing you to deal with the userspace side of things incrementally (ie, running a dummy xorg on top of the virtualized fbdev until the DRI side catches up). It would None of these things negate the benefit one would get from a virtualized DRM/KMS driver either. There are multiple problems that need solving in this area, and it's a bit disingenuous to discount a valid suggestion out of hand due to the fact it doesn't solve all of the outstanding issues. --
Guys, have a look at Gallium. In many ways it's a pile of crap, but at least it's a pile of crap designed by vmware for *exactly* your problem space. OG. --
Or perhaps Chromium, which was designed years ago and can pass-through OpenGL commands via a pipe. VirtualBox uses it for their PV drivers. Naturally it is not a FB, just a OpenGL command pass-through interface. --
Why does Linux AIO still suck? Why do we not have a proper interface in userspace for doing asynchronous file system operations? Why don't we have an interface in userspace to do zero-copy transmit and receive of raw network packets? The lack of a decent userspace API for asynchronous file system operations is a huge usability problem for us. Take a look at the complexity of our -drive option. It's all because the kernel gives us sucky interfaces. Regards, Anthony Liguori --
I think you're increasing the height of that wall by arguing that a userspace project cannot be successful because its development process sucks and the only way to fix it is to put it into the kernel where people know so much better. Instead we kernel developers should listen to requirements from users, even if their code isn't in tools/. -- error compiling committee.c: too many arguments to function --
No, it's tearing down that wall because finally, instead of providing rather abstract system calls that are designed perfectly, the kernel can operate by providing useful libraries and apps. At least on the context i've worked on it has torn down walls and has improved the efficiency of working on ABIs towards user-space. (sysprof is an example of that) Kernel developers are finally faced with user-space development directly, in the same repository, using the same rules of contribution. Non-kernel-hosted apps win from that process too, as even if they dont integrate (because they dont want to or cannot for license reasons) they can participate in a more direct (and more practical) exchange with kernel developers. They can contribute a new system call and create a library function for it straight away. Ingo --
Good that you mention it, i think it's an excellent example.
The suckage of kernel async IO is for similar reasons: there's an ugly package
separation problem between the kernel and between glibc - and between the apps
that would make use of it.
( With the separated libaio it was made worse: there were 3 libraries to
work with, and even less applications that could make use of it ... )
So IMO klibc is an arguably good idea - eventually hpa will get around to posting
it for upstream merging again. Then we could offer both new libraries much
faster, and could offer things like comprehensive AIO used pervasively within
If you had your bits in tools/kvm/ you could make a strong case for a good
kaio implementation - coupled with an actual, working use-case. ( You could
use the raw syscall even without klibc. )
We could see the arguments on lkml turn from:
'do we want this and it will take years to propagate this into apps'
into something like:
' Exactly how much faster does kvm go? and I'd get it straight away with my
next kernel update tomorrow? Wow! '
Ok, i exaggerated a bit - but you get the idea. It's a much different picture
when kernel developers and maintainers see an actual use-case, _right in the
kernel repo they work with every day_.
Currently there's a wall between kernel developers and user-space developers,
and there's somewhat of an element of fear and arrogance on both sides. For
efficient technology such walls need to be torn down and people need a bit more
experience with each other's areas.
Ingo
--
And why wouldn't the kernel developers produce posix-aio within klibc. posix-aio is also a really terrible interface (although not as bad as linux-aio). The reason boils down to the fact that these interfaces are designed without interacting with the consumers. Part of the reason for that is the attitude of the community. You approached this discussion with, "QEMU/KVM sucks, you should move into the kernel because we're awesome and we'd fix everything in a heart beat". That attitude does not result in any useful collaboration. Had you started trying to understand what the problems that we face are and whether there's anything that can be done in the kernel to improve it, it would have been an entirely different discussion. The sad thing is, QEMU is probably one of the most demanding free software applications out there today with respect to performance. We consume IO interfaces and things like large pages in a deeper way than just about any application out there. We've been trying for years to improve Linux interfaces but we've not had many people in the kernel community be receptive. We've failed to improve the userspace networking interfaces. Compare Rusty's posting of vringfd to vhost-net. They are the same interface except we tried to do something more generally useful with vringfd and it was shot down because it was "yet another kernel/userspace data transfer interface". Unfortunately, we're learning that if we claim something is virtualization specific, we avoid a lot of the kernel bureaucracy. My concern is that over time, we'll have more things like vhost and that's bad for everyone. Regards, Anthony Liguori --
No, kernel async IO sucks because it still does not play well with buffered I/O. Last time I checked (about a year ago or so), AIO syscall latencies were much worse when buffered I/O was used compared to direct I/O. Unfortunately, to achieve good performance with direct I/O, you need a HW RAID card with lots of on-board cache. Gabor --
Ingo, what you miss is that this is not a bad thing. Fact of the matter is, it's not just painful, it downright sucks. This is actually a Good Thing (tm). It means you have to get your feature and its interfaces well defined and able to version forwards and backwards independently from each other. And that introduces some complexity and time and testing, but in the end it's what you want. You don't introduce a requirement to have the feature, but take advantage of it if it is there. It may take everyone else a couple years to upgrade the compilers, tools, libraries and kernel, and by that time any bugs introduced by interacting with this feature will have been ironed out and their patterns well known. If you haven't well defined and carefully thought out the feature ahead of time, you end up creating a giant mess, possibly the need for nasty backwards compatibility (case in point: COMPAT_VDSO). But in the end, you would have made those same mistakes on your internal tree anyway, and then you (or likely, some other hapless project maintainer for the project you forked) would have to go add the features, fixes and workarounds back to the original project(s). However, since you developed in an insulated sheltered environment, those fixes and workarounds would not be robust and independently versionable from each other. The result is you've kept your codebase version-neutral, forked in outside code, enhanced it, and left the hard work of backporting those changes and keeping them version-safe to the original package maintainers you forked from. What you've created is no longer a single project, it is called a distro, and you're being short-sighted and anti-social to think you can garner more support than all of those individual packages you forked. This is why most developers work upstream and let the goodness propagate down from the top like molten sugar of each granular package on a flan where it is collected from the rich custard ...
Our experience is the opposite, and we tried both variants and report about our experience with both models honestly. You only have experience about one variant - the one you advocate. Sorry, but this is plain not true. The 2.4->2.6 kernel cycle debacle has taught us that waiting long to 'iron out' the details has the following effects: - developer pain - user pain - distro pain - disconnect - loss of developers, testers and users - grave bugs discovered months (years ...) down the line - untested features - developer exhaustion It didnt work, trust me - and i've been around long enough to have suffered through the whole 2.5.x misery. Some of our worst ABIs come from that cycle as well. So we first created the 2.6.x process, then as we saw that it worked much better we _sped up_ the kernel development process some more, to what many claimed was an impossible, crazy pace: two weeks merge window, 2.5 months stabilization and a stable release every 3 months. And you can also see the countless examples of carefully drafted, well thought out, committee written computer standards that were honed for years, which are not worth the paper they are written on. 'extra time' and 'extra bureaucratic overhead to think things through' is about the worst thing you can inject into a development process. You should think about the human brain as a cache - the 'closer' things are both in time and physically, the better they end up being. Also, the more gradual, the more concentrated a thing is, the better it works out in general. This is part of the basic human nature. Sorry, but i really think you are really trying to rationalize a disadvantage here ... Ingo --
You're talking about a single project and comparing it to my argument about multiple independent projects. In that case, I see no point in the discussion. If you want to win the argument by strawman, you are This could very well be true, but until someone comes forward with compelling numbers (as in, developers committed to working on the project, number of patches and total amount of code contribution), there is no point in having an argument, there really isn't anything to discuss other than opinion. My opinion is you need a really strong justification to have a successful fork and I don't see that justification. Zach --
The kernel is a very complex project with many ABI issues, so all those arguments apply to it as well. The description you gave: | This is actually a Good Thing (tm). It means you have to get your feature | and its interfaces well defined and able to version forwards and backwards | independently from each other. And that introduces some complexity and | time and testing, but in the end it's what you want. You don't introduce a | requirement to have the feature, but take advantage of it if it is there. matches the kernel too. We have many such situations. (Furthermore, the tools/perf/ situation, which relates to ABIs and user-space/kernel-space interactions is similar as well.) I can give you rough numbers for tools/perf - if that counts for you. For the first four months of its existence, when it was a separate project, i had a single external contributor IIRC. The moment it went into the kernel repo the number of contributors and contributions skyrocketed and basically all contributions were top-notch. We are at 60+ separate contributors now (after about 8 months upstream) - which is still small compared to the kernel or to Qemu, but huge for a relatively isolated project like instrumentation. So in my estimation tools/kvm/ would certainly be popular. Whether it would be more popular than current Qemu is hard to tell - it would be pure speculation. Any reliable numbers for the other aspect, whether a split project creates a more fragile and less developed ABI would be extremely hard to get. I believe it to be true, but that's my opinion based on my experience with other projects, extrapolated to KVM/Qemu. Anyway, the issue is moot as there's clear opposition to the unification idea. Too bad - there was heavy initial opposition to the arch/x86 unification as well [and heavy opposition to tools/perf/ as well], still both worked out extremely well :-) Ingo --
Did you forget that arch/x86 was a merging of a code fork that happened several years previously? Maybe that fork shouldn't have been done to begin with. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
We discussed and probably timidly tried to share the sharable initially but we realized it was too time wasteful. In addition to having to adapt the code to 64bit we would also have had to constantly solve another problem on top of it (see the various split on _32/_64, those take time to achieve, maybe not huge time but still definitely some time and effort). Even in retrospect I am quite sure the way x86-64 happened was optimal and if we would go back we would do it again the exact same way even if the final object was to have a common arch/x86 (and thankfully Linus is flexible and smart enough to realize that code that isn't risking to destabilize anything shouldn't be forced out just because it's not to a totally theoretical-perfect-nitpicking-clean-state yet). It's still a lot of work to do the unification later as a separate task, but it's not like if we did it immediately it would have been a lot less work. It's about the same amount of effort and we were able to defer it for later and decrease the time to market which surely has contributed to the success of x86-64. Problem of qemu is not some lack of GUI or that it's not included in the linux kernel git tree, the definitive problem is how to merge qemu-kvm/kvm and qlx into it. If you (Avi) were the qemu maintainer I am sure there wouldn't be two trees so as a developer I would totally love it, and I am sure that with you as maintainer it would have a chance to move forward with qlx on desktop virtualization without proposing to extend vnc instead to achieve a "similar" result (imagine if btrfs is published on a website and people start to discuss if it should ever be merged because reinventing some part of btrfs inside ext5 might achieve ""similar"" results). About a GUI for KVM to use on desktop distributions, that is an irrelevant concern compared to the lack of protocol more efficient than rdesktop/rdp/vnc for desktop virtualization.
I've people asking me to migrate hundreds of desktops to desktop virtualization on ...
In hindsight decisions are much easier. I agree it was less risky to fork than to share. But if another instruction set forks out a 64-bit not-exactly-compatible variant, I'm sure we'll start out shared and not The qemu/qemu-kvm fork is definitely hurting. Some history: when kvm started out I pulled qemu for fast hacking and, much like arch/x86_64, I couldn't destabilize qemu for something that was completely experimental (and closed source at the time). Moreover, it wasn't clear if the qemu community would be interested. The qemu-kvm fork was designed for minimal intrusion so I could merge upstream qemu regularly. This resulted in kvm integration that was fairly ugly. Later Anthony merged a well-integrated alternative implementation (in retrospect this was a mistake IMO - we were left with a well tested high performing ugly implementation and a clean, slow, untested, and unfeatured implementation, and no one who wants to merge the two). So now it is pretty confusing to read the code which has the Anyone can focus on what interests them, if someone has an interest in a good desktop-on-desktop experience they should start hacking and sending patches. -- error compiling committee.c: too many arguments to function --
To the contrary, experience shows that repository location, and in particular a shared repository for closely related bits, is very much material! It matters because when there are two separate projects, even a "serious developer" finds it doubly and triply difficult to contribute even trivial changes. It becomes literally a nightmare if you have to touch 3 packages: kernel, a library and an app codebase. It takes _forever_ to get anything useful done. Also, 'focus on a single thing' is a very basic aspect of humans, especially those who do computer programming. Working on two code bases in two repositories at once can be very challenging physically and psychically. So what i've seen is that OSS programmers tend to pick a side, pretty much randomly, and then rationalize in hindsight why they prefer that side ;-) Most of them become either a kernel developer or a user-space package developer - and then they specialize in that field and shy away from changes that involve both. It's a basic human thing to avoid the hassle that comes with multi-package changes. (One really has to be outright stupid, fanatic or desperate to even attempt such changes these days - such are the difficulties for a comparatively low return.) The solution is to tear down such artificial walls of contribution where possible. And tearing down the wall between KVM and qemu-kvm seems very much possible and the advantages would be numerous. Unless by "serious developer" you meant: "developer willing to [or forced to] Then you'll be surprised to hear that it's happening as we speak and the commits are there in linux-2.6.git. Both a TUI and a GUI are in the works. Furthermore, the numbers show that half of the usability fixes to tools/perf/ came not from regular perf contributors but from random kernel developers and testers who when they build the latest kernel and try out perf at the same time (it's very easy because you already have it in the kernel repository - no ...
You can't be serious. I find that the difficulty in contributing a patch has mostly to do with writing the patch, and less with figuring Indeed, working simultaneously on two different projects is difficult. I usually work for a while on one, and then 'cd', physically and psychically, to the other. Then switch back. Sort of like the We have a large number of such stupid, fanatic, desperate developers in By "serious developer" I mean - someone who is interested in contributing, not in getting their name into the kernel commits list - someone who is willing to read the wiki page and find out where the repository and mailing list for a project is - someone who will spend enough time on the project so that the time to clone two repositories will not be a factor in their contributions - someone who will work on the uncool stuff like fixing bugs and Let's wait and see then. If the tools/perf/ experience has really good results, we can reconsider this at a later date. -- error compiling committee.c: too many arguments to function --
My own experience, and that of everyone i've talked to about feature contribution (developers and distro people), tells the exact opposite: it's much harder to contribute features to multiple packages than to a single project. kernel+library+app features take forever to propagate, there's constant fear of version friction, productization deadlines are uncertain, and ABI messups are frequent as well due to disjoint testing. Also, each component has essential veto power: so if the proposed API or approach is opposed or changed at a later stage then that affects (sometimes already committed) changes. If you've ever done it you'll know how tedious it is. This very thread and recent threads about KVM usability demonstrate the same complications. Thanks, Ingo --
I'm not going to argue about the Qemu merging here. But your above assessment is incomplete. It is not because developers don't want to clone two different trees that tools/perf is a success. Or maybe it's a factor, but I suspect it to be very minimal. I can script git commands if needed. It is actually because both kernel and user side are I think it has already really good results. --
This argues that co-evolution of an interface is easiest on the developers if they own both sides of that interface. No quarrel. This does not argue that the preservation of a stable ABI is best done this way. If anything, it makes it too easy to change both the provider and the preferred user of the interface without noticing unintentional breakage to forlorn out-of-your-tree clients. - FChE --
Your concern is valid, and this issue has been raised in the past as one of the main counter-arguments against tools/perf/. (there was a big flamewar about it on lkml when it was introduced) Our roughly 1 year experience with perf is that, somewhat paradoxically, this scheme not only works as well as classic ABI schemes but actually brings a _better_ ABI than the classic "let the kernel define an ABI" single-sided solution. I know the difference first hand, i've written various syscall ABIs in the past 10+ years before perf and know how they interact with their user space counterparts. Why did it work out better with tools/perf/? It turns out that there's an immediate, direct, actionable test feedback effect on the ABI, and much closer relation to the ABI. Typically the same developer implements the kernel bits and the user-space bits (because it's so easy to do co-development), so the ABI aspects are ingrained in the developer much more deeply. Once you see the kind of havoc ABI breakage can cause during development you avoid it in the future. So developers find that a good, stable ABI helps development. It turns out that developers don't actually _want_ to break the ABI and are careful about it - and having the app next to the kernel ABI and co-developing it makes sure there's never any true mismatch. Also, we can do ABI improvements at a far higher rate than any other kernel subsystem. I checked the git logs, we've done over three dozen ABI extensions since the first version, and all were forwards _and_ backwards compatible. A higher rate of change gives developers more experience and lets them do a better ABI, and makes them more ABI-conscious. I think if all kernel ABIs had such a healthy rate of change we'd fill in all the missing kernel features very quickly. With detached packages ABI features are often done by a kernel developer (who is familiar with the kernel subsystem in question) and a separate user-space developer (who is ...
Ingo, What made KVM so successful was that the core kernel of the hypervisor was designed the right way, as a kernel module where it belonged. It was obvious to anyone who had been exposed to the main competition at the time, Xen, that this was the right approach. What has ended up killing Xen in the end is the not-invented-here approach of copying everything over, reformatting it, and rewriting half of it, which made it impossible to maintain and support as a single codebase. At my previous employer we ended up dropping all Xen efforts exactly because it was like maintaining two separate operating system kernels. The key to KVM Well there are two ways to go about this. Either you base the KVM userland on top of an existing project, like QEMU, _or_ you rewrite it all from scratch. However, there is far more to it than just a couple of ioctls, for example the stack of reverse device-drivers is a pretty significant code base, rewriting that and maintaining it is not a trivial task. It is certainly my belief that the benefit we get from sharing that with QEMU by far outweighs the cost of forking it and keeping our own fork in the kernel tree. In fact it would result in With this you have just thrown away all the benefits of having the QEMU repository shared with other developers who will actively fix bugs in Now that would be interesting, next we'll have to include things like libxml in the kernel git tree as well, to make sure libvirt doesn't get So far your argument would justify pulling all of gdb into the kernel git tree as well, to support the kgdb efforts, or gcc so we can get rid of the gcc version quirks in the kernel header files, e2fsprogs and equivalent for _all_ file systems included in the kernel so we can make sure our fs tools never get out of sync with whats supported in the The user components for perf vs oprofile are _tiny_ projects compared to the portions of QEMU that are actually used by KVM. Oh and you completely forgot SeaBIOS. KVM+QEMU rely ...
Yes, exactly. Yes. Please realize that what is behind it is a strikingly simple argument: Btw., i made similar arguments to Avi about 3 years ago when it was going upstream, that qemu should be unified with KVM. This is more true today than I do not suggest forking Qemu at all, i suggest using the most natural My experience as an external observer of the end result contradicts this. Seemingly trivial usability changes to the KVM+Qemu combo are not being done often because they involve cross-discipline changes. ( _In this very thread_ there has been a somewhat self-defeating argument by Anthony that multi-package scenario would 'significantly complicate' matters. What more proof do we need to state the obvious? Keeping what has become one piece of technology over the years in two separate halves is The way we have gone about this in tools/perf/ is similar to the route picked by Git: we only use very lowlevel libraries available everywhere, and we provide optional wrappers to the rest. We are also using the kernel's libraries so we rarely need to go outside to get some functionality. I.e. it's a non-issue in practice and despite perf having an (optional) dependency on xmlto and docbook we don't include those packages nor do we force gdb and gcc are clearly extrinsic to the kernel so why would we move them there? I was talking about tools that are closely related to the kernel - where much of the development and actual use is in combination with the Linux kernel. 90%+ of the Qemu usecases are combined with Linux. (Yes, i know that you can run Qemu without KVM, and no, i don't think it matters in the grand scheme of things and most investment into Qemu comes from the KVM angle these days. In particular it for sure does not justify handicapping future KVM evolution so I know the size and scope of Qemu, i even hacked it - still my points remain. SeaBIOS is in essence a firmware, so it could either be loaded as such. Just look ...
That's a very glorified statement but it's not reality, sorry. You can do that with something like perf because it's so small and development of If you are not suggesting to fork QEMU, what are you suggesting then? You don't seriously expect that the KVM community will be able to mandate that the QEMU community switch to the Linux kernel repository? That would be like telling the openssl developers that they should merge with glibc and start working out of the glibc tree. What you are suggesting is *only* going to happen if we fork QEMU, there is zero chance to move the main QEMU repository into the Linux kernel tree. And trust me, you don't want to have Linus having to deal with You still haven't explained how you expect to create a unified KVM+QEMU What I have seen you complain about here is the lack of a good end user GUI for KVM. However that is a different thing. So far no vendor has put significant effort into it, but that is nothing new in Linux. We have a great kernel, but our user applications are still lacking. We have 217 CD players for GNOME, but we have no usable calendaring application. A good GUI for virtualization is a big task, and whoever designs it will base their design upon their preferences for what's important. A lot of spare time developers would clearly care most about a GUI installation and fancy icons to click on, whereas server users would be much more interested in automation and remote access to the systems. For a good example of an incomplete solution, try installing Fedora over a serial line, you cannot do half the things without launching VNC :( Getting a comprehensive solution for this that would satisfy the bulk of the users would be a huge chunk of code in the kernel tree. Imagine the screaming that would result. How often have we not had the moaning from x86 users who wanted to rip out all the non x86 code to reduce the size of Did you ever look at what libvirt actually does and what it offers?
Or how about the various libraries ...
I was not talking about just perf: i am also talking about the arch/x86/ unification which is 200+ KLOC of highly non-trivial kernel code with hundreds of contributors and with 8000+ commits in the past two years. Also, it applies to perf as well: people said exactly that a year ago: 'perf has it easy to be clean as it is small, once it gets as large as Oprofile tooling it will be in the same messy situation'. Today perf has more features than Oprofile, has a larger and more complex code base, has more contributors, and no, it's not in the same messy situation at all. So whatever you think of large, unified projects, you are quite clearly mistaken. I have done and maintained through two different types of unifications and the experience was very similar: both developers and users (and maintainers) are much better off. Ingo --
Sorry but you cannot compare merging two chunks of kernel code that originated from the same base, with the efforts of mixing a large Both perf and oprofile are still relatively small projects in comparison You believe that I am wrong in my assessment of unified projects, and I obviously think you are mistaken and underestimating the cost and effects of trying to merge the two. Well I think we are just going to agree to disagree on this one. I am not against merging projects where it makes sense, but in this particular case I am strongly convinced the loss would be much greater than the gain. Cheers, Jes --
That's true to a certain degree, but combined with the perf experience it's all rather clear. Similar arguments were made against the x86 unification and against perf. Similar arguments were made against KVM and in favor of Xen years ago - back when few of you knew about it ;-) These are all repeating patterns in my experience. You could fairly contrast that with a _failed_ unification perhaps - but i'm not aware of any such failed unification. (please educate me if you are) The thing is, unifications are rare in the OSS space not because they dont make sense technically (to the contrary), they are rare due to blind inertia (why change if we managed to muddle through with the current scheme?) and to a certain degree due to the egos involved ;-) As such we have a proliferation of packages in Linux, and we'd be much better off in a more focused fashion. And whenever i see that in the kernel's context So is your argument that the unification does not make sense due to size? I wish you said that based on first hand negative experience with unifications, not based on just pure speculation. (and yes, i speculate too, but at least with some basis) Ingo --
As I have stated repeatedly in this discussion, a unification would hurt the QEMU development process because it would alienate a large number of QEMU developers who are *not* Linux kernel users. You still haven't given us a *single* example of unification of something that wasn't purely linked to the Linux kernel. perf/oprofile is 100% linked to the Linux kernel, QEMU is not. I wish you would actually look at what users use QEMU for. As long as you continue to purely speculate on this, to use your own words, your arguments are not holding up. And you are not being consistent either. You have conveniently continued to ignore my questions about why the file system tools are not to be merged into the Linux kernel source tree. Jes --
I took a quick look at the qemu.git log and more than half of all recent contributions came from Linux distributors. So without KVM Qemu would be a much, much smaller project. It would be similar The stats show that the huge increase in Qemu contributions over the past few years was mainly due to KVM. Do you claim it wasn't? What other projects make Sorry, i didn't comment on it because the answer is obvious: the file system tools and pretty much any Linux-exclusive tool (such as udev) should be moved there. The difference is that there's not much active development done in most of those tools so the benefits are probably marginal. Both Qemu and KVM are being developed very actively though, so development model inefficiencies show up. Anyway, i didn't think i'd step into such a hornet's nest by explaining what i see as KVM's biggest weakness today and how i suggest it to be fixed :-) If you don't agree with me, then don't do it - no need to get emotional about it. Thanks, Ingo --
I don't know what you're looking at, but in the past month, there have been 56 unique contributors, with 411 changesets. I count 16 people employed I'm not saying that KVM isn't significant. I'm employed to work on QEMU because of KVM. I'm just saying that KVM users aren't 99% of the community and that we can't neglect the rest of the community. Regards, Anthony Liguori --
Hi there, not really trying to get into the CC list of this discussion ;) but for what it's worth I'd like to share my opinion on the matter. Full agreement with that. CVS/git/patches and development model are next to irrelevant compared to the basic design of the code. qemu (and especially qemu-kvm) is surely much closer to perf than to a firefox or openoffice, because there is some tight interconnect with the kernel API. And the skills required to produce useful patches in qemu are similar to the skills required to produce useful patches for the kernel; more often than not a new feature in kvm also requires some merging of a qemu-kvm side patch (it always happened to me so far ;). But clearly we have to draw a barrier somewhere, and while I could see things like systemtap and util-linux included into the kernel, and perf already is, I have a hard time seeing userland code supporting kernels other than linux go into the kernel. I think that's probably where I'd draw the line. Let's say somebody creates a pure paravirt userland for kvm without full driver emulation that only runs on a linux kernel and no other OS, maybe that thing wouldn't be so controversial to include into the kernel as qemu is. qemu is clearly beyond the "only-running-on-a-linux-kernel" barrier... I'd definitely start with systemtap, which I think is even more suitable than perf to be merged into the kernel. Things useful only for developers like perf/systemtap make even more sense to fetch silently hidden in a single pull. Those projects are so ideal to fetch together because you run your own compiled userland binary and not an rpm, and you need the very latest kernel and userland package, and sometimes new userland might not work so well with an older kernel and the other way around. They're tools for developers and no developer cares about API as they rebuild latest userland code anyway, they almost It also boils down to the maintainer, where the code is, defines the maintainer who pushes/commits it to the ...
Ok. Then apply this to the kernel. I'm then happy to take patches. Regards, Anthony Liguori
QEMU is about 600k LOC. We have a mechanism to compile out portions of the code but a lot of things are tied together in an intimate way. In the long run, we're working on adding stronger interfaces such that we can split components out into libraries that are consumable by other applications. Simply forking the device model won't work. Well more than half of our contributors are not coming from KVM developers/users. If you just fork the device models, you start to lose a ton of fixes (look at Xen and VirtualBox). So feel free to either 1) apply my previous patch and then start working on a "clean (and minimal)" QEMU or 2) wait to commit my previous patch and start sending patches to clean up QEMU. Absolutely none of this is going to give you a VirtualBox-like GUI for QEMU. Regards, Anthony Liguori --
Since we want to implement a pmu usable for the guest anyway, why don't we just use the guest's perf to get all the information we want? If we get a pmu-nmi from the guest we just re-inject it to the guest, and perf in the guest gives us all the information we want, including kernel and userspace symbols, stack traces, and so on. In the previous thread we discussed a direct trace channel between guest and host kernel (which can be used for ftrace events for example). This channel could be used to transport this information to the host kernel. The only additional feature needed is a way for the host to start a perf instance in the guest. Opinions? Joerg --
I guess this aims to get information from old environments running on Interesting! I know the people who are trying to do that with systemtap. # ssh localguest perf record --host-channel ... ? B-) -- Masami Hiramatsu e-mail: mhiramat@redhat.com --
Look at the previous posting of this patch, this is something new and rather unique. The main power in the 'perf kvm' kind of instrumentation is to profile _both_ the host and the guest on the host, using the same tool (often using the same kernel) and using similar workloads, and do profile comparisons using 'perf diff'. Note that KVM's in-kernel design makes it easy to offer this kind of host/guest shared implementation that Yanmin has created. Other virtualization solutions with a poorer design (for example where the hypervisor code base is split away from the guest implementation) will have it much harder to create something similar. That kind of integrated approach can result in very interesting finds straight away, see: http://lkml.indiana.edu/hypermail/linux/kernel/1003.0/00613.html ( the profile there demos the need for spinlock accelerators for example - there's clearly asymmetrically large overhead in guest spinlock code. Guess how much else we'll be able to find with a full 'perf kvm' implementation. ) One of the main goals of a virtualization implementation is to eliminate as many performance differences to the host kernel as possible. From the first day KVM was released the overriding question from users was always: 'how much slower is it than native, and which workloads are hit worst, and why, and could you pretty please speed up important workload XYZ'. 'perf kvm' helps exactly that kind of development workflow. Note that with oprofile you can already do separate guest space and host space profiling (with the timer driven fallback in the guest). One idea with 'perf kvm' is to change that paradigm of forced separation and forced duplication and to support the workflow that most developers employ: use the host space for development and unify instrumentation in an intuitive framework. Yanmin's 'perf kvm' patch is a very good step towards that goal. Anyway ... look at the patches, try them and see it for yourself. Back in the ...
With the patch, 'perf kvm report --sort pid' could show summary statistics for all guest os instances. Then, use Right, but there is a scope between kvm_guest_enter and really running in guest os, where a perf event might overflow. Anyway, the scope is very Right. I discussed with Yangsheng. I will move the above data structures and callbacks to file arch/x86/kvm/x86.c, and add get_ip, a new callback to kvm_x86_ops. Yanmin --
Sorry. I found that --pid currently refers not to a process but to a thread (the main thread). Ingo, Is it possible to support a new parameter or extend --inherit, so 'perf record' and 'perf top' could collect data on all threads of a process when the process is running? If not, I need to add a new ugly parameter which is similar to --pid to filter out process data in userspace. Yanmin --
That seems like a worthwhile addition regardless of this thread. Profile all current threads and any new ones. It probably makes sense to call this --pid and rename the existing --pid to --thread. -- error compiling committee.c: too many arguments to function --
Yeah. For maximum utility i'd suggest to extend --pid to include this, and introduce --tid for the previous, limited-to-a-single-task functionality. Most users would expect --pid to work like a 'late attach' - i.e. to work like strace -f or like a gdb attach. Ingo --
Thanks Ingo, Avi. I worked out the below patch against tip/master of March 15th.
Subject: [PATCH] Change perf's parameter --pid to process-wide collection
From: Zhang, Yanmin <yanmin_zhang@linux.intel.com>
Change parameter -p (--pid) to real process pid and add -t (--tid) meaning thread id. Now, --pid means perf collects the statistics of all threads of the process, while --tid means perf just collects the statistics of that thread.
BTW, the patch fixes a bug of 'perf stat -p'. 'perf stat' always configures attr->disabled=1 if it isn't a system-wide collection. If there is a '-p' and no forks, 'perf stat -p' doesn't collect any data. In addition, the while(!done) in run_perf_stat consumes 100% of a single cpu, which has a bad impact on the running workload. I added a sleep(1) in the loop.
Signed-off-by: Zhang Yanmin <yanmin_zhang@linux.intel.com>
---
diff -Nraup linux-2.6_tipmaster0315/tools/perf/builtin-record.c linux-2.6_tipmaster0315_perfpid/tools/perf/builtin-record.c
--- linux-2.6_tipmaster0315/tools/perf/builtin-record.c 2010-03-16 08:59:54.896488489 +0800
+++ linux-2.6_tipmaster0315_perfpid/tools/perf/builtin-record.c 2010-03-17 16:30:17.755551706 +0800
@@ -27,7 +27,7 @@
 #include <unistd.h>
 #include <sched.h>
-static int fd[MAX_NR_CPUS][MAX_COUNTERS];
+static int *fd[MAX_NR_CPUS][MAX_COUNTERS];
 static long default_interval = 0;
@@ -43,6 +43,9 @@ static int raw_samples = 0;
 static int system_wide = 0;
 static int profile_cpu = -1;
 static pid_t target_pid = -1;
+static pid_t target_tid = -1;
+static int *all_tids = NULL;
+static int thread_num = 0;
 static pid_t child_pid = -1;
 static int inherit = 1;
 static int force = 0;
@@ -60,7 +63,7 @@ static struct timeval this_read;
 static u64 bytes_written = 0;
-static struct pollfd event_array[MAX_NR_CPUS * MAX_COUNTERS];
+static struct pollfd *event_array;
 static int nr_poll = ...
Ingo, Sorry, the patch has bugs. I need to do a better job and will work out 2 separate patches for the 2 issues. Yanmin --
I worked out 3 new patches against tip/master tree of Mar. 17th.
1) Patch perf_stat: Fix the issue that perf doesn't enable counters when target_pid != -1. Change the condition for fork/exec of a subcommand: if there is a subcommand parameter, perf always forks and execs it. The usage example is:
#perf stat -a sleep 10
So this command could collect statistics for 10 seconds precisely. The user still could stop it by CTRL+C.
2) Patch perf_record: Fix the issue that when perf forks/execs a subcommand, it should enable all counters once the new process is exec'ing. Change the condition for fork/exec of a subcommand: if there is a subcommand parameter, perf always forks and execs it. The usage example is:
#perf record -f -a sleep 10
So this command could collect statistics for 10 seconds precisely. The user still could stop it by CTRL+C.
3) perf_pid: Change parameter --pid to process-wide collection. Add --tid which means collecting thread-wide statistics. Usage example is:
#perf top -p 8888
#perf record -p 8888 -f sleep 10
#perf stat -p 8888 -f sleep 10
Arnaldo, Pls. apply the 3 attached patches. Yanmin
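The process-wide --pid collection described in 3) needs the list of thread IDs that belong to the target process. The following is only a minimal userspace sketch of one way to obtain that list, by reading /proc/<pid>/task on Linux. The helper name list_threads() is hypothetical and is not taken from the attached patches; the real patch may do this differently.

```c
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not from the patch): collect the thread IDs of
 * process `pid` by listing /proc/<pid>/task. Returns the number of
 * threads found and stores a malloc'ed array in *tids (caller frees),
 * or returns -1 on error.
 */
static int list_threads(pid_t pid, pid_t **tids)
{
	char path[64];
	struct dirent *entry;
	pid_t *list = NULL;
	int count = 0, capacity = 0;
	DIR *dir;

	snprintf(path, sizeof(path), "/proc/%d/task", (int)pid);
	dir = opendir(path);
	if (!dir)
		return -1;

	while ((entry = readdir(dir)) != NULL) {
		char *end;
		long tid = strtol(entry->d_name, &end, 10);

		/* Skip "." and ".." and anything that is not a plain number. */
		if (end == entry->d_name || *end != '\0')
			continue;

		/* Grow the result array geometrically when it is full. */
		if (count == capacity) {
			int new_capacity = capacity ? capacity * 2 : 16;
			pid_t *tmp = realloc(list, new_capacity * sizeof(*list));

			if (!tmp) {
				free(list);
				closedir(dir);
				return -1;
			}
			list = tmp;
			capacity = new_capacity;
		}
		list[count++] = (pid_t)tid;
	}

	closedir(dir);
	*tids = list;
	return count;
}
```

A caller would open one counter per returned thread ID. Note that threads created after the directory has been read are missed by such a snapshot, which is presumably why the thread also discusses relying on inheritance for new threads.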
Cool! Mind sending them as a series of patches instead of attachments? That makes it easier to review them. Also, the Signed-off-by lines seem to be missing, plus we need a per-patch changelog as well. Thanks, Ingo --
Yeah, please, and I hadn't merged them, so the resend was the best thing to do. - Arnaldo --
That certainly works, though automatic association of guest data with There is also a window between setting the flag and calling 'int $2' where an NMI might happen and be accounted incorrectly. Perhaps separate the 'int $2' into a direct call into perf and another call for the rest of NMI handling. I don't see how it would work on svm though - AFAICT the NMI is held whereas vmx swallows it. I guess NMIs You will need access to the vcpu pointer (kvm_rip_read() needs it), you can put it in a percpu variable. I guess if it's not null, you know you're in a guest, so no need for PF_VCPU. -- error compiling committee.c: too many arguments to function --
Thanks. Originally, I planned to add a -G parameter to perf, such as -G 8888:/XXX/XXX/guestkallsyms:/XXX/XXX/modules,8889:/XXX/XXX/guestkallsyms:/XXX/XXX/modules where 8888 and 8889 are just qemu guest pids. So we could define multiple guest os symbol files. But it seems ugly, and 'perf kvm report --sort pid' and 'perf kvm top --pid' could provide I'm not sure if vmexit does break NMI context or not. Hardware NMI context Good suggestion. Thanks. --
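For illustration only, the comma-separated pid:kallsyms:modules format of the -G parameter considered above could be parsed roughly as follows. This option was dropped in the discussion and never specified in detail, so the struct, the function names and the path length limit below are all assumptions made for this sketch.

```c
#include <stdio.h>
#include <string.h>

#define GUEST_PATH_MAX 256	/* assumed limit, for this sketch only */

/* Hypothetical description of one guest: qemu pid plus symbol files. */
struct guest_spec {
	int pid;
	char kallsyms[GUEST_PATH_MAX];
	char modules[GUEST_PATH_MAX];
};

/* Parse one "pid:kallsyms:modules" entry. Returns 0 on success, -1 on error. */
static int parse_guest_entry(const char *entry, struct guest_spec *out)
{
	int consumed = 0;

	if (sscanf(entry, "%d:%255[^:]:%255[^,]%n", &out->pid,
		   out->kallsyms, out->modules, &consumed) != 3)
		return -1;
	/* Reject trailing garbage, e.g. an over-long modules path. */
	return entry[consumed] == '\0' ? 0 : -1;
}

/*
 * Parse a comma-separated list of entries into specs[0..max-1].
 * Returns the number of guests parsed, or -1 on any malformed entry.
 */
static int parse_guest_list(const char *arg, struct guest_spec *specs, int max)
{
	int n = 0;

	while (*arg) {
		const char *comma = strchr(arg, ',');
		size_t len = comma ? (size_t)(comma - arg) : strlen(arg);
		char entry[2 * GUEST_PATH_MAX + 32];

		if (n == max || len >= sizeof(entry))
			return -1;
		memcpy(entry, arg, len);
		entry[len] = '\0';
		if (parse_guest_entry(entry, &specs[n]) < 0)
			return -1;
		n++;
		arg += len + (comma ? 1 : 0);
	}
	return n;
}
```

Even in this small form the format shows the drawback mentioned above: every guest needs two long paths typed on the command line, which is what the later --guestmount idea tries to avoid.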
After more checking, I think VMX won't retain the NMI-blocked state for the host. That means, if an NMI happens while the processor is in VMX non-root mode, it only results in a VMExit, with a reason indicating that it was due to an NMI, but no further state change in the host. So in that sense, there _is_ a window between the VMExit and KVM handling the NMI. Moreover, I think we _can't_ stop the re-entrance of the NMI handling code, because "int $2" has no effect on blocking a following NMI. And if the NMI sequence is not important (I think so), then we need to generate a real NMI in the current after-vmexit code. Letting the APIC send an NMI IPI to itself seems to be a good idea. I am debugging a patch based on apic->send_IPI_self(NMI_VECTOR) to replace "int $2". Something unexpected is happening... -- regards Yang, Sheng --
That's pretty bad, as NMI runs on a separate stack (via IST). So if another NMI happens while our int $2 is running, the stack will be I think you need DM_NMI for that to work correctly. An alternative is to call the NMI handler directly. -- error compiling committee.c: too many arguments to function --
Though hardware didn't provide this kind of block, software at least would warn about it... nmi_enter() still would be executed by "int $2", and result in BUG() if we are already in NMI context(OK, it is a little better than apic_send_IPI_self() already took care of APIC_DM_NMI. And NMI handler would block the following NMI? -- regards Yang, Sheng --
It wouldn't - won't work without extensive changes. -- error compiling committee.c: too many arguments to function --
You can't use the APIC to send vectors 0x00-0x1f, or at least, aren't supposed to be able to. Zach --
Um? Why? Especially since the kernel is already using it to deliver NMI. -- regards Yang, Sheng --
That's the only defined case, and it is defined because the vector field is ignored for DM_NMI. Vol 3A (exact section numbers may vary depending on your version):
8.5.1 / 8.6.1: '100 (NMI) Delivers an NMI interrupt to the target processor or processors. The vector information is ignored'
8.5.2 Valid Interrupt Vectors: 'Local and I/O APICs support 240 of these vectors (in the range of 16 to 255) as valid interrupts.'
8.8.4 Interrupt Acceptance for Fixed Interrupts: '...; vectors 0 through 15 are reserved by the APIC (see also: Section 8.5.2, "Valid Interrupt Vectors")'
So I misremembered: apparently you can deliver interrupts 0x10-0x1f, but vectors 0x00-0x0f are not valid to send via APIC or I/O APIC. Zach --
As you pointed out, NMI is not a "Fixed interrupt". If we want to send an NMI, it needs a specific delivery mode rather than a vector number. And if you look at the code, if we specify NMI_VECTOR, the delivery mode is set to NMI. So what's wrong here? -- regards Yang, Sheng --
OK, I think I understand your points now. You meant that these vectors can't be filled into the vector field directly, right? But NMI is an exception due to DM_NMI. Is that your point? I think we agree on this. -- regards Yang, Sheng --
Yes, I think we agree. NMI is the only vector in 0x0-0xf which can be sent via self-IPI because the vector itself does not matter for NMI. Zach --
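The rule the two of them converge on can be written down as a small check. This is only a sketch of the rule as quoted from the manual in this sub-thread (vectors 0 through 15 are reserved for fixed delivery, and the vector field is ignored for NMI delivery mode). The enum and function names are made up for illustration and are not kernel APIs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Delivery-mode encodings of the APIC "delivery mode" field. */
enum apic_delivery_mode {
	APIC_MODE_FIXED = 0x0,	/* 000b: deliver the interrupt named by the vector */
	APIC_MODE_NMI   = 0x4,	/* 100b: deliver an NMI, vector field is ignored */
};

/*
 * Hypothetical helper: may this (delivery mode, vector) pair be sent
 * through the local APIC or an I/O APIC?
 */
static bool apic_request_is_valid(enum apic_delivery_mode mode, uint8_t vector)
{
	/* For NMI delivery the vector does not matter at all. */
	if (mode == APIC_MODE_NMI)
		return true;

	/* Fixed delivery: vectors 0..15 are reserved, 16..255 are valid. */
	return vector >= 16;
}
```

So a self-IPI with NMI delivery mode is acceptable whatever the vector field holds, while vector 2 with fixed delivery mode is not.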
Here is the new patch of V2 against tip/master of March 17th
if anyone wants to try it.
ChangeLog V2:
1) Based on Avi's suggestion, I moved callback functions
to generic code area. So the kernel part of the patch is
clearer.
2) Add 'perf kvm stat'.
From: Zhang, Yanmin <yanmin_zhang@linux.intel.com>
Based on the discussion in KVM community, I worked out the patch to support
perf to collect guest os statistics from host side. This patch is implemented
with Ingo, Peter and some other guys' kind help. Yang Sheng pointed out a
critical bug and provided good suggestions with other guys. I really appreciate
their kind help.
The patch adds new subcommand kvm to perf.
perf kvm top
perf kvm record
perf kvm report
perf kvm diff
perf kvm stat
The new perf could profile guest os kernel except guest os user space, but it
could summarize guest os user space utilization per guest os.
Below are some examples.
1) perf kvm top
[root@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms
--guestmodules=/home/ymzhang/guest/modules top
--------------------------------------------------------------------------------------------------------------------------
PerfTop: 16010 irqs/sec kernel:59.1% us: 1.5% guest kernel:31.9% guest us: 7.5% exact: 0.0% [1000Hz cycles], (all, 16 CPUs)
--------------------------------------------------------------------------------------------------------------------------
samples pcnt function DSO
_______ _____ _________________________ _______________________
38770.00 20.4% __ticket_spin_lock [guest.kernel.kallsyms]
22560.00 11.9% ftrace_likely_update [kernel.kallsyms]
9208.00 4.8% __lock_acquire [kernel.kallsyms]
5473.00 2.9% trace_hardirqs_off_caller [kernel.kallsyms]
5222.00 2.7% copy_user_generic_string [guest.kernel.kallsyms]
4450.00 2.3% ...

Nice progress!

Will really be painful to developers - to enter that long line while we have these things called 'computers' that ought to reduce human work. Also, it's incomplete, we need access to the guest system's binaries to do ELF symbol resolution and dwarf decoding.

So we really need some good, automatic way to get to the guest symbol space, so that if a developer types: perf kvm top Then the obvious thing happens by default. (which is to show the guest overhead)

There's no technical barrier on the perf tooling side to implement all that: perf supports build-ids extensively and can deal with multiple symbol spaces - as long as it has access to it. The guest kernel could be ID-ed based on its /sys/kernel/notes and /sys/module/*/notes/.note.gnu.build-id build-ids.

So some sort of --guestmount option would be the natural solution, which points to the guest system's root: and a Qemu enumeration of guest mounts (which would be off by default and configurable) from which perf can pick up the target guest all automatically. (obviously only under allowed permissions so that such access is secure)

This would allow not just kallsyms access via $guest/proc/kallsyms but also gives us the full space of symbol features: access to the guest binaries for annotation and general symbol resolution, command/binary name identification, etc.

Such a mount would obviously not broaden existing privileges - and as an additional control a guest would also have a way to indicate that it does not wish a guest mount at all.

Unfortunately, in a previous thread the Qemu maintainer has indicated that he will essentially NAK any attempt to enhance Qemu to provide an easily discoverable, self-contained, transparent guest mount on the host side. No technical justification was given for that NAK, despite my repeated requests to articulate the exact security problems that such an approach would cause. If that NAK does not stand in that form then i'd like ...

I still think it is the best and most generic way to let the guest do the symbol resolution. This has several advantages:
1. The guest knows best about its symbol space. So this would be extensible to other guest operating systems. A brave developer may even implement symbol passing for Windows or the BSDs ;-)
2. The guest can decide on its own if it wants to pass this information to the host-perf. No security issues at all.
3. The guest can also pass us the call-chain and we don't need to care about the complications of fetching it from the guest ourselves.
4. This way is extensible to nested virtualization too.
How we speak to the guest was already discussed in this thread. My personal opinion is that going through qemu is an unnecessary step and we can solve that in a more clever way that is transparent for perf. Joerg --
Having access to the actual executable files that include the symbols achieves precisely that - with the additional robustness that all this functionality is concentrated into the host, while the guest side is kept minimal (and

It can decide whether it exposes the files. Nor are there any "security

You need to be aware of the fact that symbol resolution is a separate step from call chain generation. I.e. call-chains are a (entirely) separate issue, and could reasonably be done in the guest or in the host.

Nested virtualization is actually already taken care of by the filesystem solution via an existing method called 'subdirectories'. If the guest offers sub-guests then those symbols will be exposed in a similar way via its own 'guest files' directory hierarchy. I.e. if we have 'Guest-2' nested inside the 'Guest-Fedora-1' instance, we get:
/guests/
/guests/Guest-Fedora-1/etc/
/guests/Guest-Fedora-1/usr/
we'd also have:
/guests/Guest-Fedora-1/guests/Guest-2/
So this is taken care of automatically. I.e. none of the four 'advantages' listed here are actually advantages over my

Meaning exactly what? Thanks, Ingo --
If you want to access the guest's file-system you need a piece of software running in the guest which gives you this access. But when you get an event this piece of software may not be runnable (if the guest is in an interrupt handler or any other non-preemptible code path). When the host finally gets access to the guest's filesystem again the source of that event may already be gone (process has exited, module unloaded...). The only way to solve that is to pass the event information to the guest I am not talking about security. Security was sufficiently flamed about Avi was against that but I think it would make sense to give names to virtual machines (with a default, similar to network interface names). Then we can create a directory in /dev/ with that name (e.g. /dev/vm/fedora/). Inside the guest a (privileged) process can create some kind of named virt-pipe which results in a device file created in the guest's directory (perf could create /dev/vm/fedora/perf for example). This file is used for guest-host communication. Thanks, Joerg --
You were talking about security, in the portion of your mail that you snipped I understood that portion to mean what it says: that your claim that your All i saw was my suggestion to allow a guest to securely (and scalably and conveniently) integrate/mount its filesystems to the host if both sides (both the host and the guest) permit it, to make it easier for instrumentation to pick up symbol details. I.e. if a guest runs then its filesystem may be present on the host side as:
/guests/Fedora-G1/
/guests/Fedora-G1/proc/
/guests/Fedora-G1/usr/
/guests/Fedora-G1/.../
( This feature would be configurable and would be default-off, to maintain the current status quo. )
i.e. it's a bit like sshfs or NFS or loopback block mounts, just in an integrated and working fashion (sshfs doesn't work well with /proc for example) and more guest transparent (obviously sshfs or NFS exports need per guest configuration), and lower overhead than sshfs/NFS - i.e. without the (unnecessary) networking overhead. That suggestion was 'countered' by an unsubstantiated claim by Anthony that this kind of usability feature would somehow be a 'security nightmare'. In reality it is just an incremental, more usable, faster and more guest-transparent form of what is already possible today via:
- loopback mounts on host
- NFS exports
- SMB exports
- sshfs
- (and other mechanisms)
I wish there was at least flaming about it - as flames tend to have at least some specifics in them. What i saw instead was a claim about a 'security nightmare', which, when i asked for specifics, was followed by deafening silence. And you appear to have repeated that claim here, unwilling to back it up with specifics. Thanks, Ingo --
The very same is true of profiling in the host space as well (KVM is nothing special here, other than its unreasonable insistence on not enumerating readily available information in a more usable way). So are you suggesting a solution to a perf problem we already solved differently? (and which i argue we solved in a better way) We have solved that in the host space already (and quite elaborately so), and not via your suggestion of moving symbol resolution to a different stage, but by properly generating the right events to allow the post-processing stage to see processes that have already exited, to robustly handle files that have been rebuilt, etc. From an instrumentation POV it is fundamentally better to acquire the right data and delay any complexities to the analysis stage (the perf model) than to complicate sampling (the oprofile dcookies model). Your proposal of 'doing the symbol resolution in the guest context' is in essence re-arguing that very similar point that oprofile lost. Did you really intend to re-argue that point as well? If yes then please propose an alternative implementation for everything that perf does wrt. symbol lookups. What we propose for 'perf kvm' right now is simply a straight-forward extension of the existing (and well working) symbol handling code to Best would be if you demonstrated any problems of the perf symbol lookup code you are aware of on the host side, as it has that exact design you are criticising here. We are eager to fix any bugs in it. If you claim that it's buggy then that should very much be demonstrable - no need to go into theoretical arguments about it. ( You should be aware of the fact that perf currently works with 'processes exiting prematurely' and similar scenarios just fine, so if you want to That is kind of half of my suggestion - the built-in enumeration guests and a guaranteed channel to them accessible to tools. (KVM already has its own special channel so it's not like channels ...
I am not claiming anything. I just try to imagine what your proposal will look like in practice and forgot that symbol resolution is done at a later point. But even with deferred symbol resolution we need more information from the guest than just the rip falling out of KVM. The guest needs to tell us about the process where the event happened (information that the host has about itself without any hassle) and which executable-files it was Probably. At least it is the solution that fits best into the current design of perf. But we should think about how this will be done. Raw disk access is no solution because we need to access virtual file-systems of the guest too. Network filesystems may be a solution but then we come back to the 'deployment-nightmare'. Joerg --
Correct - for full information we need a good paravirt perf integration of the kernel bits to pass that through. (I.e. we want to 'integrate' the PID space as well, at least within the perf notion of PIDs.) I never said anything about 'raw disk access'. Have you seen my proposal of (optional) VFS namespace integration? (It can be found repeated the Nth time in my mail you replied to) Thanks, Ingo --
Slightly tangential, but there is another case that has some of the same problems: profiling other language runtimes than C and C++, say Python. At the moment profilers will generally tell you what is going on inside the python runtime, but not what the python program itself is doing. To fix that problem, it seems like we need some way to have python export what is going on. Maybe the same mechanism could be used to both access what is going on in qemu and python. Soren --
oprofile already has an interface to let JITs export information about the JITed code. C Python is not a JIT, but presumably one of the python JITs could do it. http://oprofile.sourceforge.net/doc/devel/index.html I know it's not en vogue anymore and you won't be an approved cool kid if you do, but you could just use oprofile? Ok presumably one would need to do a python interface for this first. I believe it's currently only implemented for Java and Mono. I presume it might work today with IronPython on Mono. IMHO it doesn't make sense to invent another interface for this, although I'm sure someone will propose just that. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
It's not that I personally want to profile a particular python
program. I'm interested in the more general problem of extracting more
information from profiled user space programs than just stack traces.
Examples:
- What is going on inside QEMU?
- Which client is the X server servicing?
- Which parts of a python/shell/scheme/javascript program are
taking the most CPU time?
I don't think the oprofile JIT interface solves any of these
problems. (In fact, I don't see why the JIT problem is even hard. The
JIT compiler can just generate a little ELF file with symbols in it,
and the profiler can pick it up through the mmap events that you get
I am bringing this up because I want to extend sysprof to be more
useful.
Soren
--
I suspect for those you rather need event based tracers of some sort, similar to kernel trace points. Otherwise you would need own separate stacks and other complications. systemtap has some effort to use the dtrace instrumentation that crops up in more and more user programs for this. It wouldn't surprise me if that was already in python and other programs you're interested in. I presume right now it only works if you apply the utrace monstrosity though, but perhaps the new uprobes patches floating around will come to rescue. There also was some effort to have a pure user space daemon based approach for LTT, but I believe that currently needs own trace points. Again I fully expect someone to reinvent the wheel here That would require keeping those temporary ELF files for potentially unlimited time around (profilers today look at the ELF files at the final analysis phase, which might be weeks away) Also that would be a lot of overhead for the JIT and most likely be a larger scale rewrite for a given JIT code base. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
'perf record' will traverse the perf.data file just collected and, if the binaries have build-ids, will stash them in ~/.debug/, keyed by build-id just like the -debuginfo packages do. So only the binaries with hits. Also one can use 'perf archive' to create a tar.bz2 file with the files with hits for the specified perf.data file, that can then be transferred to another machine, whatever arch, untarred at ~/.debug and then the report can be done there. As it is done by build-id, multiple 'perf record' sessions share files in the cache. Right now the whole ELF file (or /proc/kallsyms copy) is stored if collected from the DSO directly, or the bits that are stored in -debuginfo files if we find it installed (so smaller). We could strip that down further by storing just the ELF sections needed to make sense of the symtab. - Arnaldo --
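To illustrate why keying the cache by build-id lets several 'perf record' sessions share entries, here is a minimal Python sketch. The on-disk layout it uses (a `.build-id/` directory split on the first two hex digits, in the style of the -debuginfo packages) is an assumption for illustration, as is the helper name `buildid_cache_path`; the mail above only states that the cache lives under ~/.debug/ and is keyed by build-id.

```python
import os


def buildid_cache_path(build_id_hex: str, cache_dir: str = "~/.debug") -> str:
    """Map a 20-byte build-id (40 hex digits) to a path inside the cache.

    Assumed layout: <cache_dir>/.build-id/<first 2 hex digits>/<remaining 38>.
    """
    build_id_hex = build_id_hex.lower()
    if len(build_id_hex) != 40 or any(c not in "0123456789abcdef" for c in build_id_hex):
        raise ValueError("expected 40 hex digits (a 20-byte build-id)")
    return os.path.join(
        os.path.expanduser(cache_dir),
        ".build-id",
        build_id_hex[:2],
        build_id_hex[2:],
    )
```

Because the path depends only on the build-id, two sessions that hit the same binary resolve to the same cache entry, and a rebuilt binary with a new build-id gets a separate one.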
These kinds of questions usually require navigation through internal data of the user-space process ("Where in this linked list is this pointer?"), and often also correlating them with history ("which socket/fd was most recently serviced?"). Systemtap excels at letting one express such things. - FChE --
perf also has support for this and Pekka Enberg's jato uses it: http://penberg.blogspot.com/2009/06/jato-has-profiler.html - Arnaldo --
Right, we need to move that into a library though (always meant to do that, never got around to doing it). That way the app can link against a dso with weak empty stubs and have perf record LD_PRELOAD a version that has a suitable implementation. That all has the advantage of not exposing the actual interface like we do now. --
Yes, I agree with you and Avi that we need the enhancement to be user-friendly. One of my starting points is to keep the tool less dependent on other components. Admin/developers could write script wrappers quickly if I tried sshfs quickly. sshfs could mount root filesystem of guest os nicely. I could access the files quickly. However, it doesn't work when I access /proc/ and /sys/ because sshfs/scp depend on file size while the sizes of most If sshfs could access /proc/ and /sys correctly, here is a design: --guestmount points to a directory which consists of a list of sub-directories. Every sub-directory's name is just the qemu process id of guest os. Admin/developer mounts every guest os instance's root directory to corresponding sub-directory. Then, perf could access all files. It's possible because a guest os instance happens to run as multiple threads in one process. One of the defects is the accessing to --
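The --guestmount layout proposed above (one sub-directory per guest, named after the qemu process id, with the guest's root mounted there) could be walked roughly as follows. This is a hedged Python sketch of the directory convention only; `guest_kallsyms_paths` is a hypothetical helper, not part of perf, and it assumes /proc is readable under each mount, which the mail notes is not the case with sshfs today.

```python
import os


def guest_kallsyms_paths(guestmount: str) -> dict:
    """Map each qemu pid found under guestmount to that guest's kallsyms path."""
    paths = {}
    for entry in sorted(os.listdir(guestmount)):
        # Sub-directory names are qemu process ids; skip anything else.
        if not entry.isdigit():
            continue
        root = os.path.join(guestmount, entry)
        if os.path.isdir(root):
            paths[int(entry)] = os.path.join(root, "proc", "kallsyms")
    return paths
```

A tool that sees a guest sample tagged with the qemu pid could then look up the matching symbol file in the returned mapping.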
If the MMAP events on the guest included a cookie that could later be used to query for the symtab of that DSO, we wouldn't need to access the guest FS at all, right? With build-ids and debuginfo-install like tools the symbol resolution could be performed by using the cookies (build-ids) as keys to get to the *-debuginfo packages with matching symtabs (and DWARF for source annotation, etc). We have that for the kernel as: [acme@doppio linux-2.6-tip]$ l /sys/kernel/notes -r--r--r-- 1 root root 36 2010-03-22 13:14 /sys/kernel/notes [acme@doppio linux-2.6-tip]$ l /sys/module/ipv6/sections/.note.gnu.build-id -r--r--r-- 1 root root 4096 2010-03-22 13:38 /sys/module/ipv6/sections/.note.gnu.build-id [acme@doppio linux-2.6-tip]$ That way we would cover DSOs being reinstalled in long running 'perf record' sessions too. This was discussed some time ago but would require help from the bits that load DSOs. build-ids then would be first class citizens. - Arnaldo --
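For reference, the build-id in files such as /sys/kernel/notes is stored as an ELF note: a 12-byte header (name size, descriptor size, type), then the name and the descriptor, each padded to 4 bytes. A GNU build-id note with a 20-byte id is therefore 12 + 4 + 20 = 36 bytes, matching the size in the listing above. The following Python sketch parses that format; it assumes little-endian notes (as on x86), and the function name `find_build_id` is illustrative.

```python
import struct

NT_GNU_BUILD_ID = 3  # note type used for GNU build-ids


def _align4(n: int) -> int:
    """Round n up to the next multiple of 4, as ELF note fields are padded."""
    return (n + 3) & ~3


def find_build_id(notes: bytes):
    """Return the build-id as a hex string, or None if no such note exists."""
    off = 0
    while off + 12 <= len(notes):
        # Assumption: little-endian 32-bit header fields.
        namesz, descsz, ntype = struct.unpack_from("<III", notes, off)
        off += 12
        name = notes[off:off + namesz]
        off += _align4(namesz)
        desc = notes[off:off + descsz]
        off += _align4(descsz)
        if ntype == NT_GNU_BUILD_ID and name == b"GNU\0":
            return desc.hex()
    return None
```

On a host this could be called as `find_build_id(open("/sys/kernel/notes", "rb").read())`; the loop is there because a notes file may carry more than one note.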
It depends on specific sub commands. As for 'perf kvm top', developers want to see the profiling immediately. Even with 'perf kvm record', developers also want to see results quickly. At least I'm eager for the results when investigating We can't make sure guest os uses the same os images, or don't know where we could find the original DVD images being used to install guest os. Current perf does save build id, including both kernel's and other application --
That is not a problem, if you have the relevant buildids in your cache (Look in your machine at ~/.debug/), it will be as fast as ever. If you use a distro that has its userspace with build-ids, you probably You don't have to have guest and host sharing the same OS image, you just have to somehow populate your buildid cache with what you need, be it using sshfs or what Ingo is suggesting once, or using what your vendor provides (debuginfo packages). And you just have to do it once, But it doesn't fully support that right now, as I explained, build-ids are collected at the end of the record session, because we have to open the DSOs that had hits to get the 20-byte cookie we need, the build-id. If we had it in the PERF_RECORD_MMAP record, we would close this race, and the added cost at load time should be minimal, to get the ELF section with it and put it somewhere in task struct. If only we could coalesce it a bit to reclaim this:
[acme@doppio linux-2.6-tip]$ pahole -C task_struct ../build/v2.6.34-rc1-tip+/kernel/sched.o | tail -5
/* size: 5968, cachelines: 94, members: 150 */
/* sum members: 5943, holes: 7, sum holes: 25 */
/* bit holes: 1, sum bit holes: 28 bits */
/* last cacheline: 16 bytes */
};
[acme@doppio linux-2.6-tip]$
8-) Or at least get just one of those 4-byte holes then we could stick it at the end to get our build-id there, accessing it would be done only at PERF_RECORD_MMAP injection time, i.e. close to the time when we actually are loading the executable mmap, i.e. close to the time when the loader is injecting the build-id, I guess the extra memory and - Arnaldo --
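The pahole figures quoted above are internally consistent, which a few lines of arithmetic confirm. This Python sketch assumes 64-byte cachelines (not stated in the mail, but implied by 94 cachelines for 5968 bytes); the helper name `layout_summary` is made up for the example.

```python
def layout_summary(sum_members: int, sum_holes: int, cacheline: int = 64):
    """Recompute struct size, cacheline count and last-cacheline usage."""
    size = sum_members + sum_holes       # bytes of members plus byte holes
    cachelines = -(-size // cacheline)   # ceiling division
    last = size % cacheline or cacheline # bytes used in the final cacheline
    return size, cachelines, last


# Figures from the pahole output for task_struct quoted above.
size, cachelines, last = layout_summary(sum_members=5943, sum_holes=25)
print(size, cachelines, last)  # 5968 94 16
```

The 25 bytes of holes are what the mail jokes about reclaiming: they are already paid for in the 5968-byte size, so closing them would free roughly the space of a 20-byte build-id without growing the structure.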
