Hello folks, In this series, I present a couple of fixes for kvmclock. In patch 1, a guest-side fix is proposed for a problem that is biting us for quite a while now. the tsc inside VMs does not seem to be that good, (up to now, only single-socket nehalems were stable enough), and we're seeing small (but nevertheless wrong) time warps inside SMP guests. I am proposing the fix to reside on common code in pvclock.c, but would be good to hear Jeremy on this. On the other 3 patches, I change kvmclock MSR numbers in a compatible fashion. Both MSR sets will be supported for a while. Patch 5 adds documentation about kvmclock, which to date, we lacked. Glauber Costa (5): Add a global synchronization point for pvclock change msr numbers for kvmclock Try using new kvm clock msrs export new cpuid KVM_CAP add documentation about kvmclock Documentation/kvm/kvmclock.txt | 138 +++++++++++++++++++++++++++++++++++++++ arch/x86/include/asm/kvm_para.h | 12 +++- arch/x86/kernel/kvmclock.c | 68 +++++++++++++------ arch/x86/kernel/pvclock.c | 23 +++++++ arch/x86/kvm/x86.c | 13 ++++- include/linux/kvm.h | 1 + 6 files changed, 231 insertions(+), 24 deletions(-) create mode 100644 Documentation/kvm/kvmclock.txt --
Avi pointed out a while ago that those MSRs falls into the pentium
PMU range. So the idea here is to add new ones, and after a while,
deprecate the old ones.
Signed-off-by: Glauber Costa <glommer@redhat.com>
---
arch/x86/include/asm/kvm_para.h | 8 ++++++--
arch/x86/kvm/x86.c | 7 ++++++-
2 files changed, 12 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index ffae142..0cffb96 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -17,8 +17,12 @@
#define KVM_FEATURE_NOP_IO_DELAY 1
#define KVM_FEATURE_MMU_OP 2
-#define MSR_KVM_WALL_CLOCK 0x11
-#define MSR_KVM_SYSTEM_TIME 0x12
+#define MSR_KVM_WALL_CLOCK_OLD 0x11
+#define MSR_KVM_SYSTEM_TIME_OLD 0x12
+
+/* Custom MSRs falls in the range 0x4b564d00-0x4b564dff */
+#define MSR_KVM_WALL_CLOCK 0x4b564d00
+#define MSR_KVM_SYSTEM_TIME 0x4b564d01
#define KVM_MAX_MMU_OP_BATCH 32
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8824b73..714aae2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -575,9 +575,10 @@ static inline u32 bit(int bitno)
* kvm-specific. Those are put in the beginning of the list.
*/
-#define KVM_SAVE_MSRS_BEGIN 5
+#define KVM_SAVE_MSRS_BEGIN 7
static u32 msrs_to_save[] = {
MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
+ MSR_KVM_SYSTEM_TIME_OLD, MSR_KVM_WALL_CLOCK_OLD,
HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
HV_X64_MSR_APIC_ASSIST_PAGE,
MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
@@ -1099,10 +1100,12 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
case MSR_IA32_MISC_ENABLE:
vcpu->arch.ia32_misc_enable_msr = data;
break;
+ case MSR_KVM_WALL_CLOCK_OLD:
case MSR_KVM_WALL_CLOCK:
vcpu->kvm->arch.wall_clock = data;
kvm_write_wall_clock(vcpu->kvm, data);
break;
+ case MSR_KVM_SYSTEM_TIME_OLD:
case MSR_KVM_SYSTEM_TIME: {
if (vcpu->arch.time_page) {
...We now added a new set of clock-related msrs in replacement of the old
ones. In theory, we could just try to use them and get a return value
indicating they do not exist, due to our use of kvm_write_msr_save.
However, kvm clock registration happens very early, and if we ever
try to write to a non-existant MSR, we raise a lethal #GP, since our
idt handlers are not in place yet.
So this patch tests for a cpuid feature exported by the host to
decide which set of msrs are supported.
Signed-off-by: Glauber Costa <glommer@redhat.com>
---
arch/x86/include/asm/kvm_para.h | 4 ++
arch/x86/kernel/kvmclock.c | 68 +++++++++++++++++++++++++++------------
2 files changed, 51 insertions(+), 21 deletions(-)
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 0cffb96..a32710a 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -16,6 +16,10 @@
#define KVM_FEATURE_CLOCKSOURCE 0
#define KVM_FEATURE_NOP_IO_DELAY 1
#define KVM_FEATURE_MMU_OP 2
+/* We could just try to use new msr values, but they are queried very early,
+ * kernel does not have idt handlers yet, and failures are fatal */
+#define KVM_FEATURE_CLOCKSOURCE2 3
+
#define MSR_KVM_WALL_CLOCK_OLD 0x11
#define MSR_KVM_SYSTEM_TIME_OLD 0x12
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index feaeb0d..6d814ce 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -29,6 +29,7 @@
#define KVM_SCALE 22
static int kvmclock = 1;
+static int kvm_use_new_msrs = 0;
static int parse_no_kvmclock(char *arg)
{
@@ -41,6 +42,18 @@ early_param("no-kvmclock", parse_no_kvmclock);
static DEFINE_PER_CPU_SHARED_ALIGNED(struct pvclock_vcpu_time_info, hv_clock);
static struct pvclock_wall_clock wall_clock;
+static int kvm_system_time_write_value(int low, int high)
+{
+ if (kvm_use_new_msrs)
+ return native_write_msr_safe(MSR_KVM_SYSTEM_TIME_OLD, low, high);
+ else
+ return ...This comment has no place in a header that's oriented to userspace. Instead of this and all those if ()s all over the place, you can define variables for the MSR indices. Set them up on initialization, the rest of the code just looks them up. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
This is exposed to userspace. Userspace that is compiled with the new headers, but runs on an old kernel, will break. So you need to keep the old names, and define a new KVM_FEATURE for the new names. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
In recent stress tests, it was found that pvclock-based systems could seriously warp in smp systems. Using ingo's time-warp-test.c, I could trigger a scenario as bad as 1.5mi warps a minute in some systems. (to be fair, it wasn't that bad in most of them). Investigating further, I found out that such warps were caused by the very offset-based calculation pvclock is based on. This happens even on some machines that report constant_tsc in its tsc flags, specially on multi-socket ones. Two reads of the same kernel timestamp at approx the same time, will likely have tsc timestamped in different occasions too. This means the delta we calculate is unpredictable at best, and can probably be smaller in a cpu that is legitimately reading clock in a forward ocasion. Some adjustments on the host could make this window less likely to happen, but still, it pretty much poses as an intrinsic problem of the mechanism. A while ago, I though about using a shared variable anyway, to hold clock last state, but gave up due to the high contention locking was likely to introduce, possibly rendering the thing useless on big machines. I argue, however, that locking is not necessary. We do a read-and-return sequence in pvclock, and between read and return, the global value can have changed. However, it can only have changed by means of an addition of a positive value. So if we detected that our clock timestamp is less than the current global, we know that we need to return a higher one, even though it is not exactly the one we compared to. OTOH, if we detect we're greater than the current time source, we atomically replace the value with our new readings. This do causes contention on big boxes (but big here means *BIG*), but it seems like a good trade off, since it provide us with a time source guaranteed to be stable wrt time warps. After this patch is applied, I don't see a single warp in time during 5 days of execution, in any of the machines I saw them before. Signed-off-by: Glauber ...
__cacheline_aligned_in_smp to avoid other data from sharing the Don't you need to handle wrap-around? --
Is the problem that the tscs are starting out of sync, or that they're
drifting relative to each other over time? Do the problems become worse
Does this need a barrier() to prevent the compiler from re-reading
last_value for the subsequent lines? Otherwise "(ret < last)" and
So if CPU A's tsc is, say, 50us behind CPU B's, then it will return a
static time (from last_time) for 50us until its tsc catched up - or if
CPU A happens to update last_time to give it a kick?
Is it worth trying to update CPU A's tsc offset at the same time?
J
--
This is one source of the problem, but the same thing happens at many levels... tsc may start out of sync, drift between sockets, be badly re-calibrated by the BIOS, etc... the issue persists even if the TSCs are perfectly in sync - the measurement of them is not. So reading TSC == 100,000 units at time A and then waiting 10 units, one may read TSC == 100,010 +/- 5 units because the code stream is not perfectly serialized - nor can it be. There will always be some amount of error unless running in perfect lock-step, which only happens in a simulator. This inherent measurement error can cause apparent time to go backwards when measured simultaneously across multiple CPUs, or when re-calibrating against an external clocksource. Combined with other factors as above, it can be of sufficient magnitude to be noticed. KVM clock is particularly exposed to the problem because the TSC is measured and recalibrated for each virtual CPU whenever there is a physical CPU switch, so micro-adjustments forwards and backwards may occur during the recalibration - and appear as a real backwards time warp to the guest. I have some patches to fix that issue, but the SMP problem remains to be fixed - and is addressed quite thoroughly by this patch. Zach --
Oh yes, just trying to avoid a patch with both atomic64_read() and ACCESS_ONCE(). -- error compiling committee.c: too many arguments to function --
you're mixing the private version of the patch you saw with this one. there isn't any atomic reads in here. I'll use a barrier then --
This patch writes last_value atomically, but reads it non-atomically. A barrier is insufficient. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
What avi says! :-) On a 32bit machine a 64bit read are two 32bit reads, so last = last_value; becomes: last.high = last_value.high; last.low = last_vlue.low; (or the reverse of course) Now imagine a write getting interleaved with that ;-) --
You could explicitly do:
do {
h = last.high;
barrier();
l = last.low;
barrier();
} while (last.high != h);
This works because we expect last to be always increasing, so the only
worry is low wrapping and incrementing high, and is more efficient than
making the read fully atomic (the write is still cmpxchg64). But it's
pretty ugly to open code just for 32b architectures; its something that
might be useful to turn into a general abstraction (monotonic_read_64
FTW!). I already have code like this in the Xen time code, so I could
make immediate use of it.
J
--
I don't think this is worthwhile - the cmpxchg is not that expensive on most kvm capable hosts (the exception is the Pentium D). btw, do you want this code in pvclock.c, or shall we keep it kvmclock specific? -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
I think its a pvclock-level fix. I'd been hoping to avoid having
something like this, but I think its ultimately necessary.
J
--
Did you observe drift on Xen, or is this "ultimately" pointing at the future? -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
People are reporting weirdnesses that "clocksource=jiffies" apparently
resolves. Xen and KVM are faced with the same hardware constraints, and
it wouldn't surprise me if there were small measurable
non-monotonicities in the PV clock under Xen. May as well be safe.
Of course, it kills any possibility of being able to usefully export
this interface down to usermode.
My main concern about this kind of simple fix is that if there's a long
term systematic drift between different CPU's tscs, then this will
somewhat mask the problem while giving really awful time measurement on
the "slow" CPU(s). In that case it really needs to adjust the scaling
factor to correct for the drift (*not* update the offset). But if we're
definitely only talking about fixed, relatively small time offsets then
it is fine.
J
--
Does the drift only occur on SMP VMs? --
Can you by any chance run ingo's time warp test on those machines? You need to define TOD to 1, and leave out the TSC test. For me, warps exists on every machine out there, but the nehalems, so far --
Plus, replace cpuid by lfence/mfence. cpuid will trap. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
I presume the clobbers are needed for cpuid; if you use [lm]fence then
they shouldn't be needed, right?
J
--
Right. But I'm not sure lfence/mfence are sufficient - from what I understand rdtsc can pass right through them. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
Does lfence / mfence actually serialize? I thought there was some great confusion about that not being the case on all AMD processors, and possibly not at all on Intel. A trap, however is a great way to serialize. I think, there is no serializing instruction which can be used from userspace which does not trap, at least, I don't know one off the top of my head. Zach --
rsm is not technically privileged... but not quite usable from usermode ;)
J
--
rsm under hardware virtualization makes my head hurt --
Either one independently is sufficient for me. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
iret. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
Well, on a 32b system, you can explicitly order the updates of low and
high, then do a high-low-checkhigh read. That would be much more
efficient than atomic64. If we really care about 32b.
J
--
Yes it is. But as I said, this seem to be a very deep worst case scenario. Most of boxes The offsets usually seem pretty small, under a microsecond. So I don't think it has anything to do with tscs starting out of sync. Specially because the delta-based calculation has the exact purpose of handling that case. --
So you think they're drifting out of sync from an initially synced
state? If so, what would bound the drift?
J
--
I think delta calculation introduces errors. Marcelo can probably confirm it, but he has a nehalem with an appearently very good tsc source. Even this machine warps. It stops warping if we only write pvclock data structure once and forget it, (which only updated tsc_timestamp once), according to him. Obviously, we can't do that everywhere. --
Yes. So its not as if the guest visible TSCs go out of sync (they don't on this machine Glauber mentioned, or even on a multi-core Core 2 Duo), but the delta calculation is very hard (if not impossible) to get right. The timewarps i've seen were in the 0-200ns range, and very rare (once --
For the record, we can only even do that in those machines. If we try to update time structures only once in machines with the X86_FEATURE_TSC_SAYS_IT_IS_OKAY_BUT_IN_REALITY_IS_NOT_OKAY feature flag, guests won't even boot. We can detect that, and besides doing calculation only once, also export some bit indicating that to the guest. Humm... I'm thinking now, that because of migration, we should check this bit every time, because we might have changed host. So instead of using an expensive cpuid check, we should probably use some bit in the vcpu_time_info structure, and use a cpuid bit just to say it is enabled. Jeremy, are you okay in turning one of the pad fields in the structure into a flags field? --
Right, we need a bit in the structure itself that says you can trust the time not to go backwards, and a bit in cpuid that says you can trust the bit in the structure (unless we can be sure no implementations leak garbage into the padding field). -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
There's a perfect way to do this and it still fails to stop timewarps. You can set the performance counters to overflow if more instructions are issued than your code path, run an assembly instruction stream and if the performance interrupt hits, restart the calibration. The calibration happens not just once, but on every migration, and currently, I believe, on every VCPU switch. Even if we reduce the number of calibrations to the bare minimum and rule out SMIs and NMIs, there will still be variation due to factors beyond our control because of the unpredictable nature of cache and instruction issue. However, X86_FEATURE_NONSTOP_TRULY_RELIABLE_TSC_HONESTLY does imply one key feature which the code is missing today: on SMP VMs, the calibration of kvmclock needs to be done only once, and the clock can then be used for all VCPUs. That, I think, stops Glauber's bug from appearing on the server side. I will spin that into my web of patches and send the cocoon out sometime this evening. Zach --
It's completely impractical. The PMU is a global resource that is already shared among users and the host; programming and restoring it is That's the plan. -- error compiling committee.c: too many arguments to function --
Please define a cpuid bit that makes this optional. When we eventually -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
On a 32-bit guest, that is. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
Right, do bear in mind that the x86 implementation of atomic64_read() is terrifyingly expensive, it is better to not do that read and simply use the result of the cmpxchg. --
atomic64_read() _is_ cmpxchg64b. Are you thinking of some clever implementation for smp i386? -- error compiling committee.c: too many arguments to function --
No, what I was suggesting was to rewrite that loop no to need the
initial read but use the cmpxchg result of the previous iteration.
Something like:
u64 last = 0;
/* more stuff */
do {
if (ret < last)
return last;
last = cmpxchg64(&last_value, last, ret);
} while (last != ret);
That only has a single cmpxchg8 in there per loop instead of two
(avoiding the atomic64_read() one).
--
Still have two cmpxchgs in the common case. The first iteration will fail, fetching last_value, the second will work. It will be better when we have contention, though, so it's worthwhile. -- error compiling committee.c: too many arguments to function --
Right, another option is to put the initial read outside of the loop,
that way you'll have the best of all cases, a single LOCK'ed op in the
loop, and only a single LOCK'ed op for the fast path on sensible
architectures ;-)
last = atomic64_read(&last_value);
do {
if (ret < last)
return last;
last = atomic64_cmpxchg(&last_value, last, ret);
} while (unlikely(last != ret));
Or so.
--
No. On i386, the statement
last = last_value;
will be split by the compiler into two 32-bit loads. If a write
(atomic, using cmpxchg) on another cpu happens between those two loads,
then the variable last will have a corrupted value.
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
--
Right, so on x86 we have: X86_FEATURE_CONSTANT_TSC, which only states that TSC is frequency independent, not that it doesn't stop in C states and similar fun stuff. X86_FEATURE_TSC_RELIABLE, which IIRC should indicate the TSC is constant and synced between cores. --
Sockets and boards too? (IOW, how reliable is TSC_RELIABLE)? -- error compiling committee.c: too many arguments to function --
Not sure, IIRC we clear that when the TSC sync test fails, eg when we mark the tsc clocksource unusable. --
Worrying. By the time we detect this the guest may already have gotten confused by clocks going backwards. -- error compiling committee.c: too many arguments to function --
Upstream, we are marking the TSC unstable preemptively when hardware which will eventually sync test is detected, so this should be fine. Zach --
ENOPARSE? -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
That works within a socket; but multiple socket machines need not feed all sockets from the same crystal and reset line (esp. likely for multiple board machines, but might even happen with single board FSB-less processors). -- error compiling committee.c: too many arguments to function --
Fun, we also have: X86_FEATURE_NONSTOP_TSC, which states the thing doesn't stop in C states. --
All of them? I though tsc stops in some mwait deep REM sleep thing. So what do we need? test for both TSC_RELIABLE and NONSTOP_TSC? IMO TSC_RELIABLE should imply NONSTOP_TSC. -- error compiling committee.c: too many arguments to function --
The idea is that TSC will not stop ever (except s2r like stuff), not Yeah, I think RELIABLE does imply NONSTOP and CONSTANT, but NONSTOP && CONSTANT does not make RELIABLE. --
and this maps to NONSTOP, so I think we're fine. -- error compiling committee.c: too many arguments to function --
Unfortunately this is breaking Xen save/restore: if you restore on a host which was booted more recently than the save host, causing the system time to be smaller. The effect is that the domain's time leaps forward to a fixed point, and stays there until the host catches up to the source host... I guess last_time needs to be reset on this type of event. I guess the cleanest way would be for pvclock.c to register a sysdev suspend/resume handler. --
Should be for Xen only; kvm save/restore doesn't involve the guest. -- error compiling committee.c: too many arguments to function --
the suspend/resume path do adjust the time base. Maybe something similar should be done ? --
Xen doesn't guarantee the system time is monotonic across those kinds of
events. The domain could maintain its own offset to maintain an
illusion of monotonicity, but I think its simpler to just zero
last_value on resume.
J
--
