Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1

From: Yoshiaki Tamura
Date: Tuesday, April 20, 2010 - 10:57 pm

Hi all,

We have been implementing a prototype of Kemari for KVM, and we're sending
this message to share what we have now along with our TODO lists.  We would like
to get early feedback to keep us heading in the right direction.  Although the
advanced approaches in the TODO lists are fascinating, we would like to run this
project step by step while absorbing comments from the community.  The current
code is based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.

For those who are new to Kemari for KVM, please take a look at the
following RFC which we posted last year.

http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html

The transmission/transaction protocol and most of the control logic are
implemented in QEMU.  However, we needed a hack in KVM to prevent rip from
proceeding before synchronizing VMs.  It may also need some plumbing on the
kernel side to guarantee replayability of certain events and instructions, to
integrate the RAS capabilities of newer x86 hardware with the HA stack, and
for optimization purposes, for example.

Before going into details, we would like to show how Kemari looks.  We prepared
a demonstration video at the following location.  For those who are not
interested in the code, please take a look.
The demonstration scenario is as follows:

1. Play with a guest VM that has virtio-blk and virtio-net.
# The guest image should be on NFS/SAN.
2. Start Kemari to synchronize the VM by running the following command in QEMU.
Just add the "-k" option to the usual migrate command.
migrate -d -k tcp:192.168.0.20:4444
3. Check the status by calling info migrate.
4. Go back to the VM to play chess animation.
5. Kill the VM. (VNC client also disappears)
6. Press "c" to continue the VM on the other host.
7. Bring up the VNC client (Sorry, it pops outside of video capture.)
8. Confirm that the chess animation ends, browser works fine, then shut down.

http://www.osrg.net/kemari/download/kemari-kvm-fc11.mov

The repository contains all patches we're sending with ...
From: Yoshiaki Tamura
Date: Tuesday, April 20, 2010 - 10:57 pm

When the -k option is passed to the migrate command, it turns on ft_mode to
start FT migration mode (Kemari).

Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
---
 migration.c     |    3 +++
 qemu-monitor.hx |    7 ++++---
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/migration.c b/migration.c
index c81fdb4..b288e82 100644
--- a/migration.c
+++ b/migration.c
@@ -109,6 +109,9 @@ int do_migrate(Monitor *mon, const QDict *qdict, QObject **ret_data)
         return -1;
     }
 
+    if (qdict_get_int(qdict, "ft"))
+        ft_mode = FT_INIT;
+        
     if (strstart(uri, "tcp:", &p)) {
         s = tcp_start_outgoing_migration(mon, p, max_throttle, detach,
                                          (int)qdict_get_int(qdict, "blk"), 
diff --git a/qemu-monitor.hx b/qemu-monitor.hx
index 16c45b7..22b72d9 100644
--- a/qemu-monitor.hx
+++ b/qemu-monitor.hx
@@ -765,13 +765,14 @@ ETEXI
 
     {
         .name       = "migrate",
-        .args_type  = "detach:-d,blk:-b,inc:-i,uri:s",
-        .params     = "[-d] [-b] [-i] uri",
+        .args_type  = "detach:-d,blk:-b,inc:-i,ft:-k,uri:s",
+        .params     = "[-d] [-b] [-i] [-k] uri",
         .help       = "migrate to URI (using -d to not wait for completion)"
 		      "\n\t\t\t -b for migration without shared storage with"
 		      " full copy of disk\n\t\t\t -i for migration without "
 		      "shared storage with incremental copy of disk "
-		      "(base image shared between src and destination)",
+		      "(base image shared between src and destination)"
+		      "\n\t\t\t -k for FT migration mode (Kemari)",
         .user_print = monitor_user_noop,	
 	.mhandler.cmd_new = do_migrate,
     },
-- 
1.7.0.31.g1df487

--

From: Yoshiaki Tamura
Date: Tuesday, April 20, 2010 - 10:57 pm

To utilize ft_transaction function, savevm needs interfaces to be
exported.

Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
---
 hw/hw.h  |    5 +++++
 savevm.c |   41 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 46 insertions(+), 0 deletions(-)

diff --git a/hw/hw.h b/hw/hw.h
index 10e6dda..fcee660 100644
--- a/hw/hw.h
+++ b/hw/hw.h
@@ -70,6 +70,8 @@ QEMUFile *qemu_fopen_ops(void *opaque, QEMUFilePutBufferFunc *put_buffer,
 QEMUFile *qemu_fopen(const char *filename, const char *mode);
 QEMUFile *qemu_fdopen(int fd, const char *mode);
 QEMUFile *qemu_fopen_socket(int fd);
+QEMUFile *qemu_fopen_transaction(int fd);
+QEMUFile *qemu_fopen_tranx_sender(void *opaque);
 QEMUFile *qemu_popen(FILE *popen_file, const char *mode);
 QEMUFile *qemu_popen_cmd(const char *command, const char *mode);
 int qemu_stdio_fd(QEMUFile *f);
@@ -81,6 +83,9 @@ void qemu_put_vector(QEMUFile *f, QEMUIOVector *qiov);
 void qemu_put_vector_prepare(QEMUFile *f);
 void *qemu_realloc_buffer(QEMUFile *f, int size);
 void qemu_clear_buffer(QEMUFile *f);
+int qemu_transaction_begin(QEMUFile *f);
+int qemu_transaction_commit(QEMUFile *f);
+int qemu_transaction_cancel(QEMUFile *f);
 
 static inline void qemu_put_ubyte(QEMUFile *f, unsigned int v)
 {
diff --git a/savevm.c b/savevm.c
index a401b27..292ae32 100644
--- a/savevm.c
+++ b/savevm.c
@@ -82,6 +82,7 @@
 #include "migration.h"
 #include "qemu_socket.h"
 #include "qemu-queue.h"
+#include "ft_transaction.h"
 
 /* point to the block driver where the snapshots are managed */
 static BlockDriverState *bs_snapshots;
@@ -210,6 +211,21 @@ static int socket_get_buffer(void *opaque, uint8_t *buf, int64_t pos, int size)
     return len;
 }
 
+static ssize_t socket_put_buffer(void *opaque, const void *buf, size_t size)
+{
+    QEMUFileSocket *s = opaque;
+    ssize_t len;
+
+    do {
+        len = send(s->fd, (void *)buf, size, 0);
+    } while (len == -1 && socket_error() == EINTR);
+
+    if (len == -1)
+        ...
From: Yoshiaki Tamura
Date: Tuesday, April 20, 2010 - 10:57 pm

QEMUFile currently doesn't support writev().  For sending multiple pieces
of data, such as pages, using writev() should be more efficient.

Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
---
 buffered_file.c |    2 +-
 hw/hw.h         |   16 ++++++++++++++++
 savevm.c        |   43 +++++++++++++++++++++++++------------------
 3 files changed, 42 insertions(+), 19 deletions(-)

diff --git a/buffered_file.c b/buffered_file.c
index 54dc6c2..187d1d4 100644
--- a/buffered_file.c
+++ b/buffered_file.c
@@ -256,7 +256,7 @@ QEMUFile *qemu_fopen_ops_buffered(void *opaque,
     s->wait_for_unfreeze = wait_for_unfreeze;
     s->close = close;
 
-    s->file = qemu_fopen_ops(s, buffered_put_buffer, NULL,
+    s->file = qemu_fopen_ops(s, buffered_put_buffer, NULL, NULL, NULL,
                              buffered_close, buffered_rate_limit,
                              buffered_set_rate_limit,
 			     buffered_get_rate_limit);
diff --git a/hw/hw.h b/hw/hw.h
index fc9ed29..921cf90 100644
--- a/hw/hw.h
+++ b/hw/hw.h
@@ -23,6 +23,13 @@
 typedef int (QEMUFilePutBufferFunc)(void *opaque, const uint8_t *buf,
                                     int64_t pos, int size);
 
+/* This function writes a chunk of vector to a file at the given position.
+ * The pos argument can be ignored if the file is only being used for
+ * streaming.
+ */
+typedef int (QEMUFilePutVectorFunc)(void *opaque, struct iovec *iov,
+                                    int64_t pos, int iovcnt);
+
 /* Read a chunk of data from a file at the given position.  The pos argument
  * can be ignored if the file is only be used for streaming.  The number of
  * bytes actually read should be returned.
@@ -30,6 +37,13 @@ typedef int (QEMUFilePutBufferFunc)(void *opaque, const uint8_t *buf,
 typedef int (QEMUFileGetBufferFunc)(void *opaque, uint8_t *buf,
                                     int64_t pos, int size);
 
+/* Read a chunk of vector from a file at the given position.  The pos argument
+ * can be ignored if ...
From: Anthony Liguori
Date: Thursday, April 22, 2010 - 12:28 pm

Is there performance data that backs this up?  Since QEMUFile uses a 
linear buffer for most operations that's limited to 16k, I suspect you 
wouldn't be able to observe a difference in practice.

Regards,


--

From: Yoshiaki Tamura
Date: Thursday, April 22, 2010 - 8:37 pm

I currently don't have data, but I'll prepare it.
There were two things I wanted to avoid.

1. Pages being copied to the QEMUFile buf through qemu_put_buffer.
2. Calling write() every time, even when we want to send multiple pages at once.

I think 2 may be negligible.
But 1 seems to be problematic if we want to make the latency as small as

--

From: Anthony Liguori
Date: Friday, April 23, 2010 - 6:22 am

Copying often has strange CPU characteristics depending on whether the 
data is already in cache.  It's better to drive these sort of 
optimizations through performance measurement because changes are not 
always obvious.

Regards,

Anthony Liguori

--

From: Avi Kivity
Date: Friday, April 23, 2010 - 6:48 am

Copying always introduces more cache pollution, so even if the data is 
in the cache, it is worthwhile (not disagreeing with the need to measure).

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Yoshiaki Tamura
Date: Monday, May 3, 2010 - 2:32 am

Anthony,

I measured how long it takes to send all guest pages during migration, and I
would like to share the information in this message.  For convenience, I
modified the code to do plain migration rather than "live migration", which
means the buffered file is not used here.

In summary, the performance improvement from using writev instead of write/send
over GbE seems to be negligible.  However, when the underlying network was
fast (InfiniBand with IPoIB in this case), writev performed 17% faster than
write/send, and therefore it may be worthwhile to introduce vectors.
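
As a rough illustration of the two code paths being compared, the following is a minimal sketch, not the code from the patch.  The names send_pages_writev(), send_pages_write(), SKETCH_PAGE_SIZE and SKETCH_MAX_IOV are hypothetical and exist only for this example; writev(), write() and struct iovec are the standard POSIX interfaces.  The sketch assumes the file descriptor is blocking and ignores short writes.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define SKETCH_PAGE_SIZE 4096  /* illustrative page size (assumption) */
#define SKETCH_MAX_IOV   64    /* illustrative batch limit (assumption) */

/* Gather path: describe every page with an iovec entry and hand the
 * whole batch to the kernel in a single writev() call, without first
 * copying the pages into an intermediate linear buffer.
 * Returns the number of bytes written, or -1 on error. */
static ssize_t send_pages_writev(int fd, unsigned char *const pages[],
                                 int npages)
{
    struct iovec iov[SKETCH_MAX_IOV];
    int i;

    assert(npages > 0 && npages <= SKETCH_MAX_IOV);
    for (i = 0; i < npages; i++) {
        iov[i].iov_base = pages[i];
        iov[i].iov_len = SKETCH_PAGE_SIZE;
    }
    return writev(fd, iov, npages);
}

/* Baseline path: one write() system call per page.
 * Short writes are not retried in this sketch. */
static ssize_t send_pages_write(int fd, unsigned char *const pages[],
                                int npages)
{
    ssize_t total = 0;
    int i;

    assert(npages > 0);
    for (i = 0; i < npages; i++) {
        ssize_t n = write(fd, pages[i], SKETCH_PAGE_SIZE);
        if (n < 0)
            return -1;
        total += n;
    }
    return total;
}
```

For a batch of N pages the gather path issues one system call where the baseline issues N, which matches the direction of the writev and write counts reported in the results in this message.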

Since QEMU compresses pages, I copied a junk file to tmpfs to dirty pages so
that QEMU would transfer a fair number of pages.  After setting up the guest, I
used cpu_get_real_ticks() to measure the time spent in the while loop calling
ram_save_block() in ram_save_live().  I removed the qemu_file_rate_limit() call
to disable the rate limiting of the buffered file, so that all of the pages
would be transferred in the first round.

I measured 10 times for each case, and took the average and standard deviation.
Considering the results, I think the number of trials was enough.  In addition
to the time duration, the number of writev/write calls and the number of pages
which were compressed (dup)/not compressed (nodup) are shown.

Test Environment:
CPU: 2x Intel Xeon Dual Core 3GHz
Mem size: 6GB
Network: GbE, InfiniBand (IPoIB)

Host OS: Fedora 11 (kernel 2.6.34-rc1)
Guest OS: Fedora 11 (kernel 2.6.33)
Guest Mem size: 512MB

* GbE writev
time (sec): 35.732 (std 0.002)
write count: 4 (std 0)
writev count: 8269 (std 1)
dup count: 36157 (std 124)
nodup count: 1016808 (std 147)

* GbE write
time (sec): 35.780 (std 0.164)
write count: 127367 (std 21)
writev count: 0 (std 0)
dup count: 36134 (std 108)
nodup count: 1016853 (std 165)

* IPoIB writev
time (sec): 13.889 (std 0.155)
write count: 4 (std 0)
writev count: 8267 (std 1)
dup count: 36147 (std 105)
nodup count: 1016838 (std 111)

* IPoIB write
time (sec): 16.777 (std 0.239)
write count: 127364 (std 24)
writev ...
From: Anthony Liguori
Date: Monday, May 3, 2010 - 5:05 am

Okay.  It looks like it's clear that it's a win so let's split it out of 
the main series and we'll treat it separately.  I imagine we'll see even 
more positive results on 10 gbit and particularly if we move migration 
out into a separate thread.

Regards,


--

From: Yoshiaki Tamura
Date: Monday, May 3, 2010 - 8:36 am

Great!
I also wanted to test with 10GE but I'm physically away from my office
now, and can't set up the test environment.  I'll measure the numbers
w/ 10GE next week.

BTW, I was thinking of writing a patch to use separate threads for both
the sender and the receiver of migration.  Kemari especially needs a
separate receiver thread, so that the monitor can accept commands from
other HA tools.  Is someone already working on this?  If not, I would add
it to my task list :-)

Thanks,

--

From: Anthony Liguori
Date: Monday, May 3, 2010 - 9:07 am

So far, no one (to my knowledge at least), is working on this.

Regards,


--

From: Yoshiaki Tamura
Date: Monday, April 26, 2010 - 3:43 am

I agree.

--

From: Yoshiaki Tamura
Date: Tuesday, April 20, 2010 - 10:57 pm

This code implements the VM transaction protocol.  Like buffered_file, it
sits between savevm and the migration layer.  With this architecture, the VM
transaction protocol is implemented mostly independently of other
existing code.

Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
Signed-off-by: OHMURA Kei <ohmura.kei@lab.ntt.co.jp>
---
 Makefile.objs    |    1 +
 ft_transaction.c |  423 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 ft_transaction.h |   57 ++++++++
 migration.c      |    3 +
 4 files changed, 484 insertions(+), 0 deletions(-)
 create mode 100644 ft_transaction.c
 create mode 100644 ft_transaction.h

diff --git a/Makefile.objs b/Makefile.objs
index b73e2cb..4388fb3 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -78,6 +78,7 @@ common-obj-y += qemu-char.o savevm.o #aio.o
 common-obj-y += msmouse.o ps2.o
 common-obj-y += qdev.o qdev-properties.o
 common-obj-y += qemu-config.o block-migration.o
+common-obj-y += ft_transaction.o
 
 common-obj-$(CONFIG_BRLAPI) += baum.o
 common-obj-$(CONFIG_POSIX) += migration-exec.o migration-unix.o migration-fd.o
diff --git a/ft_transaction.c b/ft_transaction.c
new file mode 100644
index 0000000..d0cbc99
--- /dev/null
+++ b/ft_transaction.c
@@ -0,0 +1,423 @@
+/*
+ * Fault tolerant VM transaction QEMUFile
+ *
+ * Copyright (c) 2010 Nippon Telegraph and Telephone Corporation. 
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ * This source code is based on buffered_file.c.
+ * Copyright IBM, Corp. 2008
+ * Authors:
+ *  Anthony Liguori        <aliguori@us.ibm.com>
+ */
+
+#include "qemu-common.h"
+#include "hw/hw.h"
+#include "qemu-timer.h"
+#include "sysemu.h"
+#include "qemu-char.h"
+#include "ft_transaction.h"
+
+// #define DEBUG_FT_TRANSACTION
+
+typedef struct QEMUFileFtTranx
+{
+    FtTranxPutBufferFunc *put_buffer;
+    FtTranxPutVectorFunc *put_vector;
+    FtTranxGetBufferFunc *get_buffer;
+    ...

Introduce RAMSaveIO to use writev for saving ram blocks, and modify
ram_save_block() and ram_save_remaining() to use
cpu_physical_memory_get_dirty_range() to check multiple dirty and
non-dirty pages at once.

Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
Signed-off-by: OHMURA Kei <ohmura.kei@lab.ntt.co.jp>
---
 vl.c |  221 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 files changed, 197 insertions(+), 24 deletions(-)

diff --git a/vl.c b/vl.c
index 729c955..9c3dc4c 100644
--- a/vl.c
+++ b/vl.c
@@ -2774,12 +2774,167 @@ static int is_dup_page(uint8_t *page, uint8_t ch)
     return 1;
 }
 
-static int ram_save_block(QEMUFile *f)
+typedef struct RAMSaveIO RAMSaveIO;
+
+struct RAMSaveIO {
+    QEMUFile *f;
+    QEMUIOVector *qiov;
+
+    uint8_t *ram_store;
+    size_t nalloc, nused;
+    uint8_t io_mode;
+
+    void (*put_buffer)(RAMSaveIO *s, uint8_t *buf, size_t len);
+    void (*put_byte)(RAMSaveIO *s, int v);
+    void (*put_be64)(RAMSaveIO *s, uint64_t v);
+
+};
+
+static inline void ram_saveio_flush(RAMSaveIO *s, int prepare)
+{
+    qemu_put_vector(s->f, s->qiov);
+    if (prepare)
+        qemu_put_vector_prepare(s->f);
+
+    /* reset stored data */
+    qemu_iovec_reset(s->qiov);
+    s->nused = 0;
+}
+
+static inline void ram_saveio_put_buffer(RAMSaveIO *s, uint8_t *buf, size_t len)
+{
+    s->put_buffer(s, buf, len);
+}
+
+static inline void ram_saveio_put_byte(RAMSaveIO *s, int v)
+{
+    s->put_byte(s, v);
+}
+
+static inline void ram_saveio_put_be64(RAMSaveIO *s, uint64_t v)
+{
+    s->put_be64(s, v);
+}
+
+static inline void ram_saveio_set_error(RAMSaveIO *s)
+{
+    qemu_file_set_error(s->f);
+}
+
+static void ram_saveio_put_buffer_vector(RAMSaveIO *s, uint8_t *buf, size_t len)
+{
+    qemu_iovec_add(s->qiov, buf, len);
+}
+
+static void ram_saveio_put_buffer_direct(RAMSaveIO *s, uint8_t *buf, size_t len)
+{
+    qemu_put_buffer(s->f, buf, len);
+}
+
+static void ...
From: Dor Laor
Date: Thursday, April 22, 2010 - 1:58 am

IMO any type of network event should be stalled too. What if the VM runs
a non-tcp protocol and the packet that the master node sent reached some
remote client and before the sync to the slave the master failed?


Why do you specifically care about the tsc sync? When you sync all the 
IO model on snapshot it also synchronizes the tsc.

In general, can you please explain the 'algorithm' for continuous
snapshots (is that what you would like to do?):
A trivial one would be to:
  - do X online snapshots/sec
  - Stall all IO (disk/block) from the guest to the outside world
    until the previous snapshot reaches the slave.
  - Snapshots are made of
    - diff of dirty pages from last snapshot
    - Qemu device model (+kvm's) diff from last.
You can do 'light' snapshots in between to send dirty pages to reduce 
snapshot time.
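
The trivial scheme listed above can be sketched roughly as follows.  This is only an illustration of the dirty-page diff and the I/O stall, under the assumption of a tiny fixed-size guest; struct sketch_vm and all sketch_* functions are hypothetical names and do not correspond to anything in QEMU or in the Kemari patches.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_NPAGES 8  /* illustrative guest size in pages (assumption) */

struct sketch_vm {
    unsigned char dirty[SKETCH_NPAGES]; /* 1 = written since last snapshot */
    int io_stalled;                     /* 1 = guest output is held back */
};

/* The guest writes to a page: record it in the dirty log. */
static void sketch_guest_write(struct sketch_vm *vm, size_t page)
{
    assert(page < SKETCH_NPAGES);
    vm->dirty[page] = 1;
}

/* Take a snapshot: collect the pages dirtied since the previous
 * snapshot into out[], clear the dirty log, and stall guest I/O
 * until the slave acknowledges.  Returns the size of the diff. */
static size_t sketch_take_snapshot(struct sketch_vm *vm,
                                   size_t out[SKETCH_NPAGES])
{
    size_t i, n = 0;

    for (i = 0; i < SKETCH_NPAGES; i++) {
        if (vm->dirty[i])
            out[n++] = i;
    }
    memset(vm->dirty, 0, sizeof(vm->dirty));
    vm->io_stalled = 1;
    return n;
}

/* The snapshot reached the slave: guest I/O may flow again. */
static void sketch_slave_ack(struct sketch_vm *vm)
{
    vm->io_stalled = 0;
}
```

A page written twice between snapshots appears only once in the diff, which is why the snapshot cost follows the working set rather than the number of writes.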

I wrote the above to serve as a reference for your comments so it will map

--

From: Yoshiaki Tamura
Date: Thursday, April 22, 2010 - 3:35 am

In the current implementation, it is actually stalling any type of network
traffic that goes through virtio-net.

However, if the application was using unreliable protocols, it should have its 

Yes, of course.

I don't have good numbers that I can share right now.
Snapshots/sec depends on what kind of workload is running; if the guest is
almost idle, there will be no snapshots in 5 sec.  On the other hand, if the
guest is running I/O intensive workloads (netperf, iozone for example), there



This also depends on the workload.

We're currently sending a full copy because we're completely reusing this part
of the existing live migration framework.

Last time we measured, it was about 13KB.


Thank you for the guidance.
I hope this answers your question.

At the same time, I would also be happy if we could discuss how to implement
it too.  In fact, we needed a hack to prevent rip from proceeding in KVM, which
turned out not to be the best workaround.

Thanks,


--

From: Takuya Yoshikawa
Date: Thursday, April 22, 2010 - 4:36 am

50 is too small: this depends on the synchronization speed and does not
show how many snapshots we need, right?
--

From: Yoshiaki Tamura
Date: Thursday, April 22, 2010 - 5:35 am

No it doesn't.
It's example data that I measured before.
--

From: Dor Laor
Date: Thursday, April 22, 2010 - 5:19 am

Why do you treat tcp differently? You can damage the entire VM this way
- think of a dhcp request that was dropped at the moment you switched

So, do you agree that an extra clock synchronization is not needed since 

The hardest would be memory intensive loads.
So 100 snap/sec means latency of 10msec right?
(not that it's not ok, with faster hw and IB you'll be able to get much 

There are brute force solutions like
- stop the guest until you send all of the snapshot to the remote (like
   standard live migration)
- Stop + fork + cont the father

Or mark the recent dirty pages that were not sent to the remote as write 

--

From: Yoshiaki Tamura
Date: Thursday, April 22, 2010 - 6:16 am

I'm not trying to say that we should treat tcp differently, but just
that it's severe.
In case of dhcp request, the client would have a chance to retry after
failover, correct?
BTW, in current implementation, it's synchronizing before dhcp ack is sent.
But in case of tcp, once you send ack to the client before sync, there

I agree that it's sent as part of the live migration.
What I wanted to say here is that this is not something for real time
applications.
I usually get questions like can this guarantee fault tolerance for

Doesn't 100 snap/sec mean the interval of snap is 10msec?
IIUC, to get the latency, you need to get, Time to transfer VM + Time
to get response from the receiver.
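
The distinction drawn here between the snapshot interval and the latency of one synchronization can be written down as a small worked example.  The helper names below are hypothetical, and any transfer or response times passed to them are illustrative inputs, not measurements from this thread.

```c
#include <assert.h>

/* Interval between snapshots, in microseconds, for a given rate. */
static long sketch_interval_us(long snaps_per_sec)
{
    assert(snaps_per_sec > 0);
    return 1000000L / snaps_per_sec;
}

/* Latency added by one synchronization: the time to transfer the VM
 * state plus the time to get the response from the receiver. */
static long sketch_sync_latency_us(long transfer_us, long response_us)
{
    assert(transfer_us >= 0 && response_us >= 0);
    return transfer_us + response_us;
}
```

For instance, sketch_interval_us(100) gives 10000, i.e. 10 msec between snapshots, while the latency seen by a stalled I/O depends only on the two terms of sketch_sync_latency_us() and can be shorter or longer than that interval.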

It's hard to say which load is the hardest.
Memory intensive loads, which don't generate I/O often, will suffer from a
long sync time at that moment, but would have chances to continue
processing until sync.
I/O intensive loads, which don't dirty many pages, will suffer from


I think I had that suggestion from Avi before.
And yes, it's very fascinating.

Meanwhile, if you look at the diffstat, it needed to touch many parts of QEMU.
Before going into further implementation, I wanted to check that I'm
--

From: Anthony Liguori
Date: Thursday, April 22, 2010 - 1:33 pm

I'm slightly confused about the current implementation vs. my 
recollection of the original paper with Xen.  I had thought that all 
disk and network I/O was buffered in such a way that at each checkpoint, 
the I/O operations would be released in a burst.  Otherwise, you would 
have to synchronize after every I/O operation which is what it seems the 
current implementation does.  I'm not sure how that is accomplished 
atomically though since you could have a completed I/O operation 
duplicated on the slave node provided it didn't notify completion prior 
to failure.
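
The buffering scheme recalled above (hold outputs, release them in a burst once a checkpoint is acknowledged) could look roughly like the sketch below.  It illustrates that recollection only; it is not what the posted patches do, and struct sketch_out_queue and the sketch_* functions are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_QUEUE_MAX 16  /* illustrative capacity (assumption) */

struct sketch_out_queue {
    int pending[SKETCH_QUEUE_MAX]; /* ids of outputs since last checkpoint */
    size_t npending;
};

/* The guest produced an output (packet, disk write): hold it back
 * instead of emitting it.  Returns -1 when the queue is full, in
 * which case the caller would have to checkpoint first. */
static int sketch_hold_output(struct sketch_out_queue *q, int id)
{
    assert(q != NULL);
    if (q->npending == SKETCH_QUEUE_MAX)
        return -1;
    q->pending[q->npending++] = id;
    return 0;
}

/* The slave acknowledged a checkpoint covering all held outputs:
 * release them in order as one burst.  Returns how many were released. */
static size_t sketch_release_on_ack(struct sketch_out_queue *q,
                                    int out[SKETCH_QUEUE_MAX])
{
    size_t i, n;

    assert(q != NULL);
    n = q->npending;
    for (i = 0; i < n; i++)
        out[i] = q->pending[i];
    q->npending = 0;
    return n;
}

/* The master failed before the acknowledgement: the held outputs
 * were never visible outside, so they are simply dropped. */
static size_t sketch_drop_on_failure(struct sketch_out_queue *q)
{
    size_t n;

    assert(q != NULL);
    n = q->npending;
    q->npending = 0;
    return n;
}
```

With this structure, nothing leaves the master unless the slave already holds the state that produced it, at the cost of delaying every output by up to one checkpoint period.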

Is there another kemari component that somehow handles buffering I/O 
that is not obvious from these patches?

Regards,

Anthony Liguori


--

From: Yoshiaki Tamura
Date: Thursday, April 22, 2010 - 6:53 pm

Yes, you're almost right.
It's synchronizing before QEMU starts emulating I/O at each device model.
It was originally designed that way to avoid complexity of introducing buffering 

That's exactly the point I wanted to discuss.
Currently, we're calling vm_stop(0), qemu_aio_flush() and bdrv_flush_all() 
before qemu_save_state_all() in ft_tranx_ready(), to ensure outstanding I/O is 
complete.  I mimicked what existing live migration is doing.

No, I'm not hiding anything, and I would share any information regarding Kemari 
to develop it in this community :-)

Thanks,

--

From: Anthony Liguori
Date: Friday, April 23, 2010 - 6:20 am

If NodeA is the master and NodeB is the slave, if NodeA sends a network 
packet, you'll checkpoint before the packet is actually sent, and then 
if a failure occurs before the next checkpoint, won't that result in 
both NodeA and NodeB sending out a duplicate version of the packet?

Regards,

Anthony Liguori

--

From: Yoshiaki Tamura
Date: Monday, April 26, 2010 - 3:44 am

Yes.  But I think it's better than taking the checkpoint after.

If we checkpoint after sending the packet, let's say it sent a TCP ACK to the
client, and if a hardware failure occurred on NodeA during the transaction *but
the client received the TCP ACK*, NodeB will resume from the previous state, and
it may need to receive some data from the client. However, because the client
has already received the TCP ACK, it won't resend the data to NodeB.  It looks
like this data is going to be dropped.

Anyway, I've just started planning to move the sync point to the network/block
layer, and I will post the result for discussion again.
--

From: Dor Laor
Date: Thursday, April 22, 2010 - 1:38 pm

What if the guest is running a dhcp server? If we provide an IP to a
client and then fail to the secondary that will run without knowing the

First, the huge cost of snapshots won't suit any real time app.
Second, even if it wasn't the case, the tsc delta and kvmclock are 
synchronized as part of the VM state so there is no use of trapping it 

--

From: Yoshiaki Tamura
Date: Thursday, April 22, 2010 - 10:17 pm

That's problematic.  So it needs to sync when dhcp ack is sent.

I should apologize for my misunderstanding and explanation.  I agree that we 


I should study the clock in KVM, but won't tsc get updated by the HW after 
migration?
I was wondering about the following case for example:

1. The application on the guest calls rdtsc on host A.
2. The application uses rdtsc value for something.
3. Failover to host B.
4. The application on the guest replays the rdtsc call on host B.
5. If the rdtsc value is different between A and B, the application may get into 
trouble because of it.


--

From: Fernando Luis Vázquez Cao
Date: Friday, April 23, 2010 - 12:36 am

Regarding the TSC, we need to guarantee that the guest sees a monotonic
TSC after migration, which can be achieved by adjusting the TSC offset properly.
Besides, we also need a trapping TSC, so that we can tackle the case where the
primary node and the standby node have different TSC frequencies.
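
The offset adjustment mentioned above can be illustrated with a small sketch.  It assumes the usual model in which the guest-visible TSC is the host TSC plus a per-VM offset; the function names are hypothetical, and the sketch deliberately ignores the frequency mismatch case, which is the part that would need trapping.

```c
#include <assert.h>
#include <stdint.h>

/* Guest-visible TSC under the "host TSC plus offset" model (assumption). */
static uint64_t sketch_guest_tsc(uint64_t host_tsc, uint64_t offset)
{
    return host_tsc + offset;
}

/* Pick the offset on the standby node so that the guest TSC resumes
 * from the last value it observed on the primary.  Unsigned arithmetic
 * wraps, so this also works when the new host TSC is the larger one. */
static uint64_t sketch_offset_after_failover(uint64_t last_guest_tsc,
                                             uint64_t new_host_tsc)
{
    return last_guest_tsc - new_host_tsc;
}

/* Sanity check a pair of consecutive guest readings. */
static void sketch_assert_monotonic(uint64_t prev, uint64_t now)
{
    assert(now >= prev);
}
```

If the guest last saw 1500 on the primary and the standby's host TSC reads 90000 at takeover, the computed offset makes the first guest reading on the standby 1500 again, and later readings only increase from there.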
--

From: Dor Laor
Date: Sunday, April 25, 2010 - 2:52 pm

You're right, but this is already taken care of by the normal save/restore
process. Check the void kvm_load_tsc(CPUState *env) function.

--

From: Jamie Lokier
Date: Thursday, April 22, 2010 - 9:15 am

Even with unreliable protocols, if slave takeover causes the receiver
to have received a packet that the sender _does not think it has ever
sent_, expect some protocols to break.

If the slave replaying master's behaviour since the last sync means it
will definitely get into the same state of having sent the packet,
that works out.

But you still have to be careful that the other end's responses to
that packet are not seen by the slave too early during that replay.
Otherwise, for example, the slave may observe a TCP ACK to a packet
that it hasn't yet sent, which is an error.

About IP idempotency:

In general, IP packets are allowed to be lost or duplicated in the
network.  All IP protocols should be prepared for that; it is a basic
property.

However there is one respect in which they're not idempotent:

The TTL field should be decreased if packets are delayed.  Packets
should not appear to live in the network for longer than TTL seconds.
If they do, some protocols (like TCP) can react to the delayed ones
differently, such as sending a RST packet and breaking a connection.

It is acceptable to reduce TTL faster than the minimum.  After all, it

That is a really satisfying number, thank you :-)

Without this work I wouldn't have imagined that synchronised machines
could work with such a low transaction rate.

-- Jamie
--

From: Yoshiaki Tamura
Date: Thursday, April 22, 2010 - 5:20 pm

Even though the current implementation syncs just before network output, what
you pointed out could happen.  In this case, would the connection be lost, or
would the client/server recover from it?  If the latter, it would be fine,
otherwise I wonder

So the problem is, when the slave takes over, it sends a packet with same TTL 

Thank you for your comments.

Although I haven't prepared good data yet, I personally prefer to have the
discussion based on an actual implementation and experimental data.
--

From: Jamie Lokier
Date: Friday, April 23, 2010 - 8:07 am

In the case of TCP in a "synchronised state", I think it will recover
according to the rules in RFC793.  In an "unsynchronised state"
(during connection), I'm not sure if it recovers or if it looks like a
"Connection reset" error.  I suspect it does recover but I'm not certain.

But that's TCP.  Other protocols, such as over UDP, may behave
differently, because this is not an anticipated behaviour of a

Yes.  I guess this is a general problem with time-based protocols and
virtual machines getting stopped for 1 minute (say), without knowing
that real time has moved on for the other nodes.

Some application transaction, caching and locking protocols will give
wrong results when their time assumptions are discontinuous to such a
large degree.  It's a bit nasty to impose that on them after they
worked so hard on their reliability :-)

However, I think such implementations _could_ be made safe if those
programs can arrange to definitely be interrupted with a signal when
the discontinuity happens.  Of course, only if they're aware they may
be running on a Kemari system...

I have an intuitive idea that there is a solution to that, but each
time I try to write the next paragraph explaining it, some little
complication crops up and it needs more thought.  Something about
concurrent, asynchronous transactions to keep the master running while
recording the minimum states that replay needs to be safe, while
slewing the replaying slave's virtual clock back to real time quickly
during recovery mode.

-- Jamie
--

From: Anthony Liguori
Date: Thursday, April 22, 2010 - 12:42 pm

This series looks quite nice!

I think it would make sense to separate out the things that are actually 
optimizations (like the dirty bitmap changes and the writev/readv 
changes) and to attempt to justify them with actual performance data.

I'd prefer not to modify the live migration protocol ABI and it doesn't 
seem to be necessary if we're willing to add options to the -incoming 
flag.  We also want to be a bit more generic with respect to IO.  
Otherwise, the series looks very close to being mergeable.

Regards,

Anthony Liguori
--

From: Yoshiaki Tamura
Date: Thursday, April 22, 2010 - 5:45 pm

I agree with the separation plan.

For the dirty bitmap change, Avi and I discussed a patchset for upstream QEMU
while you were offline (sorry if I was wrong).  Could you also take a look?

http://lists.gnu.org/archive/html/qemu-devel/2010-04/msg01396.html

Regarding writev, I agree that it should be backed with actual data, otherwise 
it should be removed.  We attempted to do everything that may reduce the overhead

I totally agree with your approach not to change the protocol ABI.  Can we add 
an option to -incoming?  Like -incoming ft_mode, for example?

Thank you for your comment on each patch.

To be honest, I wasn't that confident because I'm a newbie to KVM/QEMU and
struggled with how to implement this in an acceptable way.

Thanks,


--

From: Anthony Liguori
Date: Friday, April 23, 2010 - 6:10 am

Yes, I've seen it and I don't disagree.  That said, there ought to be 
perf data in the commit log so that down the road, the justification is 

The series looks very good.  I'm eager to see this functionality merged.

Regards,


--

From: Avi Kivity
Date: Friday, April 23, 2010 - 6:24 am

We discussed moving the barrier to the actual output device, instead of 
the I/O port.  This allows you to complete the I/O transaction before 
starting synchronization.

Does it not work for some reason?

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Yoshiaki Tamura
Date: Monday, April 26, 2010 - 3:44 am

Sorry, I've just started working on that.
I've posted this series to share what I have done so far.
Thanks for looking.
--