Hi all,

We have been implementing the prototype of Kemari for KVM, and we're sending this message to share what we have now and our TODO lists. We would like to get early feedback to keep us headed in the right direction. Although the advanced approaches in the TODO lists are fascinating, we would like to run this project step by step while absorbing comments from the community. The current code is based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.

For those who are new to Kemari for KVM, please take a look at the following RFC which we posted last year.

http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html

The transmission/transaction protocol and most of the control logic are implemented in QEMU. However, we needed a hack in KVM to prevent rip from proceeding before synchronizing VMs. It may also need some plumbing on the kernel side to guarantee replayability of certain events and instructions, to integrate the RAS capabilities of newer x86 hardware with the HA stack, and for optimization purposes, for example.

Before going into details, we would like to show how Kemari looks. We prepared a demonstration video at the location below. For those who are not interested in the code, please take a look. The demonstration scenario is:

1. Play with a guest VM that has virtio-blk and virtio-net.
   # The guest image should be on NFS/SAN.
2. Start Kemari to synchronize the VM by running the following command in QEMU. Just add the "-k" option to the usual migrate command.
   migrate -d -k tcp:192.168.0.20:4444
3. Check the status by calling info migrate.
4. Go back to the VM to play the chess animation.
5. Kill the VM. (The VNC client also disappears.)
6. Press "c" to continue the VM on the other host.
7. Bring up the VNC client. (Sorry, it pops up outside of the video capture.)
8. Confirm that the chess animation ends and the browser works fine, then shut down.

http://www.osrg.net/kemari/download/kemari-kvm-fc11.mov

The repository contains all patches we're sending with ...
When the -k option is passed to the migrate command, turn on ft_mode to
start FT migration mode (Kemari).
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
---
migration.c | 3 +++
qemu-monitor.hx | 7 ++++---
2 files changed, 7 insertions(+), 3 deletions(-)
diff --git a/migration.c b/migration.c
index c81fdb4..b288e82 100644
--- a/migration.c
+++ b/migration.c
@@ -109,6 +109,9 @@ int do_migrate(Monitor *mon, const QDict *qdict, QObject **ret_data)
return -1;
}
+ if (qdict_get_int(qdict, "ft"))
+ ft_mode = FT_INIT;
+
if (strstart(uri, "tcp:", &p)) {
s = tcp_start_outgoing_migration(mon, p, max_throttle, detach,
(int)qdict_get_int(qdict, "blk"),
diff --git a/qemu-monitor.hx b/qemu-monitor.hx
index 16c45b7..22b72d9 100644
--- a/qemu-monitor.hx
+++ b/qemu-monitor.hx
@@ -765,13 +765,14 @@ ETEXI
{
.name = "migrate",
- .args_type = "detach:-d,blk:-b,inc:-i,uri:s",
- .params = "[-d] [-b] [-i] uri",
+ .args_type = "detach:-d,blk:-b,inc:-i,ft:-k,uri:s",
+ .params = "[-d] [-b] [-i] [-k] uri",
.help = "migrate to URI (using -d to not wait for completion)"
"\n\t\t\t -b for migration without shared storage with"
" full copy of disk\n\t\t\t -i for migration without "
"shared storage with incremental copy of disk "
- "(base image shared between src and destination)",
+ "(base image shared between src and destination)"
+ "\n\t\t\t -k for FT migration mode (Kemari)",
.user_print = monitor_user_noop,
.mhandler.cmd_new = do_migrate,
},
--
1.7.0.31.g1df487
--
To use the ft_transaction functions, savevm needs to export some
interfaces.
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
---
hw/hw.h | 5 +++++
savevm.c | 41 +++++++++++++++++++++++++++++++++++++++++
2 files changed, 46 insertions(+), 0 deletions(-)
diff --git a/hw/hw.h b/hw/hw.h
index 10e6dda..fcee660 100644
--- a/hw/hw.h
+++ b/hw/hw.h
@@ -70,6 +70,8 @@ QEMUFile *qemu_fopen_ops(void *opaque, QEMUFilePutBufferFunc *put_buffer,
QEMUFile *qemu_fopen(const char *filename, const char *mode);
QEMUFile *qemu_fdopen(int fd, const char *mode);
QEMUFile *qemu_fopen_socket(int fd);
+QEMUFile *qemu_fopen_transaction(int fd);
+QEMUFile *qemu_fopen_tranx_sender(void *opaque);
QEMUFile *qemu_popen(FILE *popen_file, const char *mode);
QEMUFile *qemu_popen_cmd(const char *command, const char *mode);
int qemu_stdio_fd(QEMUFile *f);
@@ -81,6 +83,9 @@ void qemu_put_vector(QEMUFile *f, QEMUIOVector *qiov);
void qemu_put_vector_prepare(QEMUFile *f);
void *qemu_realloc_buffer(QEMUFile *f, int size);
void qemu_clear_buffer(QEMUFile *f);
+int qemu_transaction_begin(QEMUFile *f);
+int qemu_transaction_commit(QEMUFile *f);
+int qemu_transaction_cancel(QEMUFile *f);
static inline void qemu_put_ubyte(QEMUFile *f, unsigned int v)
{
diff --git a/savevm.c b/savevm.c
index a401b27..292ae32 100644
--- a/savevm.c
+++ b/savevm.c
@@ -82,6 +82,7 @@
#include "migration.h"
#include "qemu_socket.h"
#include "qemu-queue.h"
+#include "ft_transaction.h"
/* point to the block driver where the snapshots are managed */
static BlockDriverState *bs_snapshots;
@@ -210,6 +211,21 @@ static int socket_get_buffer(void *opaque, uint8_t *buf, int64_t pos, int size)
return len;
}
+static ssize_t socket_put_buffer(void *opaque, const void *buf, size_t size)
+{
+ QEMUFileSocket *s = opaque;
+ ssize_t len;
+
+ do {
+ len = send(s->fd, (void *)buf, size, 0);
+ } while (len == -1 && socket_error() == EINTR);
+
+ if (len == -1)
+ ...
QEMUFile currently doesn't support writev(). When sending multiple
pieces of data, such as pages, using writev() should be more efficient.
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
---
buffered_file.c | 2 +-
hw/hw.h | 16 ++++++++++++++++
savevm.c | 43 +++++++++++++++++++++++++------------------
3 files changed, 42 insertions(+), 19 deletions(-)
diff --git a/buffered_file.c b/buffered_file.c
index 54dc6c2..187d1d4 100644
--- a/buffered_file.c
+++ b/buffered_file.c
@@ -256,7 +256,7 @@ QEMUFile *qemu_fopen_ops_buffered(void *opaque,
s->wait_for_unfreeze = wait_for_unfreeze;
s->close = close;
- s->file = qemu_fopen_ops(s, buffered_put_buffer, NULL,
+ s->file = qemu_fopen_ops(s, buffered_put_buffer, NULL, NULL, NULL,
buffered_close, buffered_rate_limit,
buffered_set_rate_limit,
buffered_get_rate_limit);
diff --git a/hw/hw.h b/hw/hw.h
index fc9ed29..921cf90 100644
--- a/hw/hw.h
+++ b/hw/hw.h
@@ -23,6 +23,13 @@
typedef int (QEMUFilePutBufferFunc)(void *opaque, const uint8_t *buf,
int64_t pos, int size);
+/* This function writes a chunk of vector to a file at the given position.
+ * The pos argument can be ignored if the file is only being used for
+ * streaming.
+ */
+typedef int (QEMUFilePutVectorFunc)(void *opaque, struct iovec *iov,
+ int64_t pos, int iovcnt);
+
/* Read a chunk of data from a file at the given position. The pos argument
* can be ignored if the file is only be used for streaming. The number of
* bytes actually read should be returned.
@@ -30,6 +37,13 @@ typedef int (QEMUFilePutBufferFunc)(void *opaque, const uint8_t *buf,
typedef int (QEMUFileGetBufferFunc)(void *opaque, uint8_t *buf,
int64_t pos, int size);
+/* Read a chunk of vector from a file at the given position. The pos argument
+ * can be ignored if ...
Is there performance data that backs this up? Since QEMUFile uses a linear buffer for most operations that's limited to 16k, I suspect you wouldn't be able to observe a difference in practice.

Regards,
--
I currently don't have data, but I'll prepare it. There were two things I wanted to avoid.

1. Pages being copied to the QEMUFile buf through qemu_put_buffer.
2. Calling write() every time, even when we want to send multiple pages at once.

I think 2 may be negligible. But 1 seems to be problematic if we want to make the latency as small as
--
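The two send paths being compared can be sketched outside of QEMU with plain POSIX calls. This is only an illustration under stated assumptions: send_pages_writev(), send_pages_write() and SKETCH_MAX_IOV are hypothetical names that do not appear in the patch, the batch size is assumed to be below the platform's IOV_MAX, and partial writes are not retried.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define SKETCH_MAX_IOV 64 /* hypothetical batch size, assumed below IOV_MAX */

/* Gather npages page pointers into an iovec array and hand them to the
 * kernel with a single writev() call; no page is copied to a staging
 * buffer first. Returns bytes written or -1 on error. */
static ssize_t send_pages_writev(int fd, uint8_t *const pages[],
                                 size_t npages, size_t page_size)
{
    struct iovec iov[SKETCH_MAX_IOV];
    size_t i;

    assert(npages <= SKETCH_MAX_IOV);
    for (i = 0; i < npages; i++) {
        iov[i].iov_base = pages[i];
        iov[i].iov_len = page_size;
    }
    return writev(fd, iov, (int)npages);
}

/* Baseline for comparison: one write() system call per page. */
static ssize_t send_pages_write(int fd, uint8_t *const pages[],
                                size_t npages, size_t page_size)
{
    ssize_t total = 0;
    size_t i;

    for (i = 0; i < npages; i++) {
        ssize_t n = write(fd, pages[i], page_size);
        if (n < 0) {
            return -1;
        }
        total += n;
    }
    return total;
}
```

Both functions put the same bytes on the wire in the same order. The difference is the number of system calls, and whether the caller has to stage the pages in an intermediate buffer beforehand.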
Copying often has strange CPU characteristics depending on whether the data is already in cache. It's better to drive these sorts of optimizations through performance measurement because changes are not always obvious. Regards, Anthony Liguori --
Copying always introduces more cache pollution, so even if the data is in the cache, it is worthwhile (not disagreeing with the need to measure). -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
Anthony,

I measured how long it takes to send all guest pages during migration, and I would like to share the results in this message. For convenience, I modified the code to do plain migration, not "live migration", which means buffered file is not used here.

In summary, the performance improvement from using writev instead of write/send over GbE seems to be negligible. However, when the underlying network was fast (InfiniBand with IPoIB in this case), writev performed 17% faster than write/send, and therefore it may be worthwhile to introduce vectors.

Since QEMU compresses pages, I copied a junk file to tmpfs to dirty pages so that QEMU would transfer a reasonable number of pages. After setting up the guest, I used cpu_get_real_ticks() to measure the time spent in the while loop calling ram_save_block() in ram_save_live(). I removed the qemu_file_rate_limit() call to disable the buffered file behavior, so that all of the pages are transferred in the first round.

I measured 10 times for each case and took the average and standard deviation. Considering the results, I think the number of trials was enough. In addition to the time duration, the number of writev/write calls and the number of pages that were compressed (dup) / not compressed (nodup) are shown.

Test Environment:
CPU: 2x Intel Xeon Dual Core 3GHz
Mem size: 6GB
Network: GbE, InfiniBand (IPoIB)
Host OS: Fedora 11 (kernel 2.6.34-rc1)
Guest OS: Fedora 11 (kernel 2.6.33)
Guest Mem size: 512MB

* GbE writev
  time (sec): 35.732 (std 0.002)
  write count: 4 (std 0)
  writev count: 8269 (std 1)
  dup count: 36157 (std 124)
  nodup count: 1016808 (std 147)

* GbE write
  time (sec): 35.780 (std 0.164)
  write count: 127367 (std 21)
  writev count: 0 (std 0)
  dup count: 36134 (std 108)
  nodup count: 1016853 (std 165)

* IPoIB writev
  time (sec): 13.889 (std 0.155)
  write count: 4 (std 0)
  writev count: 8267 (std 1)
  dup count: 36147 (std 105)
  nodup count: 1016838 (std 111)

* IPoIB write
  time (sec): 16.777 (std 0.239)
  write count: 127364 (std 24)
  writev ...
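The 17% figure can be checked from the averages reported above, reading it as the fraction of transfer time saved. A small sketch of that arithmetic follows; time_saved_fraction() is a hypothetical helper used only for this check and is not part of the series.

```c
#include <assert.h>

/* Fraction of the baseline time that is saved when the same transfer
 * takes new_sec instead of base_sec. Hypothetical helper, used only to
 * check the arithmetic on the reported averages. */
static double time_saved_fraction(double base_sec, double new_sec)
{
    assert(base_sec > 0.0);
    return (base_sec - new_sec) / base_sec;
}
```

With the IPoIB averages, time_saved_fraction(16.777, 13.889) is about 0.172, which matches the 17% quoted in the message. With the GbE averages, time_saved_fraction(35.780, 35.732) is about 0.001, smaller than the standard deviation of the write case relative to its mean.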
Okay. It looks like it's clear that it's a win so let's split it out of the main series and we'll treat it separately. I imagine we'll see even more positive results on 10 gbit and particularly if we move migration out into a separate thread. Regards, --
Great! I also wanted to test with 10GE but I'm physically away from my office now, and can't set up the test environment. I'll measure the numbers w/ 10GE next week. BTW, I was thinking of writing a patch to use separate threads for both the sender and the receiver of migration. Kemari especially needs a separate receiver thread, so that the monitor can accept commands from other HA tools. Is someone already working on this? If not, I would add it to my task list :-) Thanks, --
So far, no one (to my knowledge at least) is working on this. Regards, --
I agree. --
This code implements the VM transaction protocol. Like buffered_file, it
sits between savevm and the migration layer. With this architecture, the VM
transaction protocol is implemented mostly independently of other
existing code.
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
Signed-off-by: OHMURA Kei <ohmura.kei@lab.ntt.co.jp>
---
Makefile.objs | 1 +
ft_transaction.c | 423 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
ft_transaction.h | 57 ++++++++
migration.c | 3 +
4 files changed, 484 insertions(+), 0 deletions(-)
create mode 100644 ft_transaction.c
create mode 100644 ft_transaction.h
diff --git a/Makefile.objs b/Makefile.objs
index b73e2cb..4388fb3 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -78,6 +78,7 @@ common-obj-y += qemu-char.o savevm.o #aio.o
common-obj-y += msmouse.o ps2.o
common-obj-y += qdev.o qdev-properties.o
common-obj-y += qemu-config.o block-migration.o
+common-obj-y += ft_transaction.o
common-obj-$(CONFIG_BRLAPI) += baum.o
common-obj-$(CONFIG_POSIX) += migration-exec.o migration-unix.o migration-fd.o
diff --git a/ft_transaction.c b/ft_transaction.c
new file mode 100644
index 0000000..d0cbc99
--- /dev/null
+++ b/ft_transaction.c
@@ -0,0 +1,423 @@
+/*
+ * Fault tolerant VM transaction QEMUFile
+ *
+ * Copyright (c) 2010 Nippon Telegraph and Telephone Corporation.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ *
+ * This source code is based on buffered_file.c.
+ * Copyright IBM, Corp. 2008
+ * Authors:
+ * Anthony Liguori <aliguori@us.ibm.com>
+ */
+
+#include "qemu-common.h"
+#include "hw/hw.h"
+#include "qemu-timer.h"
+#include "sysemu.h"
+#include "qemu-char.h"
+#include "ft_transaction.h"
+
+// #define DEBUG_FT_TRANSACTION
+
+typedef struct QEMUFileFtTranx
+{
+ FtTranxPutBufferFunc *put_buffer;
+ FtTranxPutVectorFunc *put_vector;
+ FtTranxGetBufferFunc *get_buffer;
+ ...
Introduce RAMSaveIO to use writev for saving ram blocks, and modify
ram_save_block() and ram_save_remaining() to use
cpu_physical_memory_get_dirty_range() to check multiple dirty and
non-dirty pages at once.
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
Signed-off-by: OHMURA Kei <ohmura.kei@lab.ntt.co.jp>
---
vl.c | 221 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------
1 files changed, 197 insertions(+), 24 deletions(-)
diff --git a/vl.c b/vl.c
index 729c955..9c3dc4c 100644
--- a/vl.c
+++ b/vl.c
@@ -2774,12 +2774,167 @@ static int is_dup_page(uint8_t *page, uint8_t ch)
return 1;
}
-static int ram_save_block(QEMUFile *f)
+typedef struct RAMSaveIO RAMSaveIO;
+
+struct RAMSaveIO {
+ QEMUFile *f;
+ QEMUIOVector *qiov;
+
+ uint8_t *ram_store;
+ size_t nalloc, nused;
+ uint8_t io_mode;
+
+ void (*put_buffer)(RAMSaveIO *s, uint8_t *buf, size_t len);
+ void (*put_byte)(RAMSaveIO *s, int v);
+ void (*put_be64)(RAMSaveIO *s, uint64_t v);
+
+};
+
+static inline void ram_saveio_flush(RAMSaveIO *s, int prepare)
+{
+ qemu_put_vector(s->f, s->qiov);
+ if (prepare)
+ qemu_put_vector_prepare(s->f);
+
+ /* reset stored data */
+ qemu_iovec_reset(s->qiov);
+ s->nused = 0;
+}
+
+static inline void ram_saveio_put_buffer(RAMSaveIO *s, uint8_t *buf, size_t len)
+{
+ s->put_buffer(s, buf, len);
+}
+
+static inline void ram_saveio_put_byte(RAMSaveIO *s, int v)
+{
+ s->put_byte(s, v);
+}
+
+static inline void ram_saveio_put_be64(RAMSaveIO *s, uint64_t v)
+{
+ s->put_be64(s, v);
+}
+
+static inline void ram_saveio_set_error(RAMSaveIO *s)
+{
+ qemu_file_set_error(s->f);
+}
+
+static void ram_saveio_put_buffer_vector(RAMSaveIO *s, uint8_t *buf, size_t len)
+{
+ qemu_iovec_add(s->qiov, buf, len);
+}
+
+static void ram_saveio_put_buffer_direct(RAMSaveIO *s, uint8_t *buf, size_t len)
+{
+ qemu_put_buffer(s->f, buf, len);
+}
+
+static void ...
IMO any type of network event should be stalled too. What if the VM runs
a non-TCP protocol, and a packet that the master node sent reaches some
remote client, and the master fails before the sync to the slave?
Why do you specifically care about the tsc sync? When you sync all the
IO model on snapshot it also synchronizes the tsc.
In general, can you please explain the 'algorithm' for continuous
snapshots (is that what you like to do?):
A trivial one would be to:
- do X online snapshots/sec
- Stall all IO (disk/block) from the guest to the outside world
until the previous snapshot reaches the slave.
- Snapshots are made of
- diff of dirty pages from last snapshot
- Qemu device model (+kvm's) diff from last.
You can do 'light' snapshots in between to send dirty pages to reduce
snapshot time.
I wrote the above to serve as a reference for your comments so it will map
--
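The "stall all IO until the previous snapshot reaches the slave" step of the trivial algorithm above can be sketched as a small bookkeeping structure. This is an illustration only and not code from the series: SketchFtState, sketch_queue_output(), sketch_snapshot_acked() and SKETCH_QUEUE_MAX are hypothetical names, and the sketch assumes the guest is paused between taking a snapshot and receiving its acknowledgement, so that nothing is queued while a snapshot is in flight.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_QUEUE_MAX 16 /* hypothetical bound on stalled outputs */

typedef struct {
    int pending[SKETCH_QUEUE_MAX]; /* outputs stalled since the last snapshot */
    size_t npending;               /* number of stalled outputs */
    size_t nreleased;              /* outputs already visible to the outside */
    unsigned long epoch;           /* snapshots acknowledged by the slave */
} SketchFtState;

/* The guest wants to emit an output (packet or disk write): stall it
 * instead of letting it reach the outside world. Returns -1 when the
 * queue is full and a snapshot must be taken first. */
static int sketch_queue_output(SketchFtState *s, int output_id)
{
    assert(s != NULL);
    if (s->npending == SKETCH_QUEUE_MAX) {
        return -1;
    }
    s->pending[s->npending++] = output_id;
    return 0;
}

/* The slave acknowledged the snapshot that covers every stalled output:
 * release them all and start a new epoch. Returns how many were released. */
static size_t sketch_snapshot_acked(SketchFtState *s)
{
    size_t released;

    assert(s != NULL);
    released = s->npending;
    s->nreleased += released;
    s->npending = 0;
    s->epoch++;
    return released;
}
```

The point of the structure is that nothing counted in nreleased can refer to guest state the slave has not received, which is the property the stall is meant to guarantee.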
In the current implementation, it is actually stalling any type of network that goes through virtio-net. However, if the application was using unreliable protocols, it should have its

Yes, of course. I currently don't have good numbers that I can share right now. Snapshots/sec depends on what kind of workload is running, and if the guest was almost idle, there will be no snapshots in 5sec. On the other hand, if the guest was running I/O intensive workloads (netperf, iozone for example), there

This also depends on the workload. We're currently sending a full copy because we're completely reusing this part of the existing live migration framework. Last time we measured, it was about 13KB.

Thank you for the guidance. I hope this answers your question. At the same time, I would also be happy if we could discuss how to implement it too. In fact, we needed a hack to prevent rip from proceeding in KVM, which turned out not to be the best workaround.

Thanks,
--
50 is too small: this depends on the synchronization speed and does not show how many snapshots we need, right? --
No, it doesn't. It's example data that I measured before. --
Why do you treat tcp differently? You can damage the entire VM this way - think of a dhcp request that was dropped at the moment you switched

So, do you agree that an extra clock synchronization is not needed since

The hardest would be memory intensive loads. So 100 snap/sec means latency of 10msec, right? (not that it's not ok, with faster hw and IB you'll be able to get much

There are brute force solutions like
- stop the guest until you send all of the snapshot to the remote (like standard live migration)
- Stop + fork + cont the father

Or mark the recent dirty pages that were not sent to the remote as write
--
I'm not trying to say that we should treat tcp differently, but just that it's severe. In case of a dhcp request, the client would have a chance to retry after failover, correct? BTW, in the current implementation, it's synchronizing before the dhcp ack is sent. But in case of tcp, once you send an ack to the client before sync, there

I agree that it's sent as part of the live migration. What I wanted to say here is that this is not something for real time applications. I usually get questions like can this guarantee fault tolerance for

Doesn't 100 snap/sec mean the interval between snaps is 10msec? IIUC, to get the latency, you need to get:

  Time to transfer VM + Time to get response from the receiver.

It's hard to say which load is the hardest. Memory intensive loads, which don't generate I/O often, will suffer from a long sync time at that moment, but would have chances to continue processing until the sync. I/O intensive loads, which don't dirty many pages, will suffer from

I think I had that suggestion from Avi before. And yes, it's very fascinating. Meanwhile, if you look at the diffstat, it needed to touch many parts of QEMU. Before going into further implementation, I wanted to check that I'm
--
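The distinction drawn above between the snapshot interval and the sync latency can be written down as two small formulas. The sketch below is only arithmetic; snapshot_interval_ms() and sync_latency_ms() are hypothetical helper names, and any numbers fed to them are illustrative rather than measured.

```c
#include <assert.h>

/* Interval between snapshots, in milliseconds, for a given snapshot rate. */
static double snapshot_interval_ms(double snaps_per_sec)
{
    assert(snaps_per_sec > 0.0);
    return 1000.0 / snaps_per_sec;
}

/* Sync latency in milliseconds, following the message above:
 * time to transfer the VM state plus time to get the receiver's response. */
static double sync_latency_ms(double bytes, double bytes_per_sec,
                              double response_ms)
{
    assert(bytes_per_sec > 0.0);
    return bytes * 1000.0 / bytes_per_sec + response_ms;
}
```

For example, snapshot_interval_ms(100.0) is 10 ms regardless of how long each sync takes. With made-up figures of 1,000,000 bytes of state, a 125,000,000 bytes/sec link (the raw rate of GbE) and a 0.2 ms response time, sync_latency_ms() gives about 8.2 ms, so the two quantities are independent and only coincide by accident.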
I'm slightly confused about the current implementation vs. my recollection of the original paper with Xen. I had thought that all disk and network I/O was buffered in such a way that at each checkpoint, the I/O operations would be released in a burst. Otherwise, you would have to synchronize after every I/O operation which is what it seems the current implementation does. I'm not sure how that is accomplished atomically though since you could have a completed I/O operation duplicated on the slave node provided it didn't notify completion prior to failure. Is there another kemari component that somehow handles buffering I/O that is not obvious from these patches? Regards, Anthony Liguori --
Yes, you're almost right. It's synchronizing before QEMU starts emulating I/O at each device model. It was originally designed that way to avoid complexity of introducing buffering

That's exactly the point I wanted to discuss. Currently, we're calling vm_stop(0), qemu_aio_flush() and bdrv_flush_all() before qemu_save_state_all() in ft_tranx_ready(), to ensure outstanding I/O is complete. I mimicked what existing live migration is doing.

No, I'm not hiding anything, and I would share any information regarding Kemari to develop it in this community :-)

Thanks,
--
If NodeA is the master and NodeB is the slave, if NodeA sends a network packet, you'll checkpoint before the packet is actually sent, and then if a failure occurs before the next checkpoint, won't that result in both NodeA and NodeB sending out a duplicate version of the packet? Regards, Anthony Liguori --
Yes. But I think it's better than taking the checkpoint after. If we checkpoint after sending the packet, let's say it sent a TCP ACK to the client, and if a hardware failure occurred on NodeA during the transaction *but the client received the TCP ACK*, NodeB will resume from the previous state, and it may need to receive some data from the client. However, because the client has already received the TCP ACK, it won't resend the data to NodeB. It looks like this data is going to be dropped. Anyway, I've just started planning to move the sync point to the network/block layer, and I will post the result for discussion again. --
What if the guest is running a dhcp server? If we provide an IP to a client and then fail over to the secondary, that will run without knowing the

First, the huge cost of snapshots won't suit any real time app. Second, even if that wasn't the case, the tsc delta and kvmclock are synchronized as part of the VM state so there is no use in trapping it
--
That's problematic. So it needs to sync when the dhcp ack is sent.

I should apologize for my misunderstanding and explanation. I agree that we

I should study the clock in KVM, but won't tsc get updated by the HW after migration? I was wondering about the following case, for example:

1. The application on the guest calls rdtsc on host A.
2. The application uses the rdtsc value for something.
3. Failover to host B.
4. The application on the guest replays the rdtsc call on host B.
5. If the rdtsc value is different between A and B, the application may get into trouble because of it.
--
Regarding the TSC, we need to guarantee that the guest sees a monotonic TSC after migration, which can be achieved by adjusting the TSC offset properly. Besides, we also need a trapping TSC, so that we can tackle the case where the primary node and the standby node have different TSC frequencies. --
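The offset adjustment mentioned above can be sketched as follows, assuming the usual hardware model in which the guest-visible TSC is the host TSC plus a per-guest offset. The function names are hypothetical and this is not KVM's code. The sketch also assumes both hosts run the TSC at the same frequency, which is exactly the case that does not need trapping.

```c
#include <assert.h>
#include <stdint.h>

/* Guest-visible TSC when the hardware adds an offset to the host TSC.
 * Unsigned arithmetic wraps modulo 2^64, so the offset can act as a
 * negative value as well. */
static uint64_t sketch_guest_tsc(uint64_t host_tsc, uint64_t tsc_offset)
{
    return host_tsc + tsc_offset;
}

/* Offset to load on the standby node so that the guest TSC resumes from
 * last_guest_tsc (the value saved with the VM state) instead of jumping
 * to the standby host's own counter. */
static uint64_t sketch_tsc_offset_for_resume(uint64_t last_guest_tsc,
                                             uint64_t standby_host_tsc)
{
    uint64_t offset = last_guest_tsc - standby_host_tsc;

    /* Sanity check: at the moment of resume the guest sees the saved value. */
    assert(sketch_guest_tsc(standby_host_tsc, offset) == last_guest_tsc);
    return offset;
}
```

Since the standby host's TSC only moves forward after the resume, the guest value computed this way never drops below the saved one. If the two hosts tick at different rates, a fixed offset is not enough, which is the case the message above proposes to handle by trapping.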
You're right but this is already taken care of by normal save/restore process. Check void kvm_load_tsc(CPUState *env) function. --
Even with unreliable protocols, if slave takeover causes the receiver to have received a packet that the sender _does not think it has ever sent_, expect some protocols to break.

If the slave replaying master's behaviour since the last sync means it will definitely get into the same state of having sent the packet, that works out. But you still have to be careful that the other end's responses to that packet are not seen by the slave too early during that replay. Otherwise, for example, the slave may observe a TCP ACK to a packet that it hasn't yet sent, which is an error.

About IP idempotency: In general, IP packets are allowed to be lost or duplicated in the network. All IP protocols should be prepared for that; it is a basic property. However there is one respect in which they're not idempotent: The TTL field should be decreased if packets are delayed. Packets should not appear to live in the network for longer than TTL seconds. If they do, some protocols (like TCP) can react to the delayed ones differently, such as sending a RST packet and breaking a connection. It is acceptable to reduce TTL faster than the minimum. After all, it

That is a really satisfying number, thank you :-) Without this work I wouldn't have imagined that synchronised machines could work with such a low transaction rate.

-- Jamie
--
Even though the current implementation syncs just before network output, what you pointed out could happen. In this case, is the connection going to be lost, or would the client/server recover from it? If the latter, it would be fine, otherwise I wonder

So the problem is, when the slave takes over, it sends a packet with the same TTL

Thank you for your comments. Although I haven't prepared good data yet, I personally prefer to have the discussion with an actual implementation and experimental data.
--
In the case of TCP in a "synchronised state", I think it will recover according to the rules in RFC793. In an "unsynchronised state" (during connection), I'm not sure if it recovers or if it looks like a "Connection reset" error. I suspect it does recover but I'm not certain.

But that's TCP. Other protocols, such as over UDP, may behave differently, because this is not an anticipated behaviour of a

Yes. I guess this is a general problem with time-based protocols and virtual machines getting stopped for 1 minute (say), without knowing that real time has moved on for the other nodes. Some application transaction, caching and locking protocols will give wrong results when their time assumptions are discontinuous to such a large degree. It's a bit nasty to impose that on them after they worked so hard on their reliability :-)

However, I think such implementations _could_ be made safe if those programs can arrange to definitely be interrupted with a signal when the discontinuity happens. Of course, only if they're aware they may be running on a Kemari system...

I have an intuitive idea that there is a solution to that, but each time I try to write the next paragraph explaining it, some little complication crops up and it needs more thought. Something about concurrent, asynchronous transactions to keep the master running while recording the minimum states that replay needs to be safe, while slewing the replaying slave's virtual clock back to real time quickly during recovery mode.

-- Jamie
--
This series looks quite nice! I think it would make sense to separate out the things that are actually optimizations (like the dirty bitmap changes and the writev/readv changes) and to attempt to justify them with actual performance data. I'd prefer not to modify the live migration protocol ABI and it doesn't seem to be necessary if we're willing to add options to the -incoming flag. We also want to be a bit more generic with respect to IO. Otherwise, the series looks very close to being mergeable. Regards, Anthony Liguori --
I agree with the separation plan. For the dirty bitmap change, Avi and I discussed a patchset for upstream QEMU while you were offline (sorry if I'm wrong). Could you also take a look?

http://lists.gnu.org/archive/html/qemu-devel/2010-04/msg01396.html

Regarding writev, I agree that it should be backed with actual data, otherwise it should be removed. We attempted to do everything that may reduce the overhead

I totally agree with your approach not to change the protocol ABI. Can we add an option to -incoming? Like, -incoming ft_mode, for example

Thank you for your comment on each patch. To be honest, I wasn't that confident because I'm a newbie to KVM/QEMU and struggled with how to implement this in an acceptable way.

Thanks,
--
Yes, I've seen it and I don't disagree. That said, there ought to be perf data in the commit log so that down the road, the justification is

The series looks very good. I'm eager to see this functionality merged.

Regards,
--
We discussed moving the barrier to the actual output device, instead of the I/O port. This allows you to complete the I/O transaction before starting synchronization. Does it not work for some reason? -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. --
Sorry, I've just started working on that. I've posted this series to share what I have done so far. Thanks for looking. --
