> On 04/22/2010 08:53 PM, Yoshiaki Tamura wrote:
>> Anthony Liguori wrote:
>>> On 04/22/2010 08:16 AM, Yoshiaki Tamura wrote:
>>>> 2010/4/22 Dor Laor<dlaor@redhat.com>:
>>>>> On 04/22/2010 01:35 PM, Yoshiaki Tamura wrote:
>>>>>> Dor Laor wrote:
>>>>>>> On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> We have been implementing the prototype of Kemari for KVM, and
>>>>>>>> we're sending this message to share what we have now and TODO
>>>>>>>> lists. Hopefully, we would like to get early feedback to keep us
>>>>>>>> in the right direction. Although advanced approaches in the TODO
>>>>>>>> lists are fascinating, we would like to run this project step by
>>>>>>>> step while absorbing comments from the community. The current
>>>>>>>> code is based on qemu-kvm.git
>>>>>>>> 2b644fd0e737407133c88054ba498e772ce01f27.
>>>>>>>>
>>>>>>>> For those who are new to Kemari for KVM, please take a look at the
>>>>>>>> following RFC which we posted last year.
>>>>>>>>
>>>>>>>> http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html
>>>>>>>>
>>>>>>>> The transmission/transaction protocol, and most of the control
>>>>>>>> logic is implemented in QEMU. However, we needed a hack in KVM
>>>>>>>> to prevent rip from proceeding before synchronizing VMs. It may
>>>>>>>> also need some plumbing in the kernel side to guarantee
>>>>>>>> replayability of certain events and instructions, integrate the
>>>>>>>> RAS capabilities of newer x86 hardware with the HA stack, as
>>>>>>>> well as for optimization purposes, for example.
>>>>>>> [snip]
>>>>>>>
>>>>>>>> The rest of this message describes TODO lists grouped by each topic.
>>>>>>>>
>>>>>>>> === event tapping ===
>>>>>>>>
>>>>>>>> Event tapping is the core component of Kemari, and it decides on
>>>>>>>> which event the primary should synchronize with the secondary.
>>>>>>>> The basic assumption here is that outgoing I/O operations are
>>>>>>>> idempotent, which is usually true for disk I/O and reliable
>>>>>>>> network protocols such as TCP.
>>>>>>> IMO any type of network event should be stalled too. What if the
>>>>>>> VM runs a non-TCP protocol, and a packet that the master node sent
>>>>>>> reached some remote client, and the master failed before the sync
>>>>>>> to the slave?
>>>>>> In the current implementation, it is actually stalling any type of
>>>>>> network traffic that goes through virtio-net.
>>>>>>
>>>>>> However, if the application was using unreliable protocols, it should
>>>>>> have its own recovering mechanism, or it should be completely
>>>>>> stateless.
>>>>> Why do you treat TCP differently? You can damage the entire VM this
>>>>> way - think of a DHCP request that was dropped at the moment you
>>>>> switched between the master and the slave?
>>>> I'm not trying to say that we should treat tcp differently, but just
>>>> it's severe.
>>>> In case of dhcp request, the client would have a chance to retry after
>>>> failover, correct?
>>>> BTW, in current implementation,
>>>
>>> I'm slightly confused about the current implementation vs. my
>>> recollection of the original paper with Xen. I had thought that all disk
>>> and network I/O was buffered in such a way that at each checkpoint, the
>>> I/O operations would be released in a burst. Otherwise, you would have
>>> to synchronize after every I/O operation which is what it seems the
>>> current implementation does.
>>
>> Yes, you're almost right.
>> It's synchronizing before QEMU starts emulating I/O at each device model.
>
> If NodeA is the master and NodeB is the slave, if NodeA sends a network
> packet, you'll checkpoint before the packet is actually sent, and then
> if a failure occurs before the next checkpoint, won't that result in
> both NodeA and NodeB sending out a duplicate version of the packet?
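
To make the two schemes discussed above concrete, here is a minimal toy
sketch (plain Python, not QEMU or Kemari code). All function names and the
failure model are assumptions made for illustration only. It compares
"checkpoint before each I/O is emulated" with "buffer output and release it
after the checkpoint", and lists which packets a remote peer would see twice.

```python
from collections import Counter


def run_sync_before_io(packets, fail_after_send):
    """Toy model: checkpoint right before each packet is emitted.

    NodeA fails just after sending packets[fail_after_send], before the
    next checkpoint. NodeB resumes from the last checkpoint, whose state
    still has that packet pending, so it emits it again.
    """
    wire = []
    for i, pkt in enumerate(packets):
        checkpoint = i  # NodeB now holds state "about to send packets[i]"
        wire.append(("NodeA", pkt))
        if i == fail_after_send:
            for later in packets[checkpoint:]:
                wire.append(("NodeB", later))
            return wire
    return wire


def run_buffered_release(packets, epoch_size, fail_at):
    """Toy model: buffer output per epoch, release only after a checkpoint.

    NodeA fails just before generating packets[fail_at]. Output still in
    the buffer never reached the wire, so NodeB regenerates it from the
    last checkpoint without producing duplicates.
    """
    wire = []
    committed = 0  # number of packets covered by the last checkpoint
    buffered = []
    for i, pkt in enumerate(packets):
        if i == fail_at:
            for later in packets[committed:]:
                wire.append(("NodeB", later))
            return wire
        buffered.append(pkt)
        if len(buffered) == epoch_size:
            committed = i + 1  # checkpoint taken, then output released
            wire.extend(("NodeA", p) for p in buffered)
            buffered = []
    # No failure: treat the end of the run as a final checkpoint.
    wire.extend(("NodeA", p) for p in buffered)
    return wire


def duplicates(wire):
    """Return payloads that appear on the wire more than once."""
    counts = Counter(pkt for _, pkt in wire)
    return sorted(p for p, n in counts.items() if n > 1)


if __name__ == "__main__":
    pkts = ["p0", "p1", "p2", "p3"]
    print(duplicates(run_sync_before_io(pkts, fail_after_send=1)))
    print(duplicates(run_buffered_release(pkts, epoch_size=2, fail_at=3)))
```

In this toy model the first scheme can put the same packet on the wire from
both nodes, which is harmless only if the I/O is idempotent, as the original
proposal assumes. The buffered scheme avoids the duplicate at the cost of
delaying output until the checkpoint completes.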