This patch implements a prototype of kexec multi-stage load. With this patch, the "backup pages map" can be passed to the kexeced kernel via /sbin/kexec, and sys_kexec_load() can be used to load a large hibernated image with a huge number of segments.

In kexec based hibernation, resuming from disk is implemented as loading the hibernated disk image with sys_kexec_load(). But unlike a normal kexec load, the hibernated image may have a huge number of segments, so multi-stage loading is necessary for a kexec load based implementation of resuming from disk. Multi-stage loading is also necessary for parameter passing from the original kernel to the kexeced kernel, because some information, such as the "backup pages map", is not available before loading.

Four stages are defined:

- KS_start: start stage; begins a new kexec loading; there must be exactly one KS_start stage in one kexec loading.
- KS_mid: middle stage; continues loading some segments; there may be zero or many KS_mid stages in one kexec loading; follows a KS_start or KS_mid stage.
- KS_final: final stage; finishes a kexec loading; there must be exactly one KS_final stage in one kexec loading; follows a KS_start or KS_mid stage.
- KS_full: backward compatible with the original loading semantics; finishes all the work of a kexec loading in one KS_full stage.

Overlapping between pages of different segments is allowed, to support "parameter passing".

During loading, a hash table mapping from destination page to source page is used instead of the original linear mapping implementation. Because the hibernated image may be very large (up to nearly the size of physical memory), it is very time-consuming to search the linear mapping for the source page of a given destination page, which is needed to check whether a newly allocated page is in the range of allocated destination pages. The original mapping is only used by assembly code to swap the page contents.

This map is also exported to user space via /proc/kexec_pgmap, so that /sbin/kexec can use it to construct ...
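The ordering rules for the four stages can be read as a small state machine. The following is only a sketch of that reading, not code from the patch: the enum values, struct load_state and kexec_stage_ok() are hypothetical names, and rejecting a second KS_start (or a KS_full) while a load is in progress is an assumption about how "only one KS_start" would be enforced.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stage names follow the patch description; the numeric values are assumed. */
enum kexec_stage { KS_START, KS_MID, KS_FINAL, KS_FULL };

/* Hypothetical tracker: true while a multi-stage load is in progress. */
struct load_state {
    bool in_progress;
};

/*
 * Return true if 'stage' is a legal next step and update the tracker.
 * Assumption: KS_start and KS_full are refused while a load is open.
 */
bool kexec_stage_ok(struct load_state *st, enum kexec_stage stage)
{
    assert(st != NULL);

    switch (stage) {
    case KS_START:
    case KS_FULL:
        if (st->in_progress)
            return false;
        /* KS_full opens and closes the load in a single step. */
        st->in_progress = (stage == KS_START);
        return true;
    case KS_MID:
        /* Must follow KS_start or another KS_mid. */
        return st->in_progress;
    case KS_FINAL:
        if (!st->in_progress)
            return false;
        st->in_progress = false;
        return true;
    }
    return false;
}
```

A caller would keep one struct load_state per pending image and refuse the system call whenever kexec_stage_ok() returns false.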
Hi Huang,

Had a quick look at the patch. Will review in detail soon. Had a few thoughts.

In general, these patches are on top of the previous kexec jump patches. It would be good if you could repost your updated patches so that I can apply them and get some testing going.

Last time I tried the patches (V9), kexec jump did not work for me. I was not getting timer interrupts in the second kernel. Then I had to put the LAPIC and IOAPIC in legacy mode, and then at least one-way jump started working. I am not sure how the next kernel boots for you without putting the APICs in legacy mode. (Yet to make returning back to original kernel work

I understand that hibernated images are huge. But why do we require multi stage loading? I knew there was a maximum segment limit in kexec. But I think we can change that limit. Anything else prevents us from

This seems to be an optimization of kexec so that it becomes efficient in loading large images (containing a large number of segments). Probably this can be a separate patch.

IMHO, we can just first write a minimal patch where one can just switch between kernels. Once that patch is upstream, we can enhance it to do the hibernation and saving core functionality. Incremental review becomes easier. Your last patch (v9) was a good attempt at that and

Is the kexec_jump v9 patch good enough, or do you have another internal version of the patch on top of which this patch applies?

Thanks
Vivek
--
Hi, Vivek,

The kexec jump patch v9 is sufficient for this patch to work. I have no

Can normal kexec (without kexec jump) work without putting the LAPIC and IOAPIC in legacy mode? Does this mean we should put the LAPIC and IOAPIC into legacy mode before kexec and restore them after? The kexec jump patch works well on my IBM T42. But it seems that the IOAPIC is disabled in the BIOS, so I can only use the i8259 and LAPIC on this

There are two reasons for multi-stage loading:

- Pass the backup pages map from the original kernel (A) to the kexeced kernel (B), because it is not known before loading. We have discussed this before in:
  http://lkml.org/lkml/2008/3/12/308
  http://lkml.org/lkml/2008/3/14/59
  http://lkml.org/lkml/2008/3/21/299

- Load a large hibernated image. The hibernated image can be not only large but also discontinuous. For example, if the physical memory size is 4G and there is one free page every 2 pages, there will be nearly 2G of single-page segments. Loading these segments in one go is impossible, so multi-stage load is necessary. And if the hibernated image is compressed, it is also very difficult to load it in one go because the

Agreed. We can first focus on the kexec jump patch. But as in the last thread of kexec jump (v9), we need a protocol for parameter passing between kernel A and kernel B. So, we can use this patch as a prototype for the

v9 is the latest kexec jump patch; there is no other internal version so far.

Best Regards,
Huang Ying
--
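To put a number on the fragmented-memory example in the mail above: assuming 4 KiB pages (the mail does not state a page size) and reading "2G" as the amount of data, 4 GiB of memory with every other page in use gives 524288 one-page segments that cannot be merged. For comparison, the per-load segment cap in mainline at the time, KEXEC_SEGMENT_MAX, was 16 if memory serves. The helper below is illustrative arithmetic only; worst_case_segments() and SKETCH_PAGE_SIZE are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed page size: 4 KiB, the x86 default. */
#define SKETCH_PAGE_SIZE 4096ULL

/*
 * Number of single-page segments when used and free pages alternate,
 * so that no two used pages are adjacent and none can be merged.
 */
uint64_t worst_case_segments(uint64_t mem_bytes)
{
    /* The sketch expects a whole number of pages. */
    assert(mem_bytes % SKETCH_PAGE_SIZE == 0);

    uint64_t pages = mem_bytes / SKETCH_PAGE_SIZE;
    return pages / 2;
}
```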
We do put the LAPIC and IOAPIC in legacy mode in normal kexec. Look at disable_IO_APIC() in native_machine_shutdown(). So I think we shall

I went through the above mail thread again, where we were discussing what information needs to be passed between kernels. Last time we enumerated three things:

- Kernel entry/re-entry point for switching between kernels.
- Backup pages map for core filtering.
- Probably ELF core notes for saving the hibernated image.

I think if we just implement the functionality so that one can switch back and forth between kernels (no hibernated image saving), then we probably need to pass around only the kernel entry/re-entry point and nothing else, and in your patches I think you are already doing that using %edi. So, IMHO, for the first simple implementation, we don't have to pass around any data between kernels except the entry point. (Please correct me if I am wrong.) Let's get that implementation in first and then we can get rest

Great. I got busy with other stuff last time. Will download v9 again and give it a try.

Thanks
Vivek
--
On Tue, 2008-05-13 at 22:56 -0400, Vivek Goyal wrote:

Yes. The kernel entry/re-entry point is the only information that needs to be communicated between kernels for just switching between them. So we can focus on the kexec jump patch first.

Best Regards,
Huang Ying
--
Then as a preliminary design let's plan on this.

- Pass the re-entry point as the return address (using the C ABI). We may want to load the stack pointer etc. so we can act as a direct entry point for new code.
- Look at passing a pointer to the mapping of pages that the kexec trampoline uses in arg1 of the C ABI. Largely the format is de facto fixed anyway because we need to pass the structure from C to assembly.

Using the standard C ABI makes it much easier to pick a calling convention, and to document it.

Eric
--
You mean pass image->head to purgatory of /sbin/kexec using arg1 of C Yes. Best Regards, Huang Ying --
On Wed, 2008-05-14 at 14:43 -0700, Eric W. Biederman wrote:

There are some issues with passing the entry point as the return address. The kexec jump (or kexec with return) is used for:

- Switching between the original kernel (A) and the kexeced kernel (B)
- Calling some code (such as BIOS code) in physical mode

1) When calling some code in physical mode, the called code can use a simple return to return to kernel A. So there is no return address on the stack after returning to kernel A. Instead, argument 1 is on the stack top.

2) When switching back from kernel B to kernel A, kernel B will call the jump back entry of kernel A with the C ABI. So the return address is on the stack top, and kernel A gets the jump back entry of kernel B via the return address.

Because the stack state is different between 1) and 2), the jump back entry of kernel A should distinguish them. Possible solutions are as follows:

a) Before kernel A calls some physical mode code or kernel B, it sets argument 1 to a magic number that cannot be a return address (such as -1). The jump back entry of kernel A can then check whether the stack top is argument 1 or a return address.

b) Distinguish by return value. For example, called physical mode code must return 0, while kernel B must set %eax to some other number.

c) Use different entry points for 1) and 2). The two entry points are derived from the return address, for example:

     entry1 = return_address;
     entry2 = return_address & ~0xfff; /* page aligned */

   entry1 is used by physical mode code; entry2 is used by kernel B.

Which one is better? Or is there some other solution?

Best Regards,
Huang Ying
--
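Option c) from the mail above can be written out as a tiny helper. This is a sketch only, assuming 32-bit physical addresses and 4 KiB pages; struct jump_entries and derive_entries() are hypothetical names, not anything from the kexec jump patches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The two entry points of option c), both derived from one return address. */
struct jump_entries {
    uint32_t entry1; /* used by physical mode code: a plain return */
    uint32_t entry2; /* used by kernel B: start of the same page */
};

void derive_entries(uint32_t return_address, struct jump_entries *out)
{
    assert(out != NULL);

    out->entry1 = return_address;
    /* Clear the low 12 bits to get the page-aligned address. */
    out->entry2 = return_address & ~0xfffu;
}
```

With a return address of 0x00123abc, entry1 stays 0x00123abc and entry2 becomes 0x00123000, so the two callers land on different code as long as the call site is not itself at the start of a page.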
Yes. Because the stack state is different we need to be careful. However I don't see that we care how we got to the proper piece of code. If we don't care we don't need to distinguish them.

Therefore I see two possible solutions.

1) Write a tiny trampoline that goes in the core file to keep the calling conventions sane.
2) After we figure out our address read the stack pointer from a fixed location and simply set it. (This is my preference)

Eric
--
On Thu, 2008-05-15 at 11:39 -0700, Eric W. Biederman wrote:

Just for confirmation (my English is poor): do you mean that kernel A just reads the stack top as the re-entry point, regardless of whether it is a return address or argument 1?

Best Regards,
Huang Ying
--
What I was thinking was:
In kernel A()
relocate_new_kernel:
        ...
        call    *%eax
kexec_jump_back_entry:
        /* This code should be PIC so figure out where we are */
        call    1f
1:
        popl    %edi
        subl    $(1b - relocate_kernel), %edi
        /* Setup a safe stack */
        leal    PAGE_SIZE(%edi), %esp
        ...
Then in purgatory we can read the address of kexec_jump_back_entry
by examining 0(%esp) and export it in whatever fashion is sane.
However we reach kexec_jump_back_entry we should be fine.
Eric
--
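The address arithmetic in Eric's assembly fragment can be modelled in C to make the two steps explicit: the popped return address minus the assemble-time offset of label 1 gives the run-time base of the code page, and the stack is then set one page above that base. This is a sketch under assumptions: 32-bit addresses, a 4 KiB PAGE_SIZE, and hypothetical function names.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u /* assumed value of PAGE_SIZE */

/*
 * runtime_1b:  value popped into %edi, the run-time address of label 1.
 * link_off_1b: the assemble-time constant (1b - relocate_kernel).
 * Returns the run-time address of relocate_kernel.
 */
uint32_t control_page_base(uint32_t runtime_1b, uint32_t link_off_1b)
{
    /* The label cannot sit below the start of its own page of code. */
    assert(link_off_1b <= runtime_1b);
    return runtime_1b - link_off_1b;
}

/* Models leal PAGE_SIZE(%edi), %esp: the stack grows down from here. */
uint32_t safe_stack_top(uint32_t base)
{
    return base + SKETCH_PAGE_SIZE;
}
```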
I think it is reasonable to enable jumping back and forth more than once. So the following should be possible:

1. Jump from A to B (actually jump to purgatory, which triggers the boot of B)
2. Jump from B to A
3. Jump from A to B again (jump to the kexec_jump_back_entry of B)
4. Jump from B to A
...

So it should be possible to get the re-entry point of kernel B in kexec_jump_back_entry of kernel A too. So I think in kexec_jump_back_entry, the caller's stack should be checked to get the re-entry point of the peer. And the stack state differs depending on where we come from: relocate_new_kernel() or a return.

Best Regards,
Huang Ying
--
Huang is making use of purgatory only for booting kernel B for the first
time. Once kernel B is booted, all the transitions (A-->B and B<--A)
happen without using purgatory. Just keep on jumping back and forth
to "kexec_jump_back_entry".
Probably not using purgatory for later transitions is justified as long as
kernel code is simple and small. Otherwise we shall have to teach
To me this idea also looks good. So control flow will look something
as follows?
relocate_new_kernel:
        if (!preserve_context)
                set registers to known state.
                jump to purgatory.
        else
                goto jump-back-setup

jump-back-setup:
        - Color the stack.
                movl $0xffffffff, 0(%esp)
        - call %edx

kexec_jump_back_entry:
        - If 0(%esp) is not -1
                image->start = 0(%esp)  // Re-entry point of kernel B. Store it.
          else
                We returned from BIOS call. Re-entry point has not changed.
                Do nothing.
        - Continue to resume kernel A
Thanks
Vivek
--
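The stack-colouring check in kexec_jump_back_entry from Vivek's outline can be modelled as follows. This is a sketch, not kernel code: struct kimage_sketch is a hypothetical stand-in for the kernel's struct kimage with only the start field, and record_reentry() is an invented name. The value found at 0(%esp) is passed in as stack_top.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The colour written to the stack: -1 cannot be a valid return address. */
#define STACK_COLOR 0xffffffffu

/* Hypothetical stand-in for struct kimage; only 'start' matters here. */
struct kimage_sketch {
    uint32_t start; /* re-entry point of the peer kernel */
};

/*
 * Returns 1 if the peer's re-entry point was updated (we were called by
 * kernel B), 0 if the colour is still there (a plain return, e.g. from
 * a BIOS call) and nothing changed.
 */
int record_reentry(struct kimage_sketch *image, uint32_t stack_top)
{
    assert(image != NULL);

    if (stack_top != STACK_COLOR) {
        image->start = stack_top;
        return 1;
    }
    return 0;
}
```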
Thinking more about it, probably we don't have to separate out the preserve context and normal kexec paths. Both can transition to purgatory using call %edx. Coloring the stack should do no harm in normal kexec.

Thanks
Vivek
--
That logic has more conditionals than I like but it may in fact be reasonable. I don't have any fundamental objections to making this a co-routine interface.

That said, I think immediately implementing a coroutine interface is a premature optimization. Please let's work on call/return. Then prototype the coroutine method of suspend to swap and see how much time it saves us.

Honestly I will be surprised if time will be saved, as historically at least the bottleneck in kernel startup time is initializing hardware, and we need to essentially redo all of that initialization.

Eric
--
(And we go through purgatory which remembers Yes. Any conditional logic needs to be in purgatory or a similar trampoline. Eric --
IMHO, this kind of makes more sense to me when keeping C function like semantics in mind. Both cases can be treated like calls to functions (calling a BIOS function and jumping to kernel B). The basic difference between the two cases is the re-entry point. In the BIOS function case, we always re-enter the function at the start, but in the case of kernel B, except for the first entry, all other entries happen at a run time determined address, which needs to be communicated to kernel A.

I would think that the second kernel B should just execute "ret", and the new entry address of kernel B is passed to kernel A through %eax (the return value of the function). Not sure if BIOS routines can always return a fixed code so that we can differentiate between the two cases.

Thanks
Vivek
--
On Thu, 2008-05-15 at 22:00 -0400, Vivek Goyal wrote:

The disadvantage of this solution is that a kernel must know whether it is the original kernel (A) or the kexeced kernel (B), and different code must be used by kernel A and kernel B. Also, after jumping from A to B and from B to A, when jumping from A to B again, kernel A must use different code from the first time.

Best Regards,
Huang Ying
--
I don't know what the case is for keeping two kernels in memory and switching between them. I suspect a small piece of trampoline code between the two kernels could handle the case. (i.e. purgatory pays attention). That is a fundamental aspect of the design. A general purpose infrastructure with trampoline code to adapt it to whatever situation comes up. Eric --
This can be used to save the memory image of kernel B and accelerate the
It is possible to use purgatory to deal with this problem.

Jump from kernel A to kernel B
    Jump to entry of purgatory (purgatory_entry)
    Purgatory saves the return address (kexec_jump_back_entry_A)
    Purgatory sets kexec_jump_back_entry for kernel B to a code
    segment in purgatory, say kexec_jump_back_entry_A_for_B
    Purgatory jumps to the entry point of kernel B

Jump from kernel B to kernel A
    Jump to purgatory (kexec_jump_back_entry_A_for_B)
    Purgatory saves the return address (kexec_jump_back_entry_B)
    Purgatory returns to kernel A (kexec_jump_back_entry_A)

Jump from kernel A to kernel B again
    Jump to entry of purgatory (purgatory_entry)
    Purgatory saves the return address (kexec_jump_back_entry_A)
    Purgatory jumps to kexec_jump_back_entry_B

The disadvantage of this solution is that some information is saved in
purgatory (kexec_jump_back_entry_A, kexec_jump_back_entry_B). So,
purgatory must be saved too when saving the memory image of kernel A or
kernel B. Purgatory can be seen as a part of kernel B. But it is a
little tricky to think of it as a part of kernel A too.
Best Regards,
Huang Ying
--
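The purgatory-mediated flow described above amounts to purgatory remembering one re-entry address per kernel. The following sketch models only that bookkeeping; struct purgatory_state and both functions are hypothetical names, addresses are plain 32-bit numbers, and 0 is assumed to mean "not yet known".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* State purgatory would keep between jumps; 0 means "not yet known". */
struct purgatory_state {
    uint32_t entry_a; /* kexec_jump_back_entry_A */
    uint32_t entry_b; /* kexec_jump_back_entry_B */
};

/*
 * Kernel A enters at purgatory_entry with its return address.
 * Returns where purgatory jumps next: kernel B's boot entry on the
 * first jump, kexec_jump_back_entry_B on later ones.
 */
uint32_t purgatory_from_a(struct purgatory_state *p, uint32_t ret_addr_a,
                          uint32_t boot_entry_b)
{
    assert(p != NULL);
    p->entry_a = ret_addr_a;
    return p->entry_b ? p->entry_b : boot_entry_b;
}

/*
 * Kernel B enters at kexec_jump_back_entry_A_for_B with its return
 * address. Returns kexec_jump_back_entry_A as saved on the way in.
 */
uint32_t purgatory_from_b(struct purgatory_state *p, uint32_t ret_addr_b)
{
    assert(p != NULL);
    p->entry_b = ret_addr_b;
    return p->entry_a;
}
```

The two saved fields are exactly the state that, as the mail notes, would have to be written out together with the memory image of either kernel.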
That's a good point. Remembering the actual return points in purgatory will require purgatory to be saved along with the core file. I think purgatory is a good infrastructure for transitions between the kernels, but at the same time, here it is a matter of just making a "call" and then inspecting the stack in kexec_jump_back_entry. IMHO, we can keep it simple and not involve purgatory in later transitions.

Thanks
Vivek
--
Yes, please.

Pavel
--
