Re: [PATCH] kexec based hibernation: a prototype of kexec multi-stage load

From: Huang, Ying
Date: Sunday, May 11, 2008 - 11:40 pm

This patch implements a prototype of kexec multi-stage load. With this
patch, the "backup pages map" can be passed to the kexeced kernel via
/sbin/kexec, and sys_kexec_load can be used to load a large
hibernated image with a huge number of segments.


In kexec based hibernation, resuming from disk is implemented as
loading the hibernated disk image with sys_kexec_load(). But unlike
a normal kexec load, the hibernated image may have a huge number of
segments, so multi-stage loading is necessary to implement resume
from disk on top of kexec load. Multi-stage loading is also
necessary for parameter passing from the original kernel to the kexeced
kernel, because some information, such as the "backup pages map", is not
available before loading.


Four stages are defined:

- KS_start: start stage; begin a new kexec loading; there must be only
  one KS_start stage in one kexec loading.

- KS_mid: middle stage; continue loading some segments; there may be
  zero or more KS_mid stages in one kexec loading; follows a KS_start or
  KS_mid stage.

- KS_final: final stage; finish a kexec loading; there must be only
  one KS_final stage in one kexec loading; follows a KS_start or
  KS_mid stage.

- KS_full: backward compatible with the original loading semantics; finish
  all work of a kexec loading in one KS_full stage.
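
The stage ordering described above can be sketched as a small state check. This is only an illustration, not code from the patch: the names `kexec_stage_accept` and `struct kexec_loader` are hypothetical, and rejecting a second KS_start while a load is in progress is an assumption drawn from the "only one KS_start" rule.

```c
#include <stdbool.h>

/* Hypothetical identifiers; the patch's real names may differ. */
enum kexec_stage { KS_START, KS_MID, KS_FINAL, KS_FULL };

/* Loader state: true between a KS_start and the matching KS_final. */
struct kexec_loader {
    bool loading;
};

/*
 * Returns true and updates the state if `stage` is legal now.
 * Assumption: KS_start or KS_full during an open load is rejected.
 */
static bool kexec_stage_accept(struct kexec_loader *l, enum kexec_stage stage)
{
    switch (stage) {
    case KS_START:
    case KS_FULL:
        if (l->loading)
            return false;
        /* KS_full completes in one step, so no load stays open. */
        l->loading = (stage == KS_START);
        return true;
    case KS_MID:
        /* Must follow KS_start or another KS_mid. */
        return l->loading;
    case KS_FINAL:
        if (!l->loading)
            return false;
        l->loading = false;
        return true;
    }
    return false;
}
```

A sequence such as KS_start, KS_mid, KS_mid, KS_final is accepted, while a KS_mid with no preceding KS_start is refused.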


Overlapping between pages of different segments is allowed to support
"parameter passing".


During loading, a hash table mapping destination pages to source
pages is used instead of the original linear mapping
implementation. Because the hibernated image may be very large (up to
nearly the size of physical memory), it is very time-consuming to search
the linear map for the source page of a given destination page; this
lookup is used to check whether a newly allocated page is in the range
of allocated destination pages. The original mapping is only used by
assembly code to swap the page contents. This map is also exported to
user space via /proc/kexec_pgmap, so that /sbin/kexec can use it to construct ...
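
To illustrate why a hash table helps here, the following is a minimal user-space sketch of a destination-to-source page map with chained buckets. It is not the patch's implementation: the names (`pgmap_insert`, `pgmap_lookup`), the bucket count, and the hash function are all assumptions chosen for the example.

```c
#include <stddef.h>
#include <stdlib.h>

#define PGMAP_BUCKETS 1024u /* hypothetical size; must be a power of two */

struct pgmap_entry {
    unsigned long dest_pfn;   /* destination page frame number (key) */
    unsigned long src_pfn;    /* source page frame number (value) */
    struct pgmap_entry *next; /* chain for colliding keys */
};

struct pgmap {
    struct pgmap_entry *buckets[PGMAP_BUCKETS];
};

/* Multiplicative hash; the multiplier is an arbitrary odd constant. */
static unsigned int pgmap_hash(unsigned long pfn)
{
    return (unsigned int)((pfn * 2654435761ul) >> 7) & (PGMAP_BUCKETS - 1);
}

/* Records dest -> src. Returns 0 on success, -1 if allocation fails. */
static int pgmap_insert(struct pgmap *m, unsigned long dest, unsigned long src)
{
    struct pgmap_entry *e = malloc(sizeof(*e));
    unsigned int h = pgmap_hash(dest);

    if (!e)
        return -1;
    e->dest_pfn = dest;
    e->src_pfn = src;
    e->next = m->buckets[h];
    m->buckets[h] = e;
    return 0;
}

/* Returns 1 and stores the source page in *src if dest is mapped, else 0. */
static int pgmap_lookup(const struct pgmap *m, unsigned long dest,
                        unsigned long *src)
{
    const struct pgmap_entry *e;

    for (e = m->buckets[pgmap_hash(dest)]; e; e = e->next) {
        if (e->dest_pfn == dest) {
            *src = e->src_pfn;
            return 1;
        }
    }
    return 0;
}
```

With a well-spread hash, checking whether a newly allocated page collides with a destination page inspects only one short chain instead of walking the whole linear map. Freeing the entries is omitted to keep the sketch short.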
From: Vivek Goyal
Date: Monday, May 12, 2008 - 10:34 pm

Hi Huang,

Had a quick look at the patch. Will review in detail soon. Had a few
thoughts.

In general, these patches are on top of previous kexec jump patches.
It would be good if you could repost your updated patches so that
I can apply the patches and get some testing going.

Last time I tried the patches (V9), kexec jump did not work for me. I
was not getting timer interrupts in the second kernel. I then had to put
the LAPIC and IOAPIC in legacy mode, after which at least the one-way
jump started working.
I am not sure how the next kernel boots for you without putting APICs
in legacy mode. (Yet to make returning back to original kernel work

I understand that hibernated images are huge. But why do we require
multi stage loading? I knew there was a maximum segment limit in kexec.
But I think we can change that limit. Anything else prevents us from

This seems to be an optimization of kexec so that it becomes efficient
in loading large images (containing large number of segments). Probably
this can be a separate patch.

IMHO, we can just first write a minimal patch where one can just switch
between kernels. Once that patch is upstream, we can enhance
it to do the hibernation and saving core functionality. Incremental
review becomes easier. Your last patch (v9) was a good attempt at that and

Is the kexec_jump v9 patch good enough, or do you have another internal
version of the patch on top of which this patch applies?

Thanks
Vivek
--

From: Huang, Ying
Date: Tuesday, May 13, 2008 - 6:57 pm

Hi, Vivek,


The kexec jump patch v9 is sufficient for this patch to work. I have no

Can normal kexec (without kexec jump) work without putting LAPIC and
IOAPIC in legacy mode? Does this mean we should put LAPIC and IOAPIC
into legacy mode before kexec and restore them after?

The kexec jump patch works well on my IBM T42. But it seems that the
IOAPIC is disabled in BIOS, so I can only use i8259 and LAPIC on this

There are two reasons for multi-stage loading:

- Pass backup pages map from original kernel (A) to kexeced kernel (B),
because it is not known before loading. We have discussed this before
in:
	http://lkml.org/lkml/2008/3/12/308
	http://lkml.org/lkml/2008/3/14/59
	http://lkml.org/lkml/2008/3/21/299

- Load large hibernated image. The hibernated image can be not only
large but also discontinuous. For example, the physical memory size is
4G, and there is one free page every 2 pages, that is, there will be
nearly 2G segments. Loading these segments in one go is impossible. So
multi-stage load is necessary. And if the hibernated image is
compressed, it is also very difficult to load it in one go because the


Agreed. We can first focus on kexec jump patch. But as in last thread of
kexec jump (v9), we need a protocol for parameter passing between kernel
A and kernel B. So, we can use this patch as a prototype for the

v9 is the latest kexec jump patch, no other internal version so far.

Best Regards,
Huang Ying
--

From: Vivek Goyal
Date: Tuesday, May 13, 2008 - 7:56 pm

We do put LAPIC and IOAPIC in legacy mode in normal kexec. Look at 
disable_IO_APIC() in native_machine_shutdown(). So I think we shall


I went through the above mail thread again, where we were discussing what
information needs to be passed between kernels.

Last time we enumerated three things.

- kernel entry/re-entry point for switch between kernels.
- backup pages map for core filtering
- Probably ELF core notes for saving hibernated image.

I think if we just implement the functionality so that one can switch
back and forth between kernels (no hibernated image saving), then we probably
need to pass around only the kernel entry/re-entry point and nothing else, and
in your patches I think you are already doing that using %edi.

So, IMHO, for a first simple implementation, we don't have to pass around
any data between kernels except the entry point. (Please correct me if I am
wrong). Let's get that implementation in first and then we can get rest

Great. I got busy in other stuff last time. Will download the v9 again
and give it a try.

Thanks
Vivek
--

From: Huang, Ying
Date: Tuesday, May 13, 2008 - 8:37 pm

On Tue, 2008-05-13 at 22:56 -0400, Vivek Goyal wrote:



Yes. The kernel entry/re-entry point is the only information that needs to
be communicated between kernels for just switching between them. So we can
focus on the kexec jump patch first.

Best Regards,
Huang Ying

--

From: Eric W. Biederman
Date: Wednesday, May 14, 2008 - 2:43 pm

Then as a preliminary design let's plan on this.

- Pass the re-entry point as the return address (using the C ABI).
  We may want to load the stack pointer etc so we can act as
  a direct entry point for new code.

- Look at passing a pointer to the mapping of pages that the kexec
  trampoline uses in arg1 of the C ABI.  Largely the format is de facto
  fixed anyway because we need to pass the structure from C to
  assembly.

Using the standard C ABI makes it much easier to pick
a calling convention, and to document it.

Eric
--

From: Huang, Ying
Date: Wednesday, May 14, 2008 - 7:40 pm

You mean pass image->head to purgatory of /sbin/kexec using arg1 of C

Yes.

Best Regards,
Huang Ying

--

From: Huang, Ying
Date: Wednesday, May 14, 2008 - 9:57 pm

On Wed, 2008-05-14 at 14:43 -0700, Eric W. Biederman wrote:

There are some issues about passing entry point as return address. The
kexec jump (or kexec with return) is used for

- Switching between original kernel (A) and kexeced kernel (B)
- Call some code (such as BIOS code) in physical mode

1) When calling some code in physical mode, the called code can use a
simple return to return to kernel A. So there is no return address on
the stack after returning to kernel A. Instead, argument 1 is on the
stack top.

2) When switching back from kernel B to kernel A, kernel B will call the
jump back entry of kernel A with the C ABI. So the return address is on
the stack top, and kernel A gets the jump back entry of kernel B via the
return address.

Because the stack state is different between 1) and 2), the jump back
entry of kernel A should distinguish them. Possible solutions are as
follows:

a) Before kernel A calls some physical mode code or kernel B, it sets
argument 1 to a magic number that cannot be a return address (such as
-1). The jump back entry of kernel A can then check whether the stack top
is argument 1 or a return address.

b) Distinguish by return value. For example, called physical mode code
must return 0, while kernel B must set %eax to some other number.

c) Use different entry point for 1) and 2). Two entry points are deduced
from return address. Such as:

entry1 = return_address;
entry2 = return_address & ~0xfff;	/* page aligned */

entry1 is used by physical mode code. entry2 is used by kernel B.
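
As a sketch of option c), the two entry points could be computed as below. This assumes 32-bit addresses and 4 KiB pages, which is what the `~0xfff` mask implies; the type and function names (`struct jump_entries`, `derive_entries`) are hypothetical.

```c
#include <stdint.h>

#define PAGE_SIZE_BYTES 0x1000u /* assumes 4 KiB pages, matching ~0xfff */

struct jump_entries {
    uint32_t entry1; /* used by physical mode code: the return address */
    uint32_t entry2; /* used by kernel B: start of the containing page */
};

/* Derives both entry points from the return address on the stack. */
static struct jump_entries derive_entries(uint32_t return_address)
{
    struct jump_entries e;

    e.entry1 = return_address;
    e.entry2 = return_address & ~(uint32_t)(PAGE_SIZE_BYTES - 1);
    return e;
}
```

One consequence of this scheme is that the return address must not itself be page aligned, otherwise entry1 and entry2 coincide and the two cases can no longer be told apart.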


Which one is better? Or some other solution?

Best Regards,
Huang Ying

--

From: Eric W. Biederman
Date: Thursday, May 15, 2008 - 11:39 am

Yes.  Because the stack state is different we need to be careful.

However I don't see that we care how we got to the proper piece of
code.  If we don't care we don't need to distinguish them.

Therefore I see two possible solutions.
1) Write a tiny trampoline that goes in the core file to keep
   the calling conventions sane.

2) After we figure out our address read the stack pointer from
   a fixed location and simply set it.  (This is my preference)

Eric
--

From: Huang, Ying
Date: Thursday, May 15, 2008 - 6:41 pm

On Thu, 2008-05-15 at 11:39 -0700, Eric W. Biederman wrote:

Just for confirmation (My English is poor).

Do you mean that kernel A just read the stack top as re-entry point,
regardless of whether it is return address or argument 1?

Best Regards,
Huang Ying

--

From: Eric W. Biederman
Date: Thursday, May 15, 2008 - 7:25 pm

What I was thinking was:

In kernel A()

relocate_new_kernel:

        ...

        call	*%eax

kexec_jump_back_entry:
        /* This code should be PIC so figure out where we are */
        call	1f
1:
        popl	%edi
        subl	$(1b - relocate_kernel), %edi

        /* Setup a safe stack */
        leal    PAGE_SIZE(%edi), %esp
        ...


Then in purgatory we can read the address of kexec_jump_back_entry
by examining 0(%esp) and export it in whatever fashion is sane.

However we reach kexec_jump_back_entry we should be fine.

Eric
--

From: Huang, Ying
Date: Thursday, May 15, 2008 - 7:56 pm

I think it is reasonable to enable jumping back and forth more than one
time. So the following should be possible:

1. Jump from A to B (actually jump to purgatory, trigger the boot of B)
2. Jump from B to A
3. Jump from A to B again (jump to the kexec_jump_back_entry of B)
4. Jump from B to A
...

So it should be possible to get the re-entry point of kernel B in
kexec_jump_back_entry of kernel A too. So I think that in
kexec_jump_back_entry, the caller's stack should be checked to get
the re-entry point of the peer. And the stack state differs depending on
where control comes from: from relocate_new_kernel() or from a return.

Best Regards,
Huang Ying

--

From: Vivek Goyal
Date: Thursday, May 15, 2008 - 8:27 pm

Huang is making use of purgatory only for booting kernel B for the first
time. Once kernel B is booted, all the transitions (A-->B and B-->A)
happen without using purgatory. Just keep on jumping back and forth
to "kexec_jump_back_entry".

Probably not using purgatory for later transitions is justified as long as
kernel code is simple and small. Otherwise we shall have to teach

To me this idea also looks good. So control flow will look something
as follows?

relocate_new_kernel:
	
	if (!preserve_context)
		set registers to known state.
		jump to purgatory.
	else
		goto jump-back-setup:

jump-back-setup:
- Color the stack.
  move $0xffffffff 0(%esp)

- call %edx

kexec_jump_back_entry:

- If 0 (%esp) is not -1
	image->start = 0(%esp)  //Re entry point of kernel B. Store it.
  else
	We returned from BIOS call. Re-entry point has not changed
        Do nothing.

- Continue to resume kernel A
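
The check at kexec_jump_back_entry in the flow above can be modelled in C roughly as follows. This is a sketch only: the real check would live in assembly, `struct kimage_sketch` is a hypothetical stand-in for the kernel's image structure, and it assumes 0xffffffff can never be a valid re-entry address.

```c
#include <stdint.h>

/* The "colour" written to the stack before the call; assumed never to be
 * a valid return address. */
#define STACK_COLOR 0xffffffffu

/* Hypothetical stand-in for the field referred to as image->start. */
struct kimage_sketch {
    uint32_t start;
};

/*
 * Models the decision at kexec_jump_back_entry.
 * stack_top is the value found at 0(%esp) on entry.
 * Returns 1 if the re-entry point was updated (we came from kernel B),
 * 0 if the colour was found (a plain return, e.g. from a BIOS call).
 */
static int jump_back_update(struct kimage_sketch *image, uint32_t stack_top)
{
    if (stack_top != STACK_COLOR) {
        image->start = stack_top; /* new re-entry point of kernel B */
        return 1;
    }
    return 0; /* re-entry point unchanged */
}
```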

Thanks
Vivek
 
--

From: Vivek Goyal
Date: Friday, May 16, 2008 - 6:40 am

Thinking more about it, probably we don't have to separate out preserve
context and normal kexec path. Both can transition to purgatory using
call %edx. Coloring the stack should not harm in normal kexec.

Thanks
Vivek
--

From: Eric W. Biederman
Date: Saturday, May 17, 2008 - 6:59 pm

That logic has more conditionals than I like but it may in
fact be reasonable.  I don't have any fundamental objections
to making this a co-routine interface.

That said.  I think immediately implementing a coroutine interface
is a premature optimization.  Please let's work on call/return.
Then prototype the coroutine method of suspend to swap and see
how much time it saves us.

Honestly I will be surprised if time will be saved, as historically
at least the bottleneck in kernel startup time is initializing
hardware, and we need to essentially redo all of that initialization.

Eric

--

From: Eric W. Biederman
Date: Thursday, May 15, 2008 - 8:33 pm

(And we go through purgatory which remembers

Yes.

Any conditional logic needs to be in purgatory or a similar trampoline.

Eric
--

From: Vivek Goyal
Date: Thursday, May 15, 2008 - 7:00 pm

IMHO, this makes more sense to me when keeping C-function-like
semantics in mind.

Both the cases can be treated like calls to functions (calling BIOS function
and jumping to kernel B). The basic difference between two cases is the
re-entry point. In BIOS function case, we always re-enter the function at the
start but in case of kernel B, except first entry, all other entries happen
at a run time determined address, which needs to be communicated to kernel A.

I would think that the second kernel B should just execute "ret", and the new
entry address of kernel B is passed to kernel A through %eax (return value of
function).

Not sure if BIOS routines can always return a fixed code so that we can
differentiate between the two cases.

Thanks
Vivek
--

From: Huang, Ying
Date: Thursday, May 15, 2008 - 7:19 pm

On Thu, 2008-05-15 at 22:00 -0400, Vivek Goyal wrote:

The disadvantage of this solution is that a kernel must know whether it is
the original kernel (A) or the kexeced kernel (B), and kernels A and B must
use different code. Also, after jumping from A to B and then from B to A,
kernel A must use different code when jumping from A to B again than it
used the first time.

Best Regards,
Huang Ying

--

From: Eric W. Biederman
Date: Thursday, May 15, 2008 - 7:55 pm

I don't know what the case is for keeping two kernels in memory and switching
between them.

I suspect a small piece of trampoline code between the two kernels could
handle the case. (i.e. purgatory pays attention).

That is a fundamental aspect of the design.  A general purpose infrastructure
with trampoline code to adapt it to whatever situation comes up.

Eric
--

From: Huang, Ying
Date: Thursday, May 15, 2008 - 9:52 pm

This can be used to save the memory image of kernel B and accelerate the

It is possible to use purgatory to deal with this problem.

Jump from kernel A to kernel B
    Jump to entry of purgatory (purgatory_entry)
    Purgatory saves the return address (kexec_jump_back_entry_A)
    Purgatory sets kexec_jump_back_entry for kernel B to a code
        segment in purgatory, say kexec_jump_back_entry_A_for_B
    Purgatory jumps to entry point of kernel B
Jump from kernel B to kernel A
    Jump to purgatory (kexec_jump_back_entry_A_for_B)
    Purgatory saves the return address (kexec_jump_back_entry_B)
    Purgatory returns to kernel A (kexec_jump_back_entry_A)
Jump from kernel A to kernel B again
    Jump to entry of purgatory (purgatory_entry)
    Purgatory saves the return address (kexec_jump_back_entry_A)
    Purgatory jumps to kexec_jump_back_entry_B

The disadvantage of this solution is that some information is saved in
purgatory (kexec_jump_back_entry_A, kexec_jump_back_entry_B). So
purgatory must be saved too when saving the memory image of kernel A or
kernel B. Purgatory can be seen as a part of kernel B, but it is a
little tricky to treat it as a part of kernel A too.
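
The bookkeeping this scheme asks of purgatory can be sketched in C as follows. It is only a model of the sequence listed above, not real purgatory code: the function and struct names are hypothetical, and using 0 to mean "kernel B has not jumped back yet" is an assumption of the example.

```c
#include <stdint.h>

/* State that would have to live in (and be saved with) purgatory. */
struct purgatory_state {
    uint32_t back_entry_a; /* kexec_jump_back_entry_A */
    uint32_t back_entry_b; /* kexec_jump_back_entry_B; 0 = not yet known */
};

/*
 * Kernel A enters purgatory. Saves A's return address and returns the
 * address to jump to: B's boot entry the first time, B's saved re-entry
 * point on later jumps.
 */
static uint32_t purgatory_from_a(struct purgatory_state *p,
                                 uint32_t return_addr_a,
                                 uint32_t b_boot_entry)
{
    p->back_entry_a = return_addr_a;
    return p->back_entry_b ? p->back_entry_b : b_boot_entry;
}

/*
 * Kernel B enters purgatory. Saves B's return address and returns the
 * address at which kernel A resumes.
 */
static uint32_t purgatory_from_b(struct purgatory_state *p,
                                 uint32_t return_addr_b)
{
    p->back_entry_b = return_addr_b;
    return p->back_entry_a;
}
```

Both saved addresses sit in `struct purgatory_state`, which makes the stated drawback concrete: a memory image of either kernel is incomplete without this state.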

Best Regards,
Huang Ying
--

From: Vivek Goyal
Date: Friday, May 16, 2008 - 6:36 am

That's a good point. Remembering the actual return points in purgatory
will require purgatory to be saved along with core file.

I think, purgatory is a good infrastructure for transitions between the
kernels but at the same time, here it is a matter of just making a "call"
and then inspecting the stack in kexec_jump_back_entry. IMHO, we can keep it
simple and not involve purgatory in later transitions.

Thanks
Vivek
--
