Re: [linux-pm] [PATCH -mm] kexec jump -v9

From: Huang, Ying
Date: Wednesday, March 5, 2008 - 8:13 pm

This is a minimal patch with only the essential features. All
additional features are split out and can be discussed later. I think
it may be easier to get consensus on this minimal patch.

Best Regards,
Huang Ying

------------------------------------>

This patch provides an enhancement to kexec/kdump. It implements
the following features:

- Jumping between the original kernel and the kexeced kernel.

- Backup/restore memory used by both the original kernel and the
  kexeced kernel.

- Save/restore CPU and device state before and after kexec.


The features of this patch can be used as follows:

- A simple hibernation implementation without ACPI support. You can
  kexec a hibernating kernel, save the memory image of original system
  and shutdown the system. When resuming, you restore the memory image
  of original system via ordinary kexec load then jump back.

- Kernel/system debugging by taking system snapshots. You can take a
  system snapshot, jump back, do something and take another system
  snapshot.

- Cooperative multi-kernel/system. With kexec jump, you can switch
  between several kernels/systems quickly, without a boot process
  except for the first time. This looks like swapping a whole
  kernel/system out and in.

- A general method to call a program in physical mode (paging turned
  off). This can be used to invoke BIOS code under Linux.


The following user-space tools can be used with kexec jump:

- kexec-tools needs to be patched to support kexec jump. The patches
  and the precompiled kexec can be downloaded from the following URLs:
       source: http://khibernation.sourceforge.net/download/release_v9/kexec-tools/kexec-tools-src_gi...
       patches: http://khibernation.sourceforge.net/download/release_v9/kexec-tools/kexec-tools-patche...
       binary: http://khibernation.sourceforge.net/download/release_v9/kexec-tools/kexec_git_kh9

- makedumpfile with patches is used as the memory image saving tool; it
  can exclude free pages from ...
From: Vivek Goyal
Date: Tuesday, March 11, 2008 - 2:10 pm

Hi Huang,

This patchset is slowly getting better. True that first we need to come
up with minimal infrastructure patch and then think of building more

The main usage of this functionality is for hibernation. I am not sure
what has been the conclusion of previous discussions.

Rafael/Pavel, does the approach of doing hibernation using a separate
kernel hold promise?


Why not store the entry point in dump.elf itself, instead of storing it
in a separate file?

I think this is more like a resumable core file, similar to what qemu
does when resuming an already booted kernel image.
So we might have to introduce an ELF_NOTE to mark an image as resumable

How are the memory segments of dump.elf loaded? The normal kexec way? Are
the memory segments of dump.elf first stored somewhere and then moved to
their destination at "kexec -e" time?

Does this really work? If we have 4G RAM, what will be the size of
dump.elf? And when we load it back for resuming, do we have sufficient
memory left?

Maybe we can have a separate load flag (--load-resume-image) to mark
that we are resuming a hibernated image, so that kexec does not have to
prepare the command line, zero page/setup page etc.


I have thought it through again and tried to put together some of the
new kexec options we can introduce to make the whole thing work. I am
considering a simple case where a user boots kernel A and then
launches kernel B using "kexec --load-preserve-context". Now the user
might save the hibernated image or might want to come back to A.

- kexec -l <kernel-image>
        Normal kexec functionality. Boot a new kernel, without preserving
        existing kernel's context.

- kexec --load-preserve-context <kernel-image>
        Boot a new kernel while preserving existing kernel's context.

        Will be used for booting kernel B for the first time.

- kexec --load-resume-image <resumable-core>
        Resumes a hibernated image. Load an ELF64 hibernated ...
From: Nigel Cunningham
Date: Tuesday, March 11, 2008 - 2:59 pm

Hi all.

I hope kexec turns out to be a good, usable solution. Unfortunately,
however, I still have some areas where I'm not convinced that kexec is
going to work or work well:

1. Reliability.

It's being sold as a replacement for freezing processes, yet AFAICS it's
still going to require the freezer in order to be reliable. In the
normal case, there isn't much of an issue with freeing memory or
allocating swap, and so these steps can be expected to progress without
pain. Imagine, however, the situation where another process or processes
are trying to allocate large amounts of memory at the same time, or the
system is swapping heavily. Although such situations will not be common,
they are entirely conceivable, and any implementation ought to be able
to handle such a situation efficiently. If the freezer is removed, any
hibernation implementation - not just kexec - is going to have a much
harder job of being reliable in all circumstances. AFAICS, the only way
a kexec based solution is going to be able to get around this will be to
not have to allocate memory, but that will require permanent allocation
of memory for the kexec kernel and its work area as well as the
permanent, exclusive allocation of storage for the kexec hibernation
implementation that's currently in place (making the LCA complaint about
not being able to hibernate to swap on NTFS on fuse equally relevant). 

While this might be feasible on machines with larger amounts of memory
(you might validly be able to argue that a user won't miss 10MB of RAM),
it does make hibernation less viable or unviable for systems with less
memory (embedded!). It also means that there are 10MB of RAM (or
whatever amount) that the user has paid good money for, but which are
probably only used for 30s at a time a couple of times a day.

Any attempt to start to use storage available to the hibernating kernel
is also going to have these race issues.

2. Lack of ACPI support.

At the moment, no one is going to want to use kexec ...
From: Huang, Ying
Date: Tuesday, March 11, 2008 - 7:14 pm

As Eric said, kexec needs to allocate memory only during loading, not

I think this can be avoided, for example by preallocating some hard disk
space (such as a dedicated hibernation file, the block list is loaded by

ACPI is the biggest challenge for kexec based hibernation. I will try to
deal with it. But for most people, ACPI is not a big issue. This is

No, the newest implementation does not need to boot a different kernel or
a different bootloader entry. You just use one bootloader entry; it will
resume if there's an image and boot normally if there's not. You can
look at the newest hibernation example description.

And the new method can even be used to load hibernation image of uswsusp

Best Regards,
Huang Ying

--

From: Vivek Goyal
Date: Wednesday, March 12, 2008 - 11:53 am

Yes. But this memory gets reserved at loading time and then this memory
remains unused for the whole duration (except hibernation).

In the example you gave, it looks like you are reserving 15MB of memory for
the second kernel. In practice, we were finding it difficult to boot a regular
kernel in 16MB of memory in kdump. We are now reserving 128MB of memory
for the kdump kernel on x86; otherwise the OOM killer kicks in during init
or while the core is being copied.

Kexec based hibernation does not look any different than kdump in terms
of memory requirements. The only difference seems to be that kdump does
the contiguous memory reservation at boot time and kexec based hibernation
does the memory reservation at kernel loading time.

The only difference I can think of is, kdump will generally run on servers
and hibernation will be required on desktops/laptops and run time memory
requirements might be little different. I don't have numbers though.

At the same time carrying a separate kernel binary just for hibernation
purposes does not sound very good.
  

The following is the step from the new method you have given.

7. Boot kernel compiled in step 1 (kernel C). Use the rootfs.gz as
   root file system.

This says to use rootfs.gz as the initrd. Without modifying the boot
loader entry, how would I switch the initrd dynamically?

Looks like it might be a typo. So basically we can just boot back into
normal kernel and then a user can load the resumable core file and kexec
to it?

I think all this functionality can be packed into normal initrd itself
to make user interface better.

A user can configure the destination for the hibernated image at system
installation time, and the initrd will be modified accordingly, both to save
the hibernated image and to check the user-specified location to find out
whether a hibernation image is available and needs to be resumed.
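
The resume-or-boot decision described above could look roughly like the
following. This is only a minimal sketch, not code from the patch or from
any real initrd: the function name and the image location are made up for
illustration, and a real initrd would do this from its init script.

```python
import os
import tempfile

def choose_boot_action(image_path):
    """Return "resume" when a non-empty image file exists, else "normal"."""
    try:
        if os.path.getsize(image_path) > 0:
            return "resume"
    except OSError:
        # Missing or unreadable image: fall through to a normal boot.
        pass
    return "normal"

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical location; in practice this would be the destination
    # the user configured at installation time.
    image = os.path.join(tmp, "hibernate.img")
    print(choose_boot_action(image))   # no image yet
    with open(image, "wb") as f:
        f.write(b"\x7fELF")            # stand-in for a saved image
    print(choose_boot_action(image))   # image present
```

With this shape a single boot loader entry is enough, because the choice
is made inside the initrd rather than in the boot loader.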

Thanks
Vivek
--

From: Eric W. Biederman
Date: Wednesday, March 12, 2008 - 5:01 pm

Sounds like something we may want to fix.  Living at the default kernel

One difference is you only get the memory penalty just before you hibernate,
instead of continuously.  So potentially you could swap out things to

Yes.  And we don't need to load any of this until just before hibernation
time so we should be able to change things right up until the last moment.

Eric
--

From: david
Date: Tuesday, March 11, 2008 - 5:09 pm

I don't see any reason why this couldn't be done with an initrd to decide 
if you are doing a normal boot or a restore.

David Lang
--

From: Eric W. Biederman
Date: Tuesday, March 11, 2008 - 4:55 pm

Right.  I can address the memory concerns with a kexec based approach
as they are core to kexec and completely orthogonal to the rest.

A kexec is done in two passes.  The first to load the target kernel
and do whatever memory allocation is needed.  The second to actually
switch which kernel is running.

Using a linux kernel to save off the image, or in any other way be the
target, is not required; it is simply the sane thing to do in a general
implementation.  An embedded developer could likely implement a save
to disk routine in a couple of hundred lines of C and a couple of K

Yep.  Although disk storage is frequently less expensive, and more
readily available, so this is less of an issue.  Still it does suggest


I completely agree here.

Eric
--

From: Huang, Ying
Date: Tuesday, March 11, 2008 - 6:45 pm

Yes. The entry point should be saved in dump.elf itself, this can be
done via a user-space tool such as "makedumpfile". Because
"makedumpfile" is also used to exclude free pages from disk image, it
needs a communication method between two kernels (to get backup pages
map or something like that from kernel A). We have talked about this
before.

- Your opinion is to communicate via the purgatory. (But I don't know
how to communicate between kernel A and purgatory).
- Eric's opinion is to communicate between the user space in kernel A
and user space in kernel B.
- My opinion is to communicate between the two kernels directly.

I think as a minimal infrastructure patch, we can communicate minimal
information between user space of two kernels. When we have consensus on
this topic, we can use makedumpfile for both excluding free pages and
saving the entry point. Now, we can save the entry point in a separate

Yes. Exactly. But during kexec loading, if the source page is same as

Yes. It really works. If we have 4G RAM, the size of dump.elf is 4G -
(memory area used by second kernel), in this example, it is 4G - 16M.
The loading kernel will live in 16M memory, and load dump.elf into all

There is already a similar flag in the original kexec-tools implementation:
"--args-none". If it is specified, kexec-tools does not prepare the command
line, zero page/setup page etc. I think we can just re-use this flag.

In original kexec-tools, this can be done through:
kexec -l --args-none <resumable-core>


In current implementation, this can be done through:
kexec --load-jump-back-helper --entry <entry-point>.


Best Regards,
Huang Ying
--

From: Vivek Goyal
Date: Wednesday, March 12, 2008 - 12:47 pm

On Wed, Mar 12, 2008 at 09:45:26AM +0800, Huang, Ying wrote:


Ok, we can get rid of --load-resume-image and go with Eric's idea
of detecting the image type and taking action accordingly.

Thanks
Vivek
--

From: Eric W. Biederman
Date: Tuesday, March 11, 2008 - 7:17 pm

Purgatory is for all intents and purposes user space.  Because the
return address falls on the trampoline page we won't know its
address before we call kexec.  But a return address and a stack

We need a fixed protocol so we do not make assumptions about how things
will be implemented, allowing kernels to diverge and all kinds of other
good things.

For communicating extra information from the kernel being shut down
we have elf notes.

Direct kernel to kernel communication is forbidden.  We must have
a well defined protocol.  Allowing the implementations to change

My gut feel is we look at the image and detect what kind it is, and simply
not enable image processing after we have read the note that says it

Make common cases fast to use.  The UI equivalent of make the
common case fast.

Eric

--

From: Huang, Ying
Date: Tuesday, March 11, 2008 - 11:54 pm

This sounds reasonable. But after some initial attempts I found it is
fairly difficult for me to define a communication protocol that is backward
compatible with the original kexec/kdump, does its work in user space as far
as possible, and deals with some special scenarios (such as: A kexecs B, then
B kexecs C). So I will try my best to work on this, and propose a
communication protocol combining the proposals from you and Vivek in

Yes. This sounds good.

Best Regards,
Huang Ying

--

From: Vivek Goyal
Date: Wednesday, March 12, 2008 - 12:37 pm

I think he needs to pass on much more data than just return address. 

IIUC, he needs to pass the backup pages map to the new kernel, so that any
user space tool can use the backup pages map to reconstruct/rearrange the
first kernel's memory core and tools like makedumpfile can do filtering
before the hibernated image is saved.

This brings me to a random thought. Can we break the process of loading
a hibernation kernel into two steps?

- In the first step, just do the memory reservation for running the second
  kernel.
  (kexec -l <dummy-file-for-reserving-memory>)

- This memory map of reserved pages is exported to user space.

- Use this memory map and regenerate the hibernation kernel initrd
  (rootfs.gz) and put the memory map there. This memory map can be used
  by makedumpfile in second kernel for filtering.

This way it will be user space to user space communication of information

Agreed. Without a proper protocol, we will often run into issues that
X version of kernel does not work with Y version of hibernation kernel

That makes sense. Just that we shall have to put some kind of ELF NOTE
or some other identifier in resumable core file to identify it.

Thanks
Vivek
--

From: Huang, Ying
Date: Friday, March 14, 2008 - 1:03 am

Doing the kexec load in two steps is a possible solution. Although this is a
little complex, we can wrap the two steps into one /sbin/kexec invocation.
That is, when running /sbin/kexec --load-preserve-context
<kernel-image>, /sbin/kexec first calls sys_kexec_load() to load the
kernel image and reserve memory, then amends the memory image of the loaded
kernel (B) according to the new information available, such as the return
address and the backup pages map. For this solution, what still needs to
be solved is how to pass some information back from kernel B
(hibernating kernel) to kernel A (original kernel) and how to pass some
information from kernel C (resuming kernel) to kernel A (original
kernel).

-----------------------------------------------------------------

Another possible solution to pass information between kernels (in user
space): the needed information from the kernel is passed on the stack, and a
special ELF_NOTES is used to access the information in the peer kernel.
Details are as follows:

1. Possible information that needs to be passed:

1.1 From user space (known before sys_kexec_load):

a. ELF core header
b. vmcoreinfo (pointer only)

1.2 From kernel space (known after sys_kexec_load):

a. jump back entry (return address)
b. backup pages map


2. When jumping from kernel A to kernel B:

2.1 In /sbin/kexec --load-preserve-context <kernel-image>, /sbin/kexec
allocates a special ELF_NOTES (ELF NOTES kernel) for information from
kernel space.

2.2 When doing sys_reboot(REBOOT_CMD_KEXEC), the kernel puts the needed
information and the physical address of the ELF core header onto the stack
just before jumping to purgatory.

2.3 After jumping to purgatory, purgatory fills "ELF NOTES kernel" with
the corresponding address on the stack.

2.4 When kernel B is booted, /proc/vmcore is created and the information
from ELF NOTES kernel is available too.


3. When jumping back from kernel B to kernel A and jumping from kernel C
to kernel A:

3.1 Same as 2.1

3.2 Same as 2.2, but there is no purgatory in kernel A, so ...
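
To make the ELF note idea above concrete, here is a minimal sketch of
how a "jump back entry" and a "backup pages map" address could be packed
into, and read back from, a standard ELF note record (namesz, descsz,
type, then name and descriptor, each padded to 4 bytes). The note name
"KEXEC_JUMP", the type value 1, the two-field payload layout and the
little-endian byte order are all assumptions made for illustration; the
thread does not define any of them.

```python
import struct

def pad4(data):
    """Pad a byte string with NULs to a multiple of 4 bytes."""
    return data + b"\0" * (-len(data) % 4)

def pack_elf_note(name, note_type, desc):
    """Build one ELF note: 3 x 32-bit header, padded name, padded desc."""
    name_bytes = name.encode("ascii") + b"\0"   # namesz counts the NUL
    header = struct.pack("<III", len(name_bytes), len(desc), note_type)
    return header + pad4(name_bytes) + pad4(desc)

def unpack_elf_note(blob):
    """Parse the note at the start of blob into (name, type, desc)."""
    namesz, descsz, note_type = struct.unpack_from("<III", blob, 0)
    name_off = 12
    desc_off = name_off + namesz + (-namesz % 4)
    name = blob[name_off:name_off + namesz - 1].decode("ascii")
    desc = blob[desc_off:desc_off + descsz]
    return name, note_type, desc

# Hypothetical payload: jump back entry and backup pages map address,
# both stored as 64-bit values.
payload = struct.pack("<QQ", 0x100000, 0x200000)
note = pack_elf_note("KEXEC_JUMP", 1, payload)

name, note_type, desc = unpack_elf_note(note)
entry, backup_map = struct.unpack("<QQ", desc)
print(name, note_type, hex(entry), hex(backup_map))
```

The writer would correspond to /sbin/kexec (or purgatory) in kernel A
and the reader to a user space tool in kernel B; only the record layout
is standard, everything carried inside it is a placeholder.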
From: Vivek Goyal
Date: Friday, March 21, 2008 - 12:12 pm

Hi Huang,

I am kind of ok with both the methods.

- Communicate information between two kernels using an ELF NOTE
  prepared by kernel.

- Communicate information between user space tools using initrd.

But which method to use will depend on what information we want to 
exchange between two kernels. 

For example, re-entry points can be on stack or in ELF NOTE.

Backup page map probably can be communicated using initrd as only user
space need to access that (ELF Core headers can be put in a memory area
which is not swapped during transition from kernel A to B. This way
kernel B never needs to know that kernel A had done some swapping of
pages?). 

So far I have understood only the following.

1. We need to pass around entry/re-entry points between kernels.

2. We need to pass backup pages map from kernel A to kernel B, so that user
  space tool can do filtering.

3. We need to pass address of ELF core headers from kernel A to kernel B so
  that a valid vmcore of kernel A can be exported.

	- For first time boot of kernel B, address of ELF core header is
	  passed through command line.

	- For re-entry into B, ELF core header address can be passed
 	  using some register, or on stack or using kernel ELF NOTE.

What else? What information do we need to communicate from kernel B to 
kernel A or from kernel C to kernel A?

I am sure that you have told it in the past. Just that I don't recollect
it.

Thanks
Vivek
--

From: Huang, Ying
Date: Tuesday, March 25, 2008 - 12:25 am

On Fri, 2008-03-21 at 15:12 -0400, Vivek Goyal wrote:

I think the ELF_NOTES mechanism is sufficient for communication between
the two kernels, because it can be written from a user space tool in
kernel A (/sbin/kexec via sys_kexec_load) and read from a user space tool
in kernel B (via /proc/vmcore). It can be used as a user space
communication mechanism. So I think it may not be necessary to
communicate via the initrd.

If we want to load the hibernated image with sys_kexec_load (/sbin/kexec
-l), we must add a "multiple stage loading" feature to sys_kexec_load,
because the segments in the hibernated image can easily exceed
KEXEC_SEGMENT_MAX (16), considering there will be many memory
holes when free pages are excluded. Multiple sys_kexec_load calls must be
used to load a normal hibernated image. If multiple stage loading is
unavoidable, I think the better method to communicate information like
the "jump back entry" and "backup pages map" is "multiple stage loading",
as you said in a previous mail. And they can be encapsulated as
ELF_NOTES. So the only information that needs to be passed on the stack is address
ELF core headers are in destination memory range of kernel B, so they
can be accessed by kernel B directly without knowing pages swapping in

For now, there is no information that needs to be passed from kernel B/C to
kernel A. But I think in the future there will be some ACPI related
information that needs to be passed in this way, such as from kernel C to
kernel A: whether the system is restored from ACPI S4 or ACPI S5. So I think
it is necessary to make it possible to pass some information from kernel
B/C to kernel A. But I think an ELF core header and some memory is
sufficient to do this.

Best Regards,
Huang Ying

--

From: Rafael J. Wysocki
Date: Tuesday, March 11, 2008 - 3:18 pm

Well, what can I say?

I haven't been a big fan of doing hibernation this way since the very beginning
and I still have the same reservations.  Namely, my opinion is that the
hibernation-related problems we have are not just solvable this way.  For one
example, in order to stop using the freezer for suspend/hibernation we first
need to revamp the suspending/resuming of devices (under way) and the
kexec-based approach doesn't help us here.  I wouldn't like to start another
discussion about it though.

That said, I can imagine some applications of the $subject functionality
not directly related to hibernation.  For example, one can use it for kernel
debugging (jump to a new kernel, change something in the old kernel's
data, jump back and see what happens etc.).  Also, in principle it may be used
for such things as live migration of VMs.

Thanks,
Rafael
--

From: Huang, Ying
Date: Tuesday, March 11, 2008 - 7:26 pm

Yes. We need to work on device drivers for all hibernation
implementations. And kexec-based hibernation provides a possible method
to avoid the freezer after the driver work is done.

Best Regards,
Huang Ying

--

From: Eric W. Biederman
Date: Tuesday, March 11, 2008 - 7:02 pm

Agreed.  At best all this does is move the policy on how to save the kernel

Also such things as calling BIOS services or EFI services on x86_64.  Where
vm86 is not useful.

So in principle I think a kexec with return is a logical extension to
the current kexec functionality.  

That said it looks like it will be next month before I have time to do a
reasonable job of reviewing the current patches.

Eric
--

From: Pavel Machek
Date: Tuesday, March 11, 2008 - 4:24 pm

It's certainly a "more traditional" method of doing hibernation than the
tricks swsusp currently plays. Yes, I'd like these patches to go in;
being able to switch kernels seems like a useful tool.

Now, I guess there are some difficulties, like ACPI integration, and
some basic drawbacks, like the few seconds needed to boot the second kernel
during suspend.

...OTOH this is probably the only chance to eliminate the freezer from
swsusp...

Yes, I'd like to see this go ahead.

No, this does not make swsusp obsolete just yet.
								Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Nigel Cunningham
Date: Tuesday, March 11, 2008 - 5:00 pm

Hi.


I think eliminating the freezer and having reliable hibernation under
load look like incompatible goals at the moment. Do you see that as 'not
a problem' or have some idea on how that issue can be addressed?

Regards,

Nigel

--

From: Rafael J. Wysocki
Date: Tuesday, March 11, 2008 - 4:49 pm

Some facts:

* In order to be able to do suspend (STR) without the freezer, we need to make
  device drivers block access to devices from applications during suspend.
* There's no reason to think that we can't use this same mechanism for
  hibernation (the only difficulty seems to be the handling of devices used for
  saving the image).
* We need the drivers to quiesce devices to be able to do the kexec jump in the
  first place (and to avoid races, we'll need them to block applications'
  access to devices just like for STR, which is the sufficient condition for
  removing the freezer).

So, I don't really think that the "freezer removal" argument is valid here.

Moreover, if this had been the _only_ argument for the $subject functionality,
I'd have been against it.

Thanks,
Rafael
--

From: Huang, Ying
Date: Tuesday, March 11, 2008 - 6:55 pm

Long long ago, hibernation was not done by the Linux kernel itself but by
the BIOS (APM). In those days, the kernel just did some preparation and
jumped to the BIOS to do the hibernation. Imagine kernel B is the
hibernation BIOS:
kernel A does some preparation and jumps to the BIOS (kernel B) just like the

I think "kexec based hibernation" is the only currently available
method to write out the image without the freezer (after the driver work
is done). If other processes are running, how do we prevent them from writing

Best Regards,
Huang Ying

--

From: Alan Stern
Date: Wednesday, March 12, 2008 - 8:01 am

This is a very good question.

It's a matter of managing the block layer's request queues.  Somehow 
the existing I/O requests must remain blocked while the requests needed 
for writing the image must be allowed to proceed.

I don't know what would be needed to make this work, but it ought to be 
possible somehow...

Alan Stern

--

From: Rafael J. Wysocki
Date: Wednesday, March 12, 2008 - 2:53 pm

Yes, it ought to be possible.

Ultimately, IMHO, we should put all devices unnecessary for saving the image
(and doing some eye-candy work) into low power states before the image is
created and keep them in low power states until the system is eventually
powered off.

If this is done, the remaining problem is the handling of the devices that we
need to save the image.  I believe that will be achievable without using the
freezer.

Thanks,
Rafael
--

From: Eric W. Biederman
Date: Wednesday, March 12, 2008 - 5:33 pm

Why?  I guess I don't see why we care what power state the devices are in.
Especially since we should be able to quickly save the image.

We need to disconnect the drivers from the hardware yes.  So filesystems
still work and applications that do direct hardware access still work
and don't need to reopen their connections.

I'm leery of low power states as they don't always work, and bringing
low power states seems to confuse hibernation to disk with suspend to

Reasonable.  In general the problem is much easier if we don't store
the hibernation image in a filesystem or partition that the rest of
the system is using.  That way we avoid inconsistencies.

Eric
--

From: Rafael J. Wysocki
Date: Thursday, March 13, 2008 - 10:03 am

From the ACPI compliance point of view it's better to do it this way.  We need
to put the devices into low power states anyway before "powering off" the
system and we won't need to touch them for the second time if we do that
in advance.

Still, it would be sufficient if we disconnected the drivers from the hardware
and thus prevented applications from accessing that hardware.

Thanks,
Rafael
--

From: Eric W. Biederman
Date: Thursday, March 13, 2008 - 4:07 pm

Interesting.  For a kexec jump, where we exit the kernel and then return
to it, that seems to be all that is required.

I will have to look at the ACPI case.  That seems to be the linchpin

My gut feeling is that except for a handful of drivers we could even
get away with simply implementing hot unplug and hot replug.  Disks
are the big exception here.

Which suggests to me that it is at least possible that the methods we
want for a kexec jump hibernation may be different from an in-kernel
hibernation and quite possibly are easier to implement.

Eric
--

From: Rafael J. Wysocki
Date: Thursday, March 13, 2008 - 6:31 pm

Yes and that's because ACPI regards hibernation as a _sleep_ state, something
more like S3 (suspend to RAM) than S5 (power off).

In fact even now we're doing things that are strange from the ACPI standpoint.
For example, we should really execute _PTS once during the entire transition
and we shouldn't call _WAK after we've created the image.  We're doing that

I'm not sure about the "easier" part, quite frankly.  Also, with our current
ordering of code the in-kernel hibernation will need the same callbacks
as the kexec-based thing.  However, with the in-kernel approach we can
attempt (in the future) to be more ACPI compliant, so to speak, but with the
kexec-based approach that won't be possible.

Whether it's a good idea to follow ACPI, as far as hibernation is concerned, is
a separate question, but IMO we won't be able to answer it without _lots_ of
testing on various BIOS/firmware configurations.  Our experience so far
indicates that at least some BIOSes expect us to follow ACPI and misbehave
otherwise, so for those systems there should be an "ACPI way" available.
[Others just don't work well if we try to follow ACPI and those may be handled
using the kexec-based approach, but that doesn't mean that we can just ignore
the ACPI compliance issue, at least for now.]

Thanks,
Rafael
--

From: Pavel Machek
Date: Wednesday, March 12, 2008 - 1:57 am

Well, traditionally it is 'A saves B to disk' (like bootloader saves
kernel&userspace). In swsusp we have 'kernel saves itself'... which


Fortunately it's not the only one :-).

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Pavel Machek
Date: Wednesday, April 9, 2008 - 2:34 am

Eric, can we get some reviewing/merging going on? Patches seem pretty
clean to me, and I do not think holding them outside mainline will

Acked-by: Pavel Machek <pavel@suse.cz>

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Vivek Goyal
Date: Wednesday, April 9, 2008 - 5:30 am

Pavel, I think Huang is working on v10 based on feedback last time. So
probably we can get into line by line review in next version.

Thanks
Vivek
--

From: Vivek Goyal
Date: Wednesday, May 14, 2008 - 9:03 am

Hi Huang,

Ok, after a long time, I am back to testing and reviewing this patch.



How do I go back to the original kernel without loading dump.elf? I mean,
the original kernel is already in memory and I don't have to first save
it to disk and then reload it. Is there a way to do it? If not, then
we need to modify kexec-tools to support that.

Something like

kexec --entry=<entry point>, should tell kexec that kernel is already
loaded. Just do the bit to set the entry point properly.

Thanks
Vivek
--

From: Vivek Goyal
Date: Wednesday, May 14, 2008 - 10:49 am

Never mind. I found it. Following worked for me for returning back to
original kernel.

kexec --load-jump-back-helper --entry=<entry point>

Just wondering if "--load-jump-back-helper" should be an explicit option
or kexec should silently assume it if no "-l" or "-p" is given.

Thanks
Vivek
--

From: Vivek Goyal
Date: Wednesday, May 14, 2008 - 1:52 pm

Hi Huang,

Ok, I have done some testing on this patch. Currently I have just
tested switching back and forth between two kernels and it is working for
me.

Just that I had to put LAPIC and IOAPIC in legacy mode for it to work. Few
comments/questions are inline.


Upon re-entering the kernel, what happens to the GDT? Will the gdtr be
pointing to the GDT of the other kernel (which is not there as pages have
been swapped)? Do we need to reload the gdtr upon re-entering the kernel?


What does the above piece of code do? It looks redundant for switching
between the kernels. After call *%edx, we never return here. Instead
we come back to "kexec_jump_back_entry"?



I am not sure what the helper jump back entry is. I understand that you
are using %edi to pass around the entry point between two kernels. Can
you please shed some more light on this?

Thanks
Vivek
--

From: Huang, Ying
Date: Wednesday, May 14, 2008 - 7:32 pm

On Wed, 2008-05-14 at 16:52 -0400, Vivek Goyal wrote:

Thanks.


After re-entering the kernel and returning from machine_kexec,
restore_processor_state() is called, where the GDTR and some other CPU

For switching between the kernels, this is redundant. Originally another
feature of kexec jump is to call some code in physical mode. This is
used to provide a C ABI to called code.

Now Eric suggests using a C ABI compatible way to pass the jump back
entry point too, that is, using the return address on the stack instead of
%edi. I think that is reasonable. Maybe we can revise this code to be
compatible with the C ABI and provide a convenient interface for both kernel

The helper jump back entry is used to provide a C ABI to physical mode
code other than a kernel. It is the redundant code mentioned above.

Best Regards,
Huang Ying

--

From: Vivek Goyal
Date: Thursday, May 15, 2008 - 1:09 pm

Hi Huang,

OK, you want to make BIOS calls. We already do that using vm86 mode and
BIOS real mode interrupts. So why do we need this interface? Or, IOW,
how is this interface better?

Do you have something in mind for where/how you are going to use it?

Thanks
Vivek
--

From: Huang, Ying
Date: Thursday, May 15, 2008 - 6:48 pm

On Thu, 2008-05-15 at 16:09 -0400, Vivek Goyal wrote:

It can call code in 32-bit physical mode in addition to real mode. So it
can be used to call EFI runtime services, especially to call EFI 64 runtime
services under a 32-bit kernel or vice versa.

The main purpose of kexec jump is hibernation. But I think that if the
effort is small, why not support general 32-bit physical mode code calls
at the same time?

Best Regards,
Huang Ying

--

From: Vivek Goyal
Date: Thursday, May 15, 2008 - 6:51 pm

In general, what are the environment requirements for EFI runtime
services? I mean, is it enough that the processor is in protected mode with
paging disabled, or does one need to stop all other CPUs and devices and
then make the call (as we are doing in this case)?

Thanks
Vivek
--

From: Huang, Ying
Date: Thursday, May 15, 2008 - 7:08 pm

Putting the processor in protected mode with paging disabled is sufficient.
In one of the previous kexec jump versions, I provided an option to choose
the state saved (whether to stop other CPUs, whether to stop devices).

I agree that now we should focus on kexec based hibernation. But I think
it is reasonable to keep the possibility with minimal effort.

Best Regards,
Huang Ying

--

From: Pavel Machek
Date: Friday, May 16, 2008 - 5:13 am

I believe we should focus on kexecing kernels, first.

Only way to prove the effort is small is by having small followup
patch, and that needs the two patches separated...
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Huang, Ying
Date: Wednesday, May 14, 2008 - 10:41 pm

Hi, Vivek,

On Wed, 2008-05-14 at 16:52 -0400, Vivek Goyal wrote:

It seems that for LAPIC and IOAPIC, there is
lapic_suspend()/lapic_resume() and ioapic_suspend()/ioapic_resume(),
which will be called before/after kexec jump through
device_power_down()/device_power_up(). So, the mechanism for
LAPIC/IOAPIC is there, we may need to check the corresponding
implementation.

Best Regards,
Huang Ying
--

From: Eric W. Biederman
Date: Thursday, May 15, 2008 - 11:42 am

And if you start with the device shutdown path the code is already
there and working.

Eric
--

From: Vivek Goyal
Date: Thursday, May 15, 2008 - 5:51 pm

ioapic_suspend() is not putting APICs in Legacy mode and that's why
we are seeing the issue. It only saves the IOAPIC routing table entries
and these entries are restored during ioapic_resume().
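
The save/restore behaviour described above can be sketched in user-space C
as follows. This is only an illustration: the fake register array, the pin
count and the function names are assumptions, not the kernel's actual
ioapic_suspend()/ioapic_resume() code. The point it shows is that the
suspend step copies the routing entries and leaves the live table (and so
the interrupt delivery mode) untouched.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_NR_PINS 24	/* assumed pin count, for illustration only */

/* Stands in for the IOAPIC redirection table that lives in hardware. */
static uint64_t fake_ioapic_regs[SKETCH_NR_PINS];
/* Copy of the routing entries kept across the transition. */
static uint64_t saved_entries[SKETCH_NR_PINS];

/* Save the routing entries; the live table is deliberately not modified. */
static void ioapic_suspend_sketch(void)
{
	assert(sizeof(saved_entries) == sizeof(fake_ioapic_regs));
	memcpy(saved_entries, fake_ioapic_regs, sizeof(saved_entries));
}

/* Write the saved routing entries back into the live table. */
static void ioapic_resume_sketch(void)
{
	memcpy(fake_ioapic_regs, saved_entries, sizeof(fake_ioapic_regs));
}
```

Because nothing here reprograms the table into a legacy (ExtINT) setup, a
kernel entered between the two calls would still see the first kernel's
routing, which matches the symptom reported above.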

But I think somebody has to put the APICs in legacy mode for normal
hibernation too. I am not sure who does it. Maybe the BIOS, so that during
resume the second kernel can get timer interrupts.

Thanks
Vivek
--

From: Eric W. Biederman
Date: Thursday, May 15, 2008 - 6:35 pm

I doubt anything cares in the suspend to ram case. There should just
be a small BIOS trampoline to get back to linux when the processor
restarts.  And you don't need interrupts for any of that. 

Eric
--

From: Huang, Ying
Date: Thursday, May 15, 2008 - 6:55 pm

As far as I know, in suspend to RAM an interrupt is used as the wake-up
event, for example a keyboard interrupt.

Best Regards,
Huang Ying
--

From: Huang, Ying
Date: Tuesday, May 27, 2008 - 12:27 am

As for IOAPIC legacy mode, is it related to the following code, which sets
the routing table entry for the i8259?


void disable_IO_APIC(void)
{
        /*
         * Clear the IO-APIC before rebooting:
         */
        clear_IO_APIC();

        /*
         * If the i8259 is routed through an IOAPIC
         * Put that IOAPIC in virtual wire mode
         * so legacy interrupts can be delivered.
         */
        if (ioapic_i8259.pin != -1) {
                struct IO_APIC_route_entry entry;

                memset(&entry, 0, sizeof(entry));
                entry.mask            = 0; /* Enabled */
                entry.trigger         = 0; /* Edge */
                entry.irr             = 0;
                entry.polarity        = 0; /* High */
                entry.delivery_status = 0;
                entry.dest_mode       = 0; /* Physical */
                entry.delivery_mode   = dest_ExtINT; /* ExtInt */
                entry.vector          = 0;
                entry.dest.physical.physical_dest =
                                        GET_APIC_ID(apic_read(APIC_ID));

                /*
                 * Add it to the IO-APIC irq-routing table:
                 */
                ioapic_write_entry(ioapic_i8259.apic, ioapic_i8259.pin, entry);
        }
        disconnect_bsp_APIC(ioapic_i8259.pin != -1);
}


But because the IOAPIC may need to stay in its original state during
suspend/resume, it is not appropriate to call disable_IO_APIC() in
ioapic_suspend(). So I think we can call disable_IO_APIC() in the new
hibernation/restore callback.

Am I right?

Best Regards,
Huang Ying

--

From: Vivek Goyal
Date: Tuesday, May 27, 2008 - 3:15 pm

My hunch is suspend/resume will still work if we put this call in
ioapic_suspend() but I would not recommend that. suspend/resume does
not need to put IOAPIC in legacy mode.
  
I am not sure what the "new hibernation/restore callback" is. Are you
referring to the new patches from Rafael?

I think this issue is specific to kexec and kjump, so we probably should
not be tweaking any suspend/resume related bits.

How about calling disable_IO_APIC() in kexec_jump()? We can probably even
optimize it by calling it only when we are transitioning into the new image
for the first time and not for subsequent transitions (by keeping some kind
of count in kimage). This is a little hackish, but it should work...
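
A minimal sketch of the counting idea above, written as user-space C so it
can be compiled on its own. Everything named here is hypothetical: struct
kimage_sketch, its jump_count field and disable_io_apic_stub() are stand-ins
invented for illustration and are not taken from the patch or from the real
struct kimage.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct kimage; only the counter is modelled. */
struct kimage_sketch {
	unsigned int jump_count;	/* jumps made into this image so far */
};

/* Records how often the stubbed APIC teardown ran. */
static unsigned int apic_disable_calls;

/* Stub standing in for disable_IO_APIC(); the real one touches hardware. */
static void disable_io_apic_stub(void)
{
	apic_disable_calls++;
}

/* Put the APICs in legacy mode only on the first jump into the image. */
static void kexec_jump_sketch(struct kimage_sketch *image)
{
	assert(image != NULL);
	if (image->jump_count == 0)
		disable_io_apic_stub();
	image->jump_count++;
	/* ... the actual jump to the other kernel would happen here ... */
}
```

With this shape, three calls to kexec_jump_sketch() on the same image run
the teardown stub once. Whether skipping it on later transitions is safe on
real hardware is exactly the open question in this thread.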

Thanks
Vivek
--

From: Huang, Ying
Date: Tuesday, May 27, 2008 - 6:35 pm

On Tue, 2008-05-27 at 18:15 -0400, Vivek Goyal wrote:

Yes. Rafael has a new patch to separate the suspend and hibernation device
callbacks.

Yes. This issue is kexec/kjump specific. We can call it in kexec_jump().
Maybe we also need to call something else in native_machine_shutdown()?

BTW: I have a new version -v10: http://lkml.org/lkml/2008/5/22/106, do
you have time to review it?

Best Regards,
Huang Ying
--

From: Eric W. Biederman
Date: Wednesday, May 14, 2008 - 3:30 pm

Tricky, and I expect unnecessary.

Ugh.  No.  Not sharing the shutdown methods with reboot and
the normal kexec path looks like a recipe for failure to me.

This looks like where we really need to have the conversation.
What methods do we use to shut down the system?

My take on the situation is this.  For proper handling we
need driver device_detach and device_reattach methods.

With the following semantics.  The device_detach methods
will disable DMA and place the hardware in a sane state
from which the device driver can reclaim and reinitialize it,
but the hardware will not be touched.

device_reattach reattaches the driver to the hardware.

So looking at this patch I see two very productive directions
we can go.
1) A patch that just fixes up the kexec infrastructure code
   so it implements the swap page and provides the kernel
   reentry point.  And doesn't handle the upper layer
   user interface portion.

2) A patch that renames device_shutdown to device_detach.
   And starts implementing the driver hooks needed from
   a resumable kexec.

Then we have the question what do we do with devices in the
kernel that don't have a device_reattach method, when we
expect to come back from a kexec.  The two choices are:
(a) fail the operations before we commit to anything.
(b) hotunplug/hotreplug the device.
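
Choice (a) can be sketched as a two-pass walk over the devices: validate
first, detach second. This is a user-space illustration under stated
assumptions. struct dev_sketch and its detach/reattach function pointers
mirror the method names proposed above but are hypothetical, not an
existing kernel interface, and the rollback on a failed detach is an
addition of this sketch rather than something the proposal specifies.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical device with the two proposed methods. */
struct dev_sketch {
	const char *name;
	int (*detach)(struct dev_sketch *dev);	/* 0 on success */
	int (*reattach)(struct dev_sketch *dev);	/* 0 on success */
	int detached;
};

/* Trivial example methods that only flip a flag. */
static int sketch_detach(struct dev_sketch *dev)
{
	dev->detached = 1;
	return 0;
}

static int sketch_reattach(struct dev_sketch *dev)
{
	dev->detached = 0;
	return 0;
}

/* Returns 0 if every device was detached, -1 if nothing was committed. */
static int detach_all_or_fail(struct dev_sketch *devs, size_t n)
{
	size_t i;

	assert(devs != NULL || n == 0);

	/* Pass 1: refuse before touching any device if a method is missing. */
	for (i = 0; i < n; i++)
		if (devs[i].detach == NULL || devs[i].reattach == NULL)
			return -1;

	/* Pass 2: detach; on failure, reattach what was already detached. */
	for (i = 0; i < n; i++) {
		if (devs[i].detach(&devs[i]) != 0) {
			while (i-- > 0)
				devs[i].reattach(&devs[i]);
			return -1;
		}
	}
	return 0;
}
```

The validation pass is what makes the operation fail "before we commit to
anything": a single device without a reattach method leaves every other
device untouched.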

With respect to device methods.  I don't think any of
the current power saving methods make sense.  Certainly
nothing that prepares the way for using weird ACPI states.

I don't think there is enough difference between
device_detach and device_shutdown for us to maintain two
separate methods; doing so seems to place an unreasonable
maintenance burden on device driver developers.

Eric
--

From: Rafael J. Wysocki
Date: Wednesday, May 14, 2008 - 4:55 pm

Well, it looks like we do similar things concurrently.  Please have a look 
here: http://kerneltrap.org/Linux/Separating_Suspend_and_Hibernation

Similar patches are in Greg's tree already.

Thanks,
Rafael
--

From: Eric W. Biederman
Date: Thursday, May 15, 2008 - 3:03 pm

Yes.  Part of the reason I wanted to separate these two conversations

Taking a look.

I just can't get past the fact that the only reason hibernation can
not use the widely implemented and tested probe/remove is because of
filesystems on block devices, and that you are proposing to add 4
methods for each and every driver to handle that case, when they
don't need ANYTHING!

I wonder how hard teaching the upper layers to deal with
hotplug/remove is?

The more I look at this the more I get the impression that
hibernation and suspend should be solved in separate patches.  I'm
not at all convinced that what is good for the goose is good for
the gander for things like your prepare method.

Hibernation seems to be an extreme case of hotplug.

Suspend seems to be just an extreme case of putting unused
devices in low power state.

....


I don't like the fact that these methods are power management specific.
How should this impact the greater kernel ecosystem?

+ * The externally visible transitions are handled with the help of the following
+ * callbacks included in this structure:
+ *
+ * @prepare: Prepare the device for the upcoming transition, but do NOT change
+ *	its hardware state.  Prevent new children of the device from being
+ *	registered after @prepare() returns (the driver's subsystem and
+ *	generally the rest of the kernel is supposed to prevent new calls to the
+ *	probe method from being made too once @prepare() has succeeded).  If
+ *	@prepare() detects a situation it cannot handle (e.g. registration of a
+ *	child already in progress), it may return -EAGAIN, so that the PM core
+ *	can execute it once again (e.g. after the new child has been registered)
+ *	to recover from the race condition.  This method is executed for all
+ *	kinds of suspend transitions and is followed by one of the suspend
+ *	callbacks: @suspend(), @freeze(), or @poweroff().
+ *	The PM core executes @prepare() for all devices before starting to
+ *	execute suspend callbacks ...
--

From: Rafael J. Wysocki
Date: Thursday, May 15, 2008 - 4:20 pm

Why exactly do you think that removing()/probing() devices just for creating
a hibernation image is a good idea?


This was discussed a lot with people who had exactly opposite opinions.




The names have been discussed too and I don't intend to change them now.


For many drivers @suspend will be equivalent to @freeze + @poweroff probably.

Also, @restore is not the same as @resume, because @restore cannot assume

And you are wrong.


No, I don't think so.  I don't want the driver to detach, but to quiesce the

That I can agree with, if I understood you correctly. :-)

Still, having more specialized callbacks is not generally bad IMO, they
can reuse the code just fine.

Thanks,
Rafael
--

From: Pavel Machek
Date: Friday, May 16, 2008 - 5:18 am

It looked _too_ hard when I was looking... at least if we are thinking
of "keep the filesystem mounted over unplug-replug".

Or do you have something else in mind?
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Alan Stern
Date: Friday, May 16, 2008 - 7:20 am

You are mixing together several distinct concepts: powering-down

Isn't that exactly what Rafael has been doing?  Why do you think the 


To a large extent that is true.  But there is more to it.

Normally, when a device is put in a low-power state while the system is 
running, it's expected that an I/O request will cause the device to 
return to full power.  But during suspend things don't work that way; 
instead I/O requests are supposed to be blocked and the device is 
supposed to remain at low power.
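
The difference described above can be sketched in user-space C. The names
below (struct pm_dev_sketch, submit_io_sketch(), the system_suspending
flag) are hypothetical and exist only for this illustration. The sketch
assumes a blocked request is simply counted, not queued for later replay.

```c
#include <assert.h>
#include <stddef.h>

enum dev_power_sketch { DEV_FULL_POWER, DEV_LOW_POWER };

/* Hypothetical device state, for illustration only. */
struct pm_dev_sketch {
	enum dev_power_sketch state;
	int system_suspending;	/* nonzero while a system-wide suspend runs */
	int blocked_requests;	/* I/O held back during system suspend */
};

/* Returns 1 if the request was serviced, 0 if it was blocked. */
static int submit_io_sketch(struct pm_dev_sketch *dev)
{
	assert(dev != NULL);

	if (dev->system_suspending) {
		/* System suspend: block the request, stay at low power. */
		dev->blocked_requests++;
		return 0;
	}

	/* Normal running: an I/O request brings the device to full power. */
	if (dev->state == DEV_LOW_POWER)
		dev->state = DEV_FULL_POWER;
	return 1;
}
```

The same request therefore has two outcomes depending on whether a
system-wide transition is in progress, which is the distinction being
drawn between runtime low-power states and suspend.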

There are other requirements too.  While the system is running, more or
less arbitrary devices can go to low power at more or less arbitrary
times (as appropriate, of course).  But during suspend, the PM core has
to arrange that _all_ devices go to low power.  This means, among other
things, that drivers have to be prevented from registering new devices
while the transition is taking place.  (Not to mention the races 
involved in registering a new child device at the same time as its 

I think you have missed the point of these methods.  What they do is
unrelated to pausing or unpausing; their main purpose is to prevent and

Why do we need this?  Why can't user-space-facing activities be stopped 

This isn't the same thing at all.  @freeze doesn't detach the driver 

?  Why are you worried about what the device sees?  The device can't 
even tell the difference between a detach and a period of time with no 
I/O.

Alan Stern

--

From: Huang, Ying
Date: Wednesday, May 14, 2008 - 6:42 pm

On Wed, 2008-05-14 at 15:30 -0700, Eric W. Biederman wrote:

OK, I will check this. Maybe we can move CPU state saving code into
relocate_new_kernel.


Yes. Current device PM callback is not suitable for hibernation (kexec
based or original). I think we can collaborate with Rafael J. Wysocki on

OK. I can separate the patch into two patches.

Best Regards,
Huang Ying

--

From: Rafael J. Wysocki
Date: Thursday, May 15, 2008 - 12:05 pm

Thanks, I'm also open for collaboration.  There will be a lot of work to do
related to the new callbacks, so any contribution is certainly welcome.

Thanks,
Rafael
--

From: Alan Stern
Date: Thursday, May 15, 2008 - 7:14 am

How would these differ from the already-existing remove and probe 
methods?

Alan Stern

--

From: Eric W. Biederman
Date: Thursday, May 15, 2008 - 1:48 pm

Honestly I would like for them not to, and they should be
proper factors of the remove and probe methods.

However we have a fundamental gotcha that we need to handle.
Logical abstractions on physical devices.

i.e.  How do we handle the case of a filesystem on a block
      device, when we remove the block device and then read it.

We have two choices.
1) We go through the pain of teaching the upper layers in the
   kernel of how to deal with hotplug and then we are sane
   when someone removes a usb stick accidentally before
   unmounting it and then reinserts the usb stick.

2) Teach the drivers how to do just the lower half of hotplug/remove.
   In which case, with the driver still present and presenting its
   upper layer queues, we have the driver relinquish its hardware
   and then later check to see if its hardware is still present
   and reinitialize it.

I don't know if anyone has looked at moving this to an upper layer.
Definitely a question worth asking.  The simpler we can make this
for driver authors the better.  Especially as that will make
the drivers more maintainable long term.

Eric
--

From: Alan Stern
Date: Thursday, May 15, 2008 - 2:07 pm

The filesystem code should then receive an error for any I/O operation
it tries to carry out.  That's what happens when you unplug a USB flash 

I don't understand.  Suppose you teach the filesystem layer about 
hot-unplugging.  So the user removes a USB stick before unmounting it, 
and when the filesystem tries to access the media it learns that the 
device is gone -- and the filesystem is gone with it.  How is that any 
better than getting an I/O error (apart from not filling the system log 

That's how usb-storage works in 2.4.  Linus told us to change it,
probably because there was no mechanism for removing the driver's data
structures after a device was unplugged.  They had to be kept around

Maybe you're talking about adding some sort of Persistent-Device
feature to the LVM?

In any event, I'm not sure why you brought all this up.  How is it
relevant to kexec or kexec jump?

Are you worried that there needs to be a way to tell drivers to quiesce 
their devices before doing the kexec?

Alan Stern

--
