This patch implements jumping between the kexeced kernel and the original kernel. A new reboot command, LINUX_REBOOT_CMD_KJUMP, is defined to trigger both the jump to (that is, the execution of) the new kernel and the jump back to the original kernel.

To support jumping between two kernels, the devices are put into a quiescent state (to be fully implemented) and the state of the devices and CPU is saved before jumping to the new kernel and before jumping back to the original kernel. After jumping back from the kexeced kernel, and after jumping to the new kernel, the state of the devices and CPU is restored accordingly. The device/CPU state save/restore code of software suspend is called to implement these functions.

To support jumping without preserving memory, one shadow backup page is allocated for each page used by the new (kexeced) kernel. During kexec_load, the image of the new kernel is loaded into the shadow pages. Before the new kernel is executed, the original pages and the shadow pages are swapped, so the contents of the original pages are backed up. Before jumping to the new (kexeced) kernel and after jumping back to the original kernel, the original pages and the shadow pages are swapped as well.

A jump back protocol is defined and documented.

Known issues

- A field is added to the Linux kernel real-mode header. This is temporary, and should be replaced after the 32-bit boot protocol and setup data patches are accepted.

- The suspend method of a device is used to put the device into a quiescent state. But if ACPI is enabled, this will also put devices into a low power state, which prevents the new kernel from booting. So ACPI must be disabled in both the original kernel and the kexeced kernel. This is planned to be resolved after the suspend method and the hibernate method are separated for devices, as proposed earlier on LKML.

- The NX (no-execute) bit should be turned off for the control page if available.

ChangeLog

-- 2007/9/19 --

1. Two reboot command are merge ...
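The shadow-page scheme described above can be illustrated with a toy model. This is only a sketch in Python, not the patch's code: pages are modelled as dictionary entries keyed by page number, and the function name `swap_pages` is chosen here for illustration. The point it shows is that the swap is its own inverse, so the same operation both installs the new kernel's pages and brings the original contents back.

```python
def swap_pages(memory, shadow):
    """Exchange every page that has a shadow copy with that copy, in place.

    `memory` maps page number -> contents of the live page.
    `shadow` maps page number -> contents of the shadow backup page.
    Both are toy stand-ins for physical pages.
    """
    for pfn in shadow:
        memory[pfn], shadow[pfn] = shadow[pfn], memory[pfn]


# Original kernel owns pages 0..2; the loader placed the new kernel's
# image for pages 1 and 2 into shadow pages.
memory = {0: b"orig-0", 1: b"orig-1", 2: b"orig-2"}
shadow = {1: b"new-1", 2: b"new-2"}

swap_pages(memory, shadow)   # before jumping: new image live, originals backed up
swap_pages(memory, shadow)   # after jumping back: originals live again
```

After the first call the shadow pages hold the original contents, which is the "backup" the description refers to. Pages without a shadow entry (page 0 here) are never touched.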
Seems like good enough for -mm to me. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html -
Hi Andrew. Andrew, if I recall correctly, you said a while ago that you didn't want another hibernation implementation in the vanilla kernel. If you're going to consider merging this kexec code, will you also please consider merging TuxOnIce? Regards, Nigel -- See http://www.tuxonice.net for Howtos, FAQs, mailing lists, wiki and bugzilla info. -
The theory is that kexec-based hibernation will mainly use preexisting kexec code and will permit us to delete the existing hibernation implementation. That's different from replacing it. -
Hi. TuxOnIce doesn't remove the existing implementation either. It can transparently replace it, but you can enable/disable that at compile time. Regards, Nigel -- Nigel Cunningham Christian Reformed Church of Cobden 103 Curdie Street, Cobden 3266, Victoria, Australia Ph. +61 3 5595 1185 / +61 417 100 574 Communal Worship: 11 am Sunday. -
Right. So we end up with two implementations in-tree. Whereas kexec-based-hibernation leads us to having zero implementations in-tree. See, it's different. -
Hi. That's not true. Kexec will itself be an implementation, otherwise you'd end up with people screaming about no hibernation support. And it won't result in the complete removal of the existing hibernation code from the kernel. At the very least, it's going to want the kernel being hibernated to have an interface by which it can find out which pages need to be saved. I wouldn't be surprised if it also ends up with an interface in which the kernel being hibernated tells it what bdev/sectors in which to save the image as well (otherwise you're going to need a dedicated, otherwise untouched partition exclusively for the kexec'd kernel to use), or what network settings to use if it wants to try to save the image to a network storage device. On top of that, there are all the issues related to device reinitialisation and so on, and it looks like there's greatly increased pain for users wanting to configure this new implementation. Kexec is by no means proven to be the panacea for all the issues. Regards, Nigel -- Nigel Cunningham Pastor Christian Reformed Church of Cobden Victoria, Australia +61 3 5595 1185 -
This has been done by kexec/kdump guys. There is a makedumpfile utility and vmcoreinfo kernel mechanism to implement this. We can just reuse the These can be done in user space. The image writing will be done in user Yes. Device reinitialisation is needed. But all in all, kexec based Configuration is a problem, we will work on it. But, because it is based on kexec/kdump instead of starting from scratch, the duplicated part between hibernation and kexec/kdump can be eliminated. Best Regards, Huang Ying -
Hi. You've already said that you are currently saving all pages. How are you going to avoid saving free pages if you don't get the information from the kernel That only complicates things more. Now you need to get the information on where to save the image from the kernel being saved, then transfer it to userspace after switching to the kexec kernel. That's more kernel code, not Regards, Nigel -- Nigel, Michelle and Alisdair Cunningham 5 Mitchell Street Cobden 3266 Victoria, Australia -
I have not tried "makedumpfile". It avoids saving free pages by checking the "mem_map" of the original kernel. I think there is nothing to prevent it from being used for kexec-based hibernation image writing. This is an example of duplicated effort between kexec/kdump and the original hibernation implementation. Both kexec/kdump and hibernation need to save a memory image without saving the free pages. This can be done once This is fairly simple in fact. For example, you can specify the bdev/sectors on the kernel command line when doing the kexec load, "kexec -l <...> --append='...'", then the image writing system can get it through Best Regards, Huang Ying -
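To make the command-line idea concrete, here is a minimal sketch of how an image writer might decode a device and sector list from a command-line string. Everything specific in it is an assumption made for illustration: the parameter name `hib_extents`, the `device:start+count,...` syntax, and the function name are hypothetical and do not come from the patch or from kexec-tools.

```python
def parse_image_location(cmdline, key="hib_extents"):
    """Return (device, [(start_sector, sector_count), ...]) or None.

    Expects a token of the hypothetical form
        hib_extents=/dev/sda2:2048+512,8192+1024
    somewhere in a space-separated kernel command line.
    """
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        if name != key or not sep:
            continue
        device, _, extent_list = value.partition(":")
        extents = []
        for item in extent_list.split(","):
            start, _, count = item.partition("+")
            extents.append((int(start), int(count)))
        return device, extents
    return None


cmdline = "root=/dev/sda1 ro hib_extents=/dev/sda2:2048+512,8192+1024"
location = parse_image_location(cmdline)
```

A heavily fragmented swap file would turn into a long extent list, so a scheme like this depends on the command line being allowed to grow accordingly.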
Hi. Sounds doable, as long as you can cope with long command lines (which shouldn't be a biggie). (If you've got a swapfile or parts of a swap partition already in use, it can be quite fragmented). Andrew, you're seeing that it really doesn't mean the removal of all hibernation code from the kernel being suspended, aren't you? (And if the kexec'd kernel is the same binary, then there's more code again). Regards, Nigel -- See http://www.tuxonice.net for Howtos, FAQs, mailing lists, wiki and bugzilla info. -
Hmm. This is an interesting problem. Sharing a swap file or a swap partition with the actual swap of user space pages does seem to be a limitation of this approach. Although the fact that it is simple to write to a separate file may More binary size yes not more code to maintain. As for the rest the current implementation is small enough and allows for enough beyond hibernation I think it makes sense to eventually merge assuming a good clean implementation can be achieved. Eric -
I'm not sure how you'd write it to a separate file. Notice that kjump kernel may not mount journalling filesystems, not even read-only. (Ext3 replays journal in that case). You could pass block numbers from the original kernel... Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html -
The ext3 thing is a bug, the case for which I don't think has been
adequately explained to the ext[34] folks. There should be at least a
no_replay mount flag available, or something. It has ramifications
for more than just hibernation.
And yeah, I'm gonna bring up the swap files thing again. If you
can hibernate to a swap file, you can hibernate to a dedicated
hibernation file, and vice versa.
If you can't hibernate to a swap file, then swap files are
effectively unsupported for any system you might want to hibernate.
<handwave> I wonder what embedded folks would think about that
</handwave>.
But, in my ignorance, I'm not sure even fixing the ext3 bug will
guarantee you consistent metadata so that you can handle a
swap/hibernate file. You can do a sync(), but how do you make that
not race against running processes without the freezer, or blkdev
snapshots?
I guess uswsusp and the-patch-previously-known-as-suspend2 handle
this somehow, though.
(It's that same ignorance that has me waiting for someone with
established credit with kernel people to make that argument for the
ext3 bug, so I can hang my own reasons for thinking that it's bad off
of theirs).
--
Joseph Fannin
jfannin@gmail.com
-
Hi. I haven't looked at swsusp support, but TuxOnIce handles all storage (swap partitions, swap files and ordinary files) by first allocating swap (if we're using swap), then bmapping the storage we're going to use. After that, we can freeze filesystems and processes with impunity. The allocated storage is then viewed as just a collection of bdevs, each with an ordered chain of extents defining which blocks we're going to read/write - a series of tapes if you like. In the image header, we store dev_ts and the block chains, together with the configuration information. As long as the same bdevs are configured at boot time prior to the echo > /sys/power/resume, we're in business. Filesystems don't need to be mounted because we don't use filesystem code anyway. (LVM etc does though in so far as it's needed to make the dev_t match the device again). This matches with what you said above about hibernating to swap files and dedicated hibernation files - TuxOnIce uses exactly the same code to do the i/o to both; the variation is in the code to recognise the image header and allocate/free/bmap storage. <not a filesystem expert> Personally, I don't think ext[34] is broken. If there's data being left in the journal that will need replaying, then mounting without replaying the journal sounds wrong. Perhaps you should instead be arguing that nothing should be left in the journal after a filesystem freeze. But, of course, current code isn't doing a filesystem freeze (just a process freeze) and the kexec guys want to take even that away. </not a filesystem expert> In short, I agree. AFAICS, you need both the process freezer and filesystem freezing to make this thing fly properly. Nigel -- See http://www.tuxonice.net for Howtos, FAQs, mailing lists, wiki and bugzilla info. -
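The "ordered chain of extents" idea can be sketched as follows. This is a toy illustration in Python, not TuxOnIce code: it assumes the bmap step has already produced a list of block numbers in the order they will be written, and the function names are invented for the example.

```python
def blocks_to_extents(blocks):
    """Coalesce an ordered list of block numbers into (first, last) extents.

    Order is preserved: a new extent starts whenever the next block does
    not directly follow the previous one.
    """
    extents = []
    for block in blocks:
        if extents and extents[-1][1] + 1 == block:
            extents[-1][1] = block          # extend the current extent
        else:
            extents.append([block, block])  # start a new extent
    return [tuple(e) for e in extents]


def iter_blocks(extents):
    """Walk the chain like a tape, yielding block numbers in write order."""
    for first, last in extents:
        yield from range(first, last + 1)


bmapped = [10, 11, 12, 40, 41, 7]
chain = blocks_to_extents(bmapped)
```

Because the chain records only device blocks, reading it back needs no filesystem code, which matches the point that filesystems do not have to be mounted at resume time.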
The image-writing kernel of kexec-based hibernation runs in a controlled way. It is not used by normal users, so only the really necessary processes need to run. For example, it is possible that there is only one user process, the image-writing process, running in the image-writing kernel. So no freezer or blkdev snapshot is needed. Best Regards, Huang Ying -
Hi. You're thinking of the wrong kernel - we were talking about prior to switching to the kexec'd kernel while suspending. Regards, Nigel -- See http://www.tuxonice.net for Howtos, FAQs, mailing lists, wiki and bugzilla info. -
I hope you take into account encrypted swap configuration. Currently all three suspend implementations support using encrypted swap in order to suspend/resume. A configuration which forces the user to remap encryption on the kexec kernel during suspend is not valid. Best Regards, Alon Bar-Lev. -
Maybe, maybe not, dunno. That's why we haven't merged it yet. If it ends up being no good, we won't merge it! -
There needs to be an implementation of hibernation based on kexec with That interface should be running kernel -> user space -> target kernel. initramfs. We already seem to have that interface. And distros I agree. I'm still not quite convinced it will do a satisfactory job. But I think it does make sense to implement a general kexec with return and see if that can reasonably be used for handling hibernation issues. If done cleanly and with care the implementation won't be hibernation specific. Frankly this looks like the best way I can see to implement a general mechanism for calling silly firmware/BIOS/EFI services after we have a kernel up and running. It's a little bit like allowing X to call iopl(3) and do inb/outb directly. The configuration issues you raise pretty much exist for kexec on panic, and they seem to be being resolved for that case in a reasonable way. I do agree that the current kexec+return effort seems to be one of those unfortunate cases where we give every mechanism in the kernel to do something in user space and then no one actually implements the user space. That doesn't do anyone any good. For hibernation we don't have the absolute need to step outside of the current kernel that we do in the kexec on panic approach. However we have this practical fight about mechanism and policy, and kexec with return has this seductive allure that it appears to be the minimal necessary mechanism in the kernel. No one has yet attacked the hard problem of coming up with separate hibernate methods for drivers. This should be the hard part of the puzzle, and the recurring work from a kernel maintenance point of view. There is some reason to hope that maintenance will be a little simpler because you can get at all of the distinct pieces of the puzzle. Currently kexec with return appears to require the minimal amount of mechanism in the kernel and leaves the policy to someplace else, plus the code is not hibernation specific. ...
Well, I've been playing a bit with that for some time, but it's not easy by any means. In short, I'm seeing some problems related to the handling of ACPI that seem to shatter the entire idea of having separate hibernate methods, at least as far as ACPI systems are concerned. Greetings, Rafael -
So sad to hear this. Can you give a few more details, or a link? Best Regards, Huang Ying -
Well, the problem is that apparently some systems (eg. my HP nx6325) expect us to execute the _PTS ACPI global control method before creating the image _and_ to execute acpi_enter_sleep_state(ACPI_STATE_S4) in order to finally put the system into the sleep state. In particular, on nx6325, if we don't do that, then after the restore the status of the AC power will not be reported correctly (and if you replace the battery while in the sleep state, the battery status will not be updated correctly after the restore). Similar issues have been reported for other machines. Now, the ACPI specification requires us to put devices into low power states before executing _PTS and that's exactly what we're doing before a suspend to RAM. Thus, it seems that in general we need to do the same for hibernation on ACPI systems. Greetings, Rafael -
Then, is it possible to separate device quiesce from device suspend? Perhaps not for swsusp, but for kexec-based hibernation? Best Regards, Huang Ying -
It surely is possible, but I'm not sure if it's going to be useful. I mean, if we need to do exactly the same thing before a suspend to RAM and before a hibernation (ie. to put devices into low power states), why would we Frankly, I don't know. Generally, changing the way in which device drivers handle suspend (to RAM) and hibernation is a huge task. After considering this issue for some time I think that we really should start from hardening suspend (to RAM) so that it doesn't need the freezer any more, because _that_ would require us to change the suspend-related drivers' callbacks anyway. When we are sure how we are going to eliminate the freezer from suspend (to RAM), we'll know how that affects hibernation and what to do about it. Greetings, Rafael -
Suppose that instead of using ACPI S4 state at all, you instead just power off. Yes, you'll lose wakeup event functionality, and flashy LEDs, but doesn't this take care of the problem? The firmware shouldn't see the hibernate as anything other than a shutdown and reboot. ACPI should be initialized normally when resuming, which should take care of getting AC power status reported properly. This should be the behavior, anyway, on the many systems that do not It seems that if ACPI S4 is going to be used, Switching to low power state is something that should be done only immediately before entering that state (i.e. after the image has already been saved). In particular, it should not be done just before the atomic copy. It is true that (during resume) after the atomic copy snapshot is restored, drivers will need to be prepared (i.e. have saved whatever information is necessary) to _resume_ devices from the low power state, but that does not mean they have to actually be put into that low power state before the copy is made. I agree that for the kexec implementation there may be additional issues. For swsusp, uswsusp, and tuxonice, though, I don't see why there should be a problem. I think that, as was recognized before, all of the issues are resolved by properly considering exactly what each callback should do and when it should be called. The problems stem from ambiguous specifications, or trying to use the same callback for two different purposes or in two different cases. Let me know if I'm mistaken. -- Jeremy Maitin-Shepard -
See above. :-) Greetings, Rafael -
One gets the impression that the hibernation image includes a memory area used by the firmware. That could explain why devices need to be in a low-power state when the image is created -- so that when the image is restored, the firmware doesn't get confused about the device states. It would also explain why the firmware sees resume-from-power-off-hibernation as different from a regular reboot: because its data area gets overwritten as part of the resume. In reality it's probably more complicated than this, with weird interactions between the firmware and the various ACPI methods. Nevertheless, the main idea seems valid. Alan Stern -
I guess so, but I'm not sure. The ACPI NVS area is explicitly marked as reserved and we don't save it. On x86_64 we don't save any memory areas marked as reserved and yet the above happens. Greetings, Rafael -
"Rafael J. Wysocki" <rjw@sisk.pl> writes: I think you have mentioned before, though, that ACPI is first initialized by the boot kernel, before it is later initialized by resuming kernel. This could well be the source of the problem. In particular, isn't it the case that you also switch the devices to low power mode before resuming? -- Jeremy Maitin-Shepard -
No, it's not. I have tested that too with an ACPI-less boot kernel. Greetings, Rafael -
Well, it seems that there just must be some other bug. I would define anything that differs between the post-resume initialization of ACPI from the normal boot initialization of ACPI as a bug. If the interaction with the hardware is the same, then the behavior will be the same. -- Jeremy Maitin-Shepard -
The ACPI platform firmware is allowed to preserve information across the hibernation-resume cycle, so this need not be the same. Greetings, Rafael -
All of my comments related to the case where S4 is not being used (instead the system is just powered off normally), and a boot kernel that does not initialize ACPI is used. In that case, the ACPI platform firmware should not be able to distinguish a normal boot from a resume from hibernation. -- Jeremy Maitin-Shepard -
I think that in order for this to work, there would need to be some ABI whereby the resume-ing kernel can pass its entire ACPI state and a bunch of other ACPI-related device details to the resume-ed kernel, which I believe it does not do at the moment.

I believe that what causes problems is the ACPI state data that the kernel stores is *different* between identical sequential boots, especially when you add/remove/replace batteries, AC, etc. Since we currently throw away most of that in-kernel ACPI interpreter state data when we load the to-be-resumed image and replace it with the state from the previous boot it looks to the ACPI code and firmware like our system's hardware magically changed behind its back. The result is that the ACPI and firmware code is justifiably confused (although probably it should be more idempotent to begin with).

There's 2 potential solutions:
1) Formalize and copy a *lot* of ACPI state from the resume-ing kernel to the resume-ed kernel.
2) Properly call the ACPI S4 methods in the proper order

Neither one is particularly easy or particularly pleasant, especially given all the vendor bugs in this general area. Theoretically we should be able to do both, since one will be more reliable than the other on different systems depending on what kinds of firmware bugs they have. Cheers, Kyle Moffett -
Hi. That's certainly possible. We already pass a very small amount of data between the boot and resuming kernels at the moment, and it's done quite simply - by putting the variables we want to 'transfer' in a nosave page/section. I could conceive of a scheme wherein this was extended for driver data. Since the memory needed would depend on the drivers loaded, it would probably require that the space be allocated when hibernating, and the locations of structures be stored in the image header and then drivers notified of the locations to ... that said, I don't think the above should be necessary in most cases. I believe we're already calling the ACPI S4 methods in the proper order. If I understood correctly, Rafael put a lot of effort into learning what that was, Regards, Nigel -
Well, if the boot and image kernels are different, which is now possible on x86_64 with some recent patches (currently in -mm), the nosave trick won't work. Still, I don't think we need to pass anything from the boot to the image kernel. Moreover, we shouldn't do that, IMO (arguably, the boot kernel Yes, I did, but I can be wrong nevertheless. ;-) Greetings, Rafael -
I guess we should remove the nosave.... at least from x86-64. If someone tries to use it, he'll get a nasty surprise. -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html -
Agreed. I'll try to prepare a patch for that when I have a bit of time. Greetings, Rafael -
In fact we don't need to do this. The solution is not to touch ACPI in the boot kernel (ie. the one that loads the image) and pass control to the image kernel. This is how it's supposed to work according to the spec, more or less (well, there are some ugly details Rather the ACPI state data that the platform firmware stores may be different, depending on whether you enter S4 or S5 during "power off" and that determines the interactions between the kernel and the firmware after the next boot. Greetings, Rafael -
First of all, we will need to make the resumed kernel throw away *ALL* of its ACPI state on S5 and completely reinitialize ACPI as though it was booting for the first time on resume. From what I can tell, we "throw away" all the ACPI state in the boot kernel and reinitialize it there, but then the reinitialized state is overwritten with the resumed kernel's state and the two don't always happen to be the same. (Like if a battery got replaced or AC status changed). Umm, I don't see how that can possibly work properly. For a laptop, for example, the restore kernel will need to access the disk, the LCD display, and possibly the AC/battery and current CPU frequency. From what I understand of ACPI, both of the former may need ACPI code to operate properly (Isn't there an ATA taskfile object of some kind?) and the latter two almost definitely need ACPI. Ergo the boot kernel may need to initialize and use ACPI just to run an ATA taskfile so it That's not what he was talking about. The problem discussed was:
(A) You hibernate your box, entering S5 (IE: power off)
(B) You resume the box and the boot kernel inits all the ACPI stuff.
(C) The boot kernel's ACPI state is completely replaced by the resumed kernel's state.
(D) Hardware stops working mysteriously because of ACPI problems.
The only possible conclusion is that the state between the boot kernel and the resume kernel was *different* and so the device failed because the ACPI state in the resume kernel doesn't match the actual state of the hardware. Cheers, Kyle Moffett -
Yes, if we entered S5 in the last step of the hibernation sequence, the right thing to do would be to make the resumed kernel reinitialize ACPI from Usually it goes like that. Still, you can pass "acpi=off" to the boot kernel, Well, this is not the case on any systems that I have access to, including two quite modern notebooks. Apparently, everything works without ACPI on these machines. Besides, in theory, it's possible to use an "intelligent" boot loader to read I think it's even more complicated. The ACPI state of the resumed kernel has to match whatever is preserved by the platform. Well, my impression is that our current ACPI resume code actually expects the platform to preserve something and if that's missing the devices in question are not handled properly. If that really is the case, there is the question whether we can do something about it in a reasonable way and I can't answer it right now. Besides, I really think that we should use the ACPI S4 state, because machines generally support that. Greetings, Rafael -
FWIW, on all the hardware I have, Windows is able to deal with:
(1) hibernate Windows
(2) run $(OTHER_OS)
(3) resume Windows
... which seems to me to say that Linux is doing it wrong if it can't
handle other ACPI users between hibernate and resume. But maybe
that's just my hardware.
--
Joseph Fannin
jfannin@gmail.com
-
Hi Andrew, Well, I don't quite agree. For now, the kexec-based approach is missing the handling of devices, AFAICS. Namely, it's quite easy to snapshot memory with the help of kexec, but the state of devices gets trashed in the process, so you need some additional code saving the state of devices for you, executed before the kexec. Moreover, on ACPI systems the transition to the S4 sleep state and back to S0 (working state) is more complicated than a system checkpointing, because we are supposed to take the platform firmware into consideration in that case. The more I think about this, the more it seems to me that it just can't be done on top of kexec in a reasonable fashion. Of course, we could avoid handling the ACPI S4, but that would leave some people (including me ;-)) with semi-working hardware after the "restore". I don't think that's generally acceptable in the long run. IMHO, for ACPI systems the way to go is to harden suspend to RAM (with s2ram in place and the graphics adapters specifications from Intel and AMD released we are in a good position to do that) and build the S4 transition mechanism on top of that. It can be done easily by adapting the current hibernation code, but not on top of kexec (I'm afraid). [Besides, the current hibernation userland interface is used by default by openSUSE and it's also used by quite some Debian users, so we can't drop it overnight and it can't be implemented in a compatible way on top of the kexec-based solution.] -
Hi. Could it be fudged by giving userland a null image and having (say) the first ioctl be one that triggers all the real work (with other ioctls being noops or such like, as appropriate)? Regards, Nigel -
Well, the "suspend" part is probably doable, but I'm afraid of the "resume" one. Greetings, Rafael -
Hi. 'k. I've occasionally thought about trying it, but haven't ever gotten around to actually doing it yet. (I'd like to make TuxOnIce transparently replace both swsusp and uswsusp if I could). Regards, Nigel -
Yes. ACPI is the biggest issue for kexec-based hibernation now. I will try to work on that. At least I can prove whether kexec based Best Regards, Huang Ying -
Before replacing existing hibernation implementations, someone should fix kexec for i386 (maybe others?) EFI systems... -
(For the record, I do not think this is going to be a hibernation replacement any time soon. But it is functionality useful for other stuff -- dump memory and continue -- and yes it may be able to do hibernation in the long term. It really comes from the other side of reliability: * swsusp is "if your kernel is perfectly healthy, it will work" while this, coming from kdump is * "if your kernel is not completely trashed, it should work" ...which is why you can't use swsusp to do dump memory and continue -- you want to do dumps on "slightly broken" systems. And yes, as a side effect it may be able to do hibernation... why not, let's see how it works out). Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html -
I generally agree. :-) Greetings, Rafael -
Well this we have an implementation of (it's called shutdown) or does that method not do enough to meet the requirements of hibernation? If at all possible I would like to keep reboot, kexec and kexec+return Makes a reasonable amount of sense. We do need to save whatever state we cannot recover just by reprogramming the hardware. As long as the drivers are built so this is a good place for a At least for now that sounds like a reasonable work around. I don't think we want to merge this code until we have agreed upon how the new device_detach and device_reattach (or whatever we call the That does not sound correct. The current implementation of kexec_load does allocate a source page and give it a destination page and usually those two pages are different. But if our memory allocations happen to return a destination page there we use it directly, making no copy necessary. I think we are talking about the same thing but I'm not certain you have thought about the case where your shadow backup page happens Ok. This sounds like the existing implementation. Except it Yes. Unless we happen to have everything allocated on the same page. Does your code handle that case? I know the generic kexec code will pass lists like that in the proper circumstances. Especially for Bleh. We do need to document the requirements but we don't need a versioned monster. And we don't need to be exposing implementation details in that documentation. In the kexec world /sbin/kexec or another user space caller is responsible for passing information to our callers. To be polite we need to document more but the jump back protocol really should be as if the entry point kexec handed control to did Why don't we have a problem with this in the normal kexec case? It looks like we are doing swap_pages unconditionally so I don't see why we need two versions of this function. This bit really looks wrong. It is for /sbin/kexec to provide this Ok. This looks ...
I think "device_shutdown" is not enough for hibernation, because the current implementation of the device shutdown method does not consider "recover". For example, for hibernation, the request a device is currently executing should be delayed or finished before shutdown, and may be re-executed after "recover". So I think another pair of callbacks may There is a thread on LKML about this: http://lkml.org/lkml/2007/4/27/129 My description here has a problem. If the source page (shadow page) is the same as the target page, there is no copy or swap. I have thought about that, and the current implementation works in this situation too. In the original kernel it is an allocated page for kexec, so it will not be used Yes. This is the existing implementation, with just a small change in usage. I load all memory areas used by the kexeced kernel in addition to the kernel image. This is done in kexec-tools. So the shadow page is allocated for My code can handle that case. If everything is allocated on the same page, just do not swap, or swap with itself. The same lists of generic kexec This protocol is mainly for loading the hibernation image from the bootloader directly, not for kexec. An external protocol should be A mechanism should be provided to pass the jump back entry to the kexeced kernel. A kernel command line parameter constructed by Unlike normal kexec, some information needs to be read/written from/to the control page both in virtual mode and real mode. The code copied to the control page is run both in virtual mode and real mode, so the NX bit should be cleared for the control page. But in normal kexec, code copied NORET_TYPE is accurate here, because the return point is not in This is an interface for an external bootloader, not for kexec. And if the version of the original kernel and the kexeced kernel is different, a Maybe this is not needed if the kernel command line mechanism is used to pass the jump back entry point. The user space tool can get that through I think the ...
Please let the kexec_jump code just be triggered off of a flag in struct kimage. We just need to define an extra flag to sys_kexec_load, say KEXEC_RETURNS. Ideally in the long term we would not have to do anything except to accept the flag. Adding a flag makes a nice feature test if you want to see if your kernel supports the extended version of kexec. Until we get the hibernation methods sorted out storing the flag in struct kimage and making the methods that we call conditional feels like a more maintainable interface. Especially since we have to know at kexec image load time what we are going to do with the I understand where you are coming from with this implementation of kexec_jump but it looks like this is one of the big parts of this patch that have not reached their final form. This as everyone knows needs to be device_shutdown or a better hibernation Can't we just catch the nonboot cpus in a mutex? disable_nonboot_cpus is actually impossible to implement 100% reliably with current hardware. But something like smp_call_function so we trap them at a specific location and then the equivalent when we come back should be simple. I guess the tricky part is bringing the cpus back up again. I haven't looked at the cpu start up code yet to see if it is generally implementable. I would think so, but I guess Odd. I'm a little surprised that the console is the last Eric -
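The "flag as feature test" suggestion can be sketched with a toy model. This is Python standing in for the syscall path, not kernel code, and it rests on stated assumptions: the numeric value given to KEXEC_RETURNS is made up for the example (only the name appears in the thread), the loader is assumed to reject unknown flag bits with EINVAL, and `Kimage`, `kexec_load` and `supports_return` are illustrative names.

```python
import errno

KEXEC_ON_CRASH = 0x1  # existing kexec_load flag
KEXEC_RETURNS = 0x2   # hypothetical value for the flag proposed above


class Kimage:
    """Toy stand-in for struct kimage: remembers what was asked at load time."""

    def __init__(self, flags):
        self.returns = bool(flags & KEXEC_RETURNS)


def kexec_load(flags, supported=KEXEC_ON_CRASH | KEXEC_RETURNS):
    """Model of a loader that refuses flag bits it does not know about."""
    if flags & ~supported:
        raise OSError(errno.EINVAL, "unknown kexec flags")
    return Kimage(flags)


def supports_return(supported):
    """Feature test: does a loader with this flag mask accept KEXEC_RETURNS?"""
    try:
        kexec_load(KEXEC_RETURNS, supported)
    except OSError as err:
        if err.errno == errno.EINVAL:
            return False
        raise
    return True


old_kernel = supports_return(KEXEC_ON_CRASH)
new_kernel = supports_return(KEXEC_ON_CRASH | KEXEC_RETURNS)
```

Storing the decision in the image object at load time is what lets the later jump path branch on `image.returns` instead of needing a separate reboot command.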
You mean we use KEXEC_RETURNS when doing sys_kexec_load, then use the ordinary reboot command LINUX_REBOOT_CMD_KEXEC, which calls kexec_jump conditional Yes. I should use xchg(&kexec_image, NULL) as that of other kexec I think this is not very simple. Given that we may jump back from the kernel with SMP turned off, or from the bootloader directly. But CPU hotplug Best Regards, Huang Ying -
