The question has been asked before, and the answer is always the same: It's a lot more complicated than it sounds.
Adi Zaimi recently asked on the linux-kernel mailing list about the possibility of live kernel upgrades: moving from one kernel to another without rebooting and without interrupting services.
This time, however, Rob Landley responded with a series of very informative emails explaining just how complicated a prospect live kernel upgrades really are. He begins, "Thought about, yes. At length. That's why it hasn't been done. :)" He then walks through many of the details that complicate such efforts, and points out two projects of interest to anyone willing to attempt live upgrades: one working on suspending a running kernel to disk so it can be resumed later, and another working on a "two kernel monte", booting from one kernel directly into another. Rob's full emails follow.
From: Adi Zaimi
To: linux-kernel mailing list
Subject: kernel upgrade on the fly
Date: Tue, 18 Jun 2002 17:21:49 -0400 (EDT)

Hi all,

Has anybody worked on, or thought about, a way to upgrade the kernel while the system is running? I.e., with all processes waiting in their queues while the resident, older kernel gets replaced by a newer one.

I can see the advantage of such a thing when a server can have its kernel upgraded (a major or minor upgrade) without disrupting the ongoing services (ok, maybe a small few-seconds delay). Another use would be to switch between different kernels in the /boot/ directory (for testing purposes, etc.) without rebooting the machine.

A search of the web turned up nothing related to the above, so I don't know if this has been raised before. Would anybody else think this an interesting feature for the Linux kernel, or care to comment on the idea?

Cheers,
Adi Zaimi
Rutgers University

From: Rob Landley
Subject: Re: kernel upgrade on the fly
Date: Tue, 18 Jun 2002 15:37:23 -0400

On Tuesday 18 June 2002 05:21 pm, Adi Zaimi wrote:

> Hi all,
>
> Has anybody worked on, or thought about, a way to upgrade the kernel
> while the system is running? I.e., with all processes waiting in their
> queues while the resident, older kernel gets replaced by a newer one.

Thought about, yes. At length. That's why it hasn't been done. :)

The closest you'll get at the moment is some variant of two kernel monte, i.e. a reboot to a new kernel with all processes offed, but at least without involving the BIOS. The new swsusp infrastructure from Pavel Machek theoretically lets you freeze the state of your system to disk, so we're a heck of a lot farther ahead than we were. If you want to re-open this can of worms, the only way to go is to start with some combination of these two projects:

http://falcon.sch.bme.hu/~seasons/linux/swsusp.html
http://sourceforge.net/projects/monte/

That said, the fundamental problem is that when you change kernels, run-time state structures change. Parsing your run-time state from oldvers to feed into newvers can't really be done automatically, because your tool wouldn't know what any of the changes MEAN, so you would probably have to write a custom frozen process converter, which would be a pain and a half to debug, to say the least. (And by the time you've got that even half debugged, you need to do it for the NEXT kernel...)

Of course software suspend theoretically deals with at least some of the device driver issues, so there's a certain amount of handwaving you can do on that end. And migrating hot network connections is something people have in fact done before, although you'll have to ask around about who. (Ask the security nuts, they consider it a bad thing. :)

Nothing is impossible for anyone impervious to reason, and you might surprise us (it'd make a heck of a graduate project). Hot migration isn't IMPOSSIBLE, it's just a flipping pain in the ass. But the issue's a bit threadbare in these parts (somewhere between "are we there yet mommy?" and "can I buy a pony?"). Try the swsusp mailing list, they might be willing to humor you...

(And the people most likely to WANT this feature ("this system never goes down" types) are also the least likely to want to deal with subtle bugs from a bad conversion that don't manifest until a week after the new system comes up, when cron goes nuts at 3 am. Of course whether hot migration is more dangerous to your data than the interaction between Andre's and Martin's egos in the ATAPI layer is an open question... :) Ahem. Right...)

The SANE answer has always been to just schedule some down time for the box. The insane answer involves giving an awful lot of money to Sun or IBM or some such for hot-pluggable backplanes. (How do you swap out THE BACKPLANE? That's an answer nobody seems to have...)

Clusters: migrating tasks in a cluster is a potentially similar problem. Look at mosix and the NUMA stuff as well, if you're actually serious about this. You have to reduce a process to its vital data, once all the resources you can peel away from it have been peeled away, swapped out, freed, etc. If you can suspend and save an individual running process to a disk image (just a file in the filesystem), in such a way that it can be individually re-loaded later (by the same kernel), you're halfway there. No, it's not as easy as it sounds. :)

> I can see the advantage of such a thing when a server can have its kernel
> upgraded (a major or minor upgrade) without disrupting the ongoing services
> (ok, maybe a small few-seconds delay). Another use would be to switch
> between different kernels in the /boot/ directory (for testing purposes,
> etc.) without rebooting the machine.

See "belling the cat". Yeah, it's a great idea. The implementation's the tricky bit.

> Would anybody else think this an interesting feature for the Linux
> kernel, or care to comment on the idea?
>
> Cheers,
>
> Adi Zaimi
> Rutgers University

Don't you guys have professors you can ask about this sort of thing? (Or are you going to the Camden campus, says the alumnus who survived the first year of Whitman's budget cuts...)

Rob
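To make Rob's fundamental problem concrete: the two structures below are invented for illustration (no real kernel structure is this small), but they show why a verbatim dump of old-kernel state can't simply be read back by a new kernel. Add or reorder one field and every offset behind it moves:

    /* Hypothetical sketch: the same logical per-process record,
     * as two different kernel versions might lay it out. */
    #include <stdio.h>
    #include <stddef.h>

    struct task_state_v1 {            /* imaginary "oldvers" layout */
        int   pid;
        long  prio;
        void *mm_base;
    };

    struct task_state_v2 {            /* imaginary "newvers" layout */
        int   pid;
        int   cpus_allowed;           /* new field added in the middle */
        long  prio;                   /* ...so this offset shifts      */
        void *mm_base;
    };

    int main(void)
    {
        printf("v1: size=%zu, prio at offset %zu\n",
               sizeof(struct task_state_v1),
               offsetof(struct task_state_v1, prio));
        printf("v2: size=%zu, prio at offset %zu\n",
               sizeof(struct task_state_v2),
               offsetof(struct task_state_v2, prio));
        return 0;
    }

A block-read of a v1 image into a v2 structure would silently misparse every field after the insertion point, which is exactly why Rob says the conversion can't be automatic: only a human knows what the changed fields mean.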
From: John Alvord
Subject: Re: kernel upgrade on the fly
Date: Wed, 19 Jun 2002 10:22:59 -0700

On Tue, 18 Jun 2002 15:37:23 -0400, Rob Landley wrote:

> Thought about, yes. At length. That's why it hasn't been done. :)

IMO the biggest reason it hasn't been done is the existence of loadable modules. Most driver-type development work can be tested without rebooting.

john

From: Rob Landley
Subject: Re: kernel upgrade on the fly
Date: Wed, 19 Jun 2002 12:56:03 -0400

On Wednesday 19 June 2002 01:22 pm, John Alvord wrote:

> IMO the biggest reason it hasn't been done is the existence of
> loadable modules. Most driver-type development work can be tested
> without rebooting.

That's part of it, sure. (And I'm sure the software suspend work is leveraging the ability to unload modules.)

There's a dependency tree: processes need resources like mounted filesystems and open file handles to the network stack and such, and you can't unmount filesystems and unload devices while they're in use. Taking a running system apart, and keeping track of the pieces needed to put it back together again, is a bit of a challenge.

The software suspend work can't freeze processes individually to separate files (that I know of), but I've heard blue-sky talk about potentially adding it. (I don't know what the actual plans are; Pavel Machek probably would.) If processes could be frozen in a somewhat kernel-independent way (so that their run-time state could be parsed back in from a known format and flung into any functioning kernel), then upgrading to a new kernel would just be a question of suspending all the processes you care about preserving, doing a two kernel monte, and restoring the processes. Migrating a process from one machine to another in a network cluster would be possible too.
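What such a "somewhat kernel-independent" freeze format might look like is left open in the thread. Purely as a sketch (every name and field below is invented, not taken from swsusp or any other project), the general shape would be an explicit, versioned on-disk layout with fixed-width fields, so that no kernel's internal structures ever appear in the file verbatim:

    /* Invented checkpoint-file layout: versioned, explicit records
     * rather than raw dumps of kernel structures. */
    #include <stdint.h>

    #define CKPT_MAGIC   0x434b5054u  /* "CKPT" */
    #define CKPT_VERSION 1            /* format version, not kernel version */

    struct ckpt_header {
        uint32_t magic;
        uint32_t version;
        uint32_t n_vm_areas;          /* memory-mapping records follow */
        uint32_t n_files;             /* then open-file records        */
    };

    struct ckpt_vm_area {
        uint64_t start, end;          /* where to map it back in       */
        uint32_t prot;                /* read/write/exec bits          */
        uint32_t flags;               /* shared/private, anonymous     */
        uint64_t file_off;            /* offset if file-backed         */
        /* page contents follow inline */
    };

    struct ckpt_file {
        int32_t  fd;                  /* descriptor number to restore  */
        uint64_t pos;                 /* position in the file          */
        uint32_t path_len;            /* path string follows inline    */
    };

The point of such a format is that the converter problem moves from "parse another kernel's structures" to "parse a documented file format", which any kernel version can be taught to do once.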
I'm sure it's not as easy as it sounds, but looking at the software suspend work would be a necessary first step. They are, at least, serializing processes to disk and bringing them back afterwards. I'm fairly certain it's happening the way Microsoft Word saves *.doc files (block-write the run-time structures to disk and block-read them back in verbatim later, and hope all your compiler alignment offsets and such match if there's any version skew). Then again, the StarOffice people reverse engineered that and made it (mostly) work without even having access to the source code... :)

Hmmm, what would be involved in serializing a process to disk? Obviously you start by sending it a suspend signal. There's the process stuff, of course (priority, etc.). That's not too bad.

You'd need to record all the memory mappings (not just the contents of the physical and swapped-out memory mappings, which should be saved to the serializing file, but also the memory protection states and memory-mapped file ranges and such, so you can map it all back in at the appropriate location later). I'd bug whoever did the recent shared page table work (Daniel Phillips?) for information about what that really MEANS.

You'd need to record all the open file handles, of course. (For actual files this includes position in file, corresponding locks, etc. For the zillions of things that just LOOK like files (pipes and sockets and character and block devices), expect special case code.)

Pipes bring up a fun point: you can't always serialize just one process. Sometimes they clump together, and if you kill one, more go down with it. Thread groups are easy to spot, as well as parent/child relationships that share memory maps and file handles and such, but even just a simple "cat blah | less" means there are two processes connected by a pipe which pretty much need to be serialized together. (A common real-world case is that one of those processes is going to be the X11 server, which brings up a WORLD of fun. For a 1.00 release it's an obvious "Don't Do That Then", and later on it might have special case behavior.)

If an actual file handle is open to an otherwise unlinked file, you need to either make a link to that file somewhere (not too hard, that info is already in /proc/###/fd) or maybe cache the contents of the file as part of the serialized image...

Which brings up the whole question of how portable a serialized program image should be. Forget swapping kernels, I mean running the system for a while before resuming the "frozen" executable. Rename a couple files and the resume is going to get confused. You kind of have to restore to the exact same system you left off at, because if you have an open file handle to a file or device driver that isn't there on the resumed system, you basically have some variant of a "broken pipe" scenario. (Then again, forced unmount of filesystems can sort of give you this problem anyway, so infrastructure to deal with it is going to have to be faced at some point...) For rebooting a running system with the same mounted partitions and hopefully the same set of device drivers, this isn't really any worse than software suspend. And detecting a missing file and having the resume fail with an error would be pretty easy. It would also be pretty darn easy to trigger, but that's the user's problem...

What other resources attach to a process? The process info itself (user ID, capabilities), memory mappings, file handles... bound sockets... signal handlers and masks... I/O port mappings and such if you're running as root...

It's not an unsolvable problem, but it IS a can of worms. Just plain reparenting a process turned out to be complicated enough that they made reparent_to_init() (see kernel/sched.c).

Rob
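Much of the inventory Rob walks through is already visible from user space. As a starting point, a minimal sketch, assuming only the long-standing /proc/PID/maps and /proc/PID/fd interfaces, that lists a process's memory mappings and open descriptors; pipes and sockets show up as "pipe:[inode]" and "socket:[inode]" links, which is also how you would spot two processes joined by the same pipe:

    #include <stdio.h>
    #include <unistd.h>
    #include <dirent.h>
    #include <limits.h>

    int main(int argc, char **argv)
    {
        char path[PATH_MAX], line[512];
        const char *pid = argc > 1 ? argv[1] : "self";

        /* Memory mappings: address range, protection, backing file. */
        snprintf(path, sizeof(path), "/proc/%s/maps", pid);
        FILE *maps = fopen(path, "r");
        if (!maps) { perror(path); return 1; }
        while (fgets(line, sizeof(line), maps))
            fputs(line, stdout);
        fclose(maps);

        /* Open files: each entry in /proc/PID/fd is a symlink to the
         * underlying object; readlink() reveals what kind it is. */
        snprintf(path, sizeof(path), "/proc/%s/fd", pid);
        DIR *fds = opendir(path);
        if (!fds) { perror(path); return 1; }
        struct dirent *d;
        while ((d = readdir(fds)) != NULL) {
            char link[PATH_MAX], target[PATH_MAX];
            ssize_t n;
            if (d->d_name[0] == '.')
                continue;
            snprintf(link, sizeof(link), "/proc/%s/fd/%s", pid, d->d_name);
            n = readlink(link, target, sizeof(target) - 1);
            if (n < 0)
                continue;
            target[n] = '\0';
            printf("fd %s -> %s\n", d->d_name, target);
        }
        closedir(fds);
        return 0;
    }

Run against even a trivial process, the output tends to make Rob's point for him: shared library mappings, a controlling tty, and assorted pipes and sockets all have to be accounted for before anything can be frozen.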
From: Adi
To: linux-kernel mailing list
Subject: Re: kernel upgrade on the fly
Date: Thu, 20 Jun 2002 16:19:59 -0400 (EDT)

Thanks for the responses, especially Rob. I was trying to find previous threads about this and could not find them. Agreed, swsusp is a step toward that goal; the way that memory is saved, though, may not necessarily make this easier, at least in the current state of swsusp.

As you were mentioning, the process information needs to be summarised and saved in such a way that the new kernel can pick it up and construct its own queues of processes, independent of the differences between the kernels being swapped.

Well, this does touch the idea of having processes migrate from one machine to others in a network. In fact, I don't understand why it is so hard to reparent a process. If it can be reparented within a machine, then it can migrate to other machines as well, no?

Rob, I am going to the Newark campus FYI, and have interests in some AI stuff.

Thanks again,
Adi

From: Rob Landley
Subject: Re: kernel upgrade on the fly
Date: Fri, 21 Jun 2002 09:42:44 -0400

> Thanks for the responses, especially Rob. I was trying to find previous
> threads about this and could not find them. Agreed, swsusp is a step
> toward that goal; the way that memory is saved, though, may not
> necessarily make this easier, at least in the current state of swsusp.

Several people have mentioned process migration in clusters. Jesse Pollard says he expects to see checkpointing of arbitrary user processes working this fall, and Nick LeRoy replied to him about the Condor project, which apparently does something similar in user space:

http://www.uwsg.iu.edu/hypermail/linux/kernel/0206.2/1017.html
http://www.cs.wisc.edu/condor/

You might also want to look at the crash dump code (and the multithreaded crash dump patch floating around on the 2.5 to-do list) as another starting point, since A) it's flushing user info for a single process into a file in a well-known format, and B) such a file can already be loaded back in and at least somewhat resumed by the GNU debugger (gdb).

> As you were mentioning, the process information needs to be summarised
> and saved in such a way that the new kernel can pick it up and construct
> its own queues of processes, independent of the differences between the
> kernels being swapped.

Which isn't impossible; I remember migrating WWIV message base files from version to version a dozen years ago. Good old brute force did the job: new->field=old->field; There's almost certainly a more elegant way to do it, but brute force has the advantage that we know it could be made to work...

As far as maintaining a "convert 2.4.36->2.4.37" executable goes (to be released with each kernel version), the fact that there's a patch file to take the kernel's source from version to version should help a LOT with figuring out what structures got touched and what exactly needs to be converted. It still needs a human maintainer, though. It's also bound to lag the kernel releases a bit, but that's not such a bad thing...
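A toy rendition of that brute-force new->field=old->field approach, with both structure layouts invented for illustration; a real converter would need one such function for every structure the version patch touched:

    #include <string.h>

    struct frozen_task_v36 {          /* hypothetical "old" layout */
        int  pid;
        long prio;
        char comm[16];
    };

    struct frozen_task_v37 {          /* hypothetical "new" layout */
        int  pid;
        int  policy;                  /* field added in "v37"      */
        long prio;
        char comm[16];
    };

    void convert_task_36_to_37(const struct frozen_task_v36 *old,
                               struct frozen_task_v37 *new)
    {
        new->pid  = old->pid;         /* unchanged fields copy straight over */
        new->prio = old->prio;
        memcpy(new->comm, old->comm, sizeof(new->comm));
        new->policy = 0;              /* new field: invent a sane default    */
    }

The tedium is the point: the copying itself is trivial, but deciding the default for every field the old kernel never had (and what to do with fields the new kernel dropped) is the part that, as Rob says, needs a human maintainer for every release.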
> Well, this does touch the idea of having processes migrate from one
> machine to others in a network. In fact, I don't understand why it is so
> hard to reparent a process. If it can be reparented within a machine,
> then it can migrate to other machines as well, no?

A process can touch zillions of arbitrary resources, which may not BE there on the other machine. Suppose you have an mmap into "/usr/thingy/rutabega/arbitrary/database/filename.fred", and on the remote machine fred is there and the contents are identical, but the directory "arbitrary" is owned by the wrong user so you don't have permission to descend into it (or the /etc/passwd file gives the same username a different uid, or assigns that uid to a different username...).

Or how about fifos: are they all there on the resume? Fifos are kind of brain damaged, so it's hard to re-use them, and "create, two connects, delete" is a pretty common strategy. The program has the initial setup and negotiation code, but no way to re-run it on a restore. And can the processes at each end be restored, in pairs, such that they still communicate with each other properly? What about a process talking to a one-to-many server like X11 or Apache or some such? Freezing the server to go with your client is kind of overkill, eh? Gotta draw a line somewhere if you're going to cut out a running process and stick it in an envelope...

The easy answer is to have the restore fail easily and verbosely, have attempt 0.1 only able to freeze and restore a fairly small subset of processes (like the distributed.net client and equivalents that sit in their corner twiddling their thumbs really fast), and then add on as you need more. The wonderful world of shared library version skew is not something checkpointing code should really HAVE to deal with; just fail if the environment isn't pretty darn spotless, and hand these problems over to the "migration" utility. If you're restoring back on top of the same set of mounted filesystems, and you're only doing so once (freeze processes, reboot to new kernel, thaw processes, discard checkpoint files), your problem gets much simpler. Still, did your reboot wipe out stuff in /tmp that running processes need? (Hey, if it's on shmfs and you didn't save it...)

Also, restoring one of these frozen processes has a certain amount of security implications, doesn't it? All well and good to say "well, the process's file belongs to user 'barbie', and the saved uid matches, so load it back in", except what if it was originally an suid executable so it could bind to some resource and then drop privileges? How do you know some user trying to attack the system didn't edit a frozen process file? You pretty much have to cryptographically sign the files to allow non-root users to load them back in (public key cryptography, not md5sum; it's gotta be a secret key, or a user with your source code could replicate the process of creating one of these suckers with arbitrary contents in userspace...). Again, less of a problem in a "trusted" environment, but this is unix we're talking about, and unless you're making an embedded system to put in a toaster it will probably be attached to the internet. And another easy answer is "don't do that then", or "only allow root to restore the suckers" (that last one probably has to be the case anyway; make an suid executable to verify the save files via a gpg signature if you REALLY want users to be able to do this, i.e. shove this problem into user space... :)
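Rob's "fail easily and verbosely" rule can be sketched as a pre-restore check. The record layout below is invented (a real checkpointer would record locks, ownership, and more), but the idea is just to stat() everything the checkpoint claims to depend on and refuse loudly on any mismatch:

    #include <stdio.h>
    #include <sys/stat.h>

    struct saved_file {
        const char *path;             /* path recorded at freeze time  */
        ino_t       ino;              /* inode recorded at freeze time */
        off_t       size;             /* size recorded at freeze time  */
    };

    /* Returns 0 if every recorded file still matches, -1 otherwise. */
    int restore_preflight(const struct saved_file *files, int n)
    {
        struct stat st;

        for (int i = 0; i < n; i++) {
            if (stat(files[i].path, &st) != 0) {
                fprintf(stderr, "restore: %s is gone\n", files[i].path);
                return -1;
            }
            if (st.st_ino != files[i].ino || st.st_size != files[i].size) {
                fprintf(stderr, "restore: %s changed since freeze\n",
                        files[i].path);
                return -1;
            }
        }
        return 0;
    }

Refusing to thaw is cheap; resuming a process against a world that has shifted underneath it is how you get the subtle week-later failures Rob warns about.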
> Rob, I am going to the Newark campus FYI, and have interests in some AI
> stuff.
>
> Thanks again,

I'm just trying to give you some idea how much work you're in for. Then again, Linus is on record as saying that if he knew how much work the kernel would turn out to be, he probably never would have started it... :)

> Adi

Rob