Linux: Live Kernel Upgrades

Submitted by Jeremy
on June 21, 2002 - 4:19pm

The question has been asked before, and the answer is always the same: It's a lot more complicated than it sounds.

Adi Zaimi recently asked on the lkml about the possibility of live kernel upgrades: the ability to upgrade from one kernel to another without rebooting or interrupting services.

This time, however, Rob Landley responded with a series of very informative emails explaining just how complicated live kernel upgrades are. In response to the question, he first says, "Thought about, yes. At length. That's why it hasn't been done. :)" He then goes on to detail what complicates such efforts, and points to two projects of interest to anyone willing to attempt live upgrades. One (swsusp) is working on suspending a live kernel to disk so it can be restarted later. The other is a "two kernel monte", which boots directly from one kernel into another. The full thread follows.


From: Adi Zaimi
To: linux-kernel kernel list
Subject: kernel upgrade on the fly
Date: 	Tue, 18 Jun 2002 17:21:49 -0400 (EDT)

Hi all,

 has anybody worked or thought about a property to upgrade the kernel
while the system is running?  ie. with all processes waiting in their
queues while the resident-older kernel gets replaced by a newer one.

I can see the advantage of such a thing when a server can have the kernel
upgraded (major or minor upgrade) without disrupting the ongoing services
(ok, maybe a small few-seconds delay). Another instance would be to
switch between different kernels in the /boot/ directory (for testing
purposes, etc.) without rebooting the machine.

A search of the web resulted in no related information to the above so I
dont know if such an issue has been raised before.

Would anybody else think this to be an interesting property to have for
the linux kernel or care to comment on this idea?

Cheers,

Adi Zaimi
Rutgers University

From: Rob Landley
Subject: Re: kernel upgrade on the fly
Date: 	Tue, 18 Jun 2002 15:37:23 -0400

On Tuesday 18 June 2002 05:21 pm, Adi Zaimi wrote:
> Hi all,
>
>  has anybody worked or thought about a property to upgrade the kernel
> while the system is running?  ie. with all processes waiting in their
> queues while the resident-older kernel gets replaced by a newer one.

Thought about, yes.  At length.  That's why it hasn't been done. :)

Closest you'll get at the moment is some variant of two kernel monte, I.E. a 
reboot to a new kernel with all processes offed, but at least without 
involving the bios.

The new swsusp infrastructure from Pavel Machek theoretically lets you freeze 
the state of your system to disk, so we're a heck of a lot farther ahead than 
we were.  If you want to re-open this can of worms, the only way to go is to 
start with some combination of these two projects:

http://falcon.sch.bme.hu/~seasons/linux/swsusp.html
http://sourceforge.net/projects/monte/

That said, the fundamental problem is that when you change kernels, run-time 
state structures change.  Parsing your run-time state from oldvers to feed 
into newvers can't really be done automatically because your tool wouldn't 
know what any of the changes MEAN, so you would probably have to write a 
custom frozen process converter, which would be a pain and a half to debug, 
to say the least.  (And by the time you've got that even half debugged you 
need to do it for the NEXT kernel...)

Of course software suspend theoretically deals with at least some of the 
device driver issues, so there's a certain amount of handwaving you can do on 
that end.  And migrating hot network connections is something people have in 
fact done before, although you'll have to ask around about who.  (Ask the 
security nuts, they consider it a bad thing. :)

Nothing is impossible for anyone impervious to reason, and you might surprise 
us (it'd make a heck of a graduate project).  Hot migration isn't IMPOSSIBLE, 
it's just a flipping pain in the ass.  But the issue's a bit threadbare in 
these parts (somewhere between "are we there yet mommy?" and "can I buy a 
pony?").  Try the swsusp mailing list, they might be willing to humor you...

(And the people most likely to WANT this feature ("this system never goes 
down" types) are also the least likely to want to deal with subtle bugs from 
a bad conversion that don't manifest until a week after the new system comes 
up when cron goes nuts at 3 am.  Of course whether hot migration is more 
dangerous to your data than the interaction between Andre's and Martin's 
egos in the ATAPI layer is an open question... :)  Ahem.  Right...)

The SANE answer always has been to just schedule some down time for the box.  
The insane answer involves giving an awful lot of money to Sun or IBM or some 
such for hot-pluggable backplanes.  (How do you swap out THE BACKPLANE?  
That's an answer nobody seems to have...)

Clusters.  Migrating tasks in the cluster, potentially similar problem.  Look 
at mosix and the NUMA stuff as well, if you're actually serious about this.  
You have to reduce a process to its vital data, once all the resources you 
can peel away from it have been peeled away, swapped out, freed, etc.  If you 
can suspend and save an individual running process to a disk image (just a 
file in the filesystem), in such a way that it can be individually re-loaded 
later (by the same kernel), you're halfway there.  No, it's not as easy as it 
sounds. :)

> I can see the advantage of such a thing when a server can have the kernel
> upgraded (major or minor upgrade) without disrupting the ongoing services
> (ok, maybe a small few-seconds delay). Another instance would be to
> switch between different kernels in the /boot/ directory (for testing
> purposes, etc.) without rebooting the machine.

See "belling the cat".  Yeah, it's a great idea.  The implementation's the 
tricky bit.

> Would anybody else think this to be an interesting property to have for
> the linux kernel or care to comment on this idea?
>
> Cheers,
>
> Adi Zaimi
> Rutgers University

Don't you guys have professors you can ask about this sort of thing?  (Or are 
you going to the camden campus, says the alumni who survived the first year 
of Whitman's budget cuts...)

Rob

From: John Alvord
Subject: Re: kernel upgrade on the fly
Date: 	Wed, 19 Jun 2002 10:22:59 -0700

On Tue, 18 Jun 2002 15:37:23 -0400, Rob Landley wrote:

>Thought about, yes.  At length.  That's why it hasn't been done. :)

IMO the biggest reason it hasn't been done is the existence of
loadable modules. Most driver-type development work can be tested
without rebooting.

john

From: Rob Landley
Subject: Re: kernel upgrade on the fly
Date: 	Wed, 19 Jun 2002 12:56:03 -0400

On Wednesday 19 June 2002 01:22 pm, John Alvord wrote:
> IMO the biggest reason it hasn't been done is the existence of
> loadable modules. Most driver-type development work can be tested
> without rebooting.

That's part of it, sure.  (And I'm sure the software suspend work is 
leveraging the ability to unload modules.)

There's a dependency tree: processes need resources like mounted filesystems 
and open file handles to the network stack and such, and you can't unmount 
filesystems and unload devices while they're in use.  Taking a running system 
apart and keeping track of the pieces needed to put it back together again is 
a bit of a challenge.

The software suspend work can't freeze processes individually to separate 
files (that I know of), but I've heard blue-sky talk about potentially adding 
it.  (Dunno what the actual plans are, Pavel Machek probably would).  If 
processes could be frozen in a somewhat kernel independent way (so that their 
run-time state was parsed in again in a known format and flung into any 
functioning kernel), then upgrading to a new kernel would just be a question 
of suspending all the processes you care about preserving, doing a two kernel 
monte, and restoring the processes.  Migrating a process from one machine to 
another in a network cluster would be possible too.

I'm sure it's not as easy as it sounds, but looking at the software suspend 
work would be a necessary first step.  They are, at least, serializing 
processes to disk and bringing them back afterwards.  I'm fairly certain it's 
happening the way Microsoft Word saves *.doc files (block write the run-time 
structures to disk and block read them back in verbatim later, and hope all 
your compiler alignment offsets and such match if there's any version skew).

Then again, the star office people reverse engineered that and made it 
(mostly) work without even having access to the source code... :)

Hmmm, what would be involved in serializing a process to disk?  Obviously you 
start by sending it a suspend signal.  There's the process stuff, of course.  
(Priority, etc.)  That's not too bad.  You'd need to record all the memory 
mappings (not just the contents of the physical and swapped out memory 
mappings (which should be saved to the serializing file), but also the memory 
protection states and memory mapped file ranges and such, so you can map it 
all back in at the appropriate location later).  I'd bug whoever did the 
recent shared page table work (Daniel Phillips?) for information about what 
that really MEANS.
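
As a rough illustration of what "recording the memory mappings" involves, the sketch below parses the text format of /proc/<pid>/maps as found on current Linux systems.  The Mapping class and both function names are hypothetical, and this only captures the layout and protection bits, not the page contents.

```python
from dataclasses import dataclass


@dataclass
class Mapping:
    start: int    # first address of the range
    end: int      # one past the last address
    perms: str    # e.g. "r-xp": read/write/execute, private or shared
    offset: int   # offset into the backing file
    path: str     # backing file, or "" for anonymous memory


def parse_maps_line(line: str) -> Mapping:
    # Line format: start-end perms offset dev inode [path]
    fields = line.split(None, 5)
    start, end = (int(x, 16) for x in fields[0].split("-"))
    path = fields[5].strip() if len(fields) > 5 else ""
    return Mapping(start, end, fields[1], int(fields[2], 16), path)


def snapshot_mappings(pid="self"):
    # Read every mapping of a process; needs a Linux /proc filesystem.
    with open(f"/proc/{pid}/maps") as f:
        return [parse_maps_line(line) for line in f if line.strip()]
```

A restore step would have to walk such a list and re-create each range at the same address with the same protection, which is where the hard part starts.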

You'd need to record all the open file handles, of course. (For actual files 
this includes position in file, corresponding locks, etc.  For the zillions 
of things that just LOOK like files, pipes and sockets and character and 
block devices, expect special case code).
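
For the plain-file case, a minimal sketch of saving and restoring a descriptor might look like the following.  It assumes the caller already knows the path (a descriptor does not carry a name), ignores locks entirely, and the function names are made up for illustration.

```python
import fcntl
import os


def snapshot_fd(fd: int, path: str) -> dict:
    # Keep only the access mode and append flag; creation-time flags such
    # as O_CREAT or O_TRUNC must not be replayed on restore.
    flags = fcntl.fcntl(fd, fcntl.F_GETFL) & (os.O_ACCMODE | os.O_APPEND)
    return {
        "path": path,
        "flags": flags,
        "offset": os.lseek(fd, 0, os.SEEK_CUR),  # current file position
    }


def restore_fd(snap: dict) -> int:
    # Reopen the same path and seek back to where the process left off.
    fd = os.open(snap["path"], snap["flags"])
    os.lseek(fd, snap["offset"], os.SEEK_SET)
    return fd
```

Pipes, sockets and devices have no equivalent of "reopen and seek", which is why they need the special case code mentioned above.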

Pipes bring up a fun point: you can't always serialize just one process.  
Sometimes they clump together, and if you kill one more go down with it.  
Thread groups are easy to spot, as well as parent/child relationships that 
share memory maps and file handles and such, but even just a simple "cat blah 
| less" means there are two processes connected by a pipe which pretty much 
need to be serialized together.  (A common real-world case is that one of 
those processes is going to be the X11 server, this brings up a WORLD of fun. 
 For a 1.00 release it's an obvious "Don't Do That Then", and later on might 
have special case behavior.)
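
The "clumping" can be pictured as finding connected groups of processes.  The sketch below assumes you have somehow collected, for each PID, the set of pipe identifiers it holds open (the input dictionary is hypothetical data, not something a real kernel interface hands you in this shape) and groups PIDs that are linked directly or transitively.

```python
from collections import defaultdict


def clump_by_shared_pipes(pipes_by_pid):
    """Group PIDs connected, directly or transitively, by a shared pipe."""
    parent = {pid: pid for pid in pipes_by_pid}

    def find(pid):
        # Union-find lookup with path halving.
        while parent[pid] != pid:
            parent[pid] = parent[parent[pid]]
            pid = parent[pid]
        return pid

    first_holder = {}
    for pid, pipe_ids in pipes_by_pid.items():
        for pipe_id in pipe_ids:
            if pipe_id in first_holder:
                # Another process holds the same pipe: merge the groups.
                parent[find(pid)] = find(first_holder[pipe_id])
            else:
                first_holder[pipe_id] = pid

    groups = defaultdict(set)
    for pid in pipes_by_pid:
        groups[find(pid)].add(pid)
    return sorted(sorted(group) for group in groups.values())
```

With "cat blah | less" both PIDs land in one group and would have to be frozen together; a process talking to the X server would drag the server into its group, which is the "Don't Do That Then" case.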

If an actual file handle is open to an otherwise unlinked file, you need to 
either make a link to that file somewhere (not too hard, that info is already 
in /proc/###/fd) or maybe cache the contents of the file as part of the 
serialized image...
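
Spotting this case from inside the owning process is at least simple on POSIX systems: an open file with no remaining directory entries reports a link count of zero.  A small sketch (the function name is illustrative):

```python
import os


def is_unlinked(fd: int) -> bool:
    # st_nlink counts directory entries pointing at the inode; zero means
    # the file lives on only because some descriptor still has it open.
    return os.fstat(fd).st_nlink == 0
```

A checkpointer that finds such a descriptor has to pick one of the two options above before the process is torn down, since the data disappears with the last close.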

Which brings up the whole question of how portable a serialized program image 
should be.  Forget swapping kernels, I mean running the system for a while 
before resuming the "frozen" executable.  Rename a couple files and the 
resume is going to get confused.  You kind of have to restore to the exact 
same system you left off at, because if you have an open file handle to a file 
or device driver that isn't there on the resumed system, you basically have 
some variant of a "broken pipe" scenario.  (Then again, forced unmount of 
filesystems can sort of give you this problem anyway, so infrastructure to 
deal with it is going to have to be faced at some point...)

For rebooting a running system with the same mounted partitions and hopefully 
the same set of device drivers, this isn't really any worse than software 
suspend.  And detecting a missing file and having the resume fail with an 
error would be pretty easy.  But also pretty darn easy to trigger, but that's 
the user's problem...
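
A minimal sketch of that "fail with an error" check, assuming the saved image carries a list of the paths the process had open (RestoreError and check_restorable are hypothetical names):

```python
import os


class RestoreError(Exception):
    """Raised when the environment no longer matches the saved image."""


def check_restorable(required_paths):
    # Refuse to resume, and say exactly why, if anything has gone missing.
    missing = [p for p in required_paths if not os.path.exists(p)]
    if missing:
        raise RestoreError("cannot resume; missing: " + ", ".join(missing))
```

This only checks that a name still exists, not that it still refers to the same file, so a rename followed by a new file of the same name would slip through.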

What other resources attach to a process?  The process info itself (user ID, 
capabilities), memory mappings, file handles...  Bound sockets...  Signal 
handlers and masks...  I/O port mappings and such if you're running as root...
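
To make part of that list concrete, here is a sketch that gathers the easily reachable pieces for the calling process using only standard library calls on a Unix system.  The dictionary layout is an assumption for illustration; capabilities, bound sockets and I/O port mappings are left out because they have no equally simple query.

```python
import os
import signal


def snapshot_process_info() -> dict:
    # SIG_BLOCK with an empty set changes nothing and returns the
    # currently blocked signals.
    blocked = signal.pthread_sigmask(signal.SIG_BLOCK, [])
    return {
        "pid": os.getpid(),
        "uid": os.getuid(),
        "euid": os.geteuid(),
        "gid": os.getgid(),
        "cwd": os.getcwd(),
        "nice": os.getpriority(os.PRIO_PROCESS, 0),
        "blocked_signals": sorted(int(s) for s in blocked),
    }
```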

It's not an unsolvable problem, but it IS a can of worms.  Just plain 
reparenting a process turned out to be complicated enough they made 
reparent_to_init (see kernel/sched.c).

Rob

From: Adi
To: linux-kernel mailing list
Subject: Re: kernel upgrade on the fly
Date: 	Thu, 20 Jun 2002 16:19:59 -0400 (EDT)

Thanks for the responses especially Rob. I was trying to find previous
threads about this and could not find them. Agreed, swsusp is a step
further to that goal; the way that memory is saved though may not make it
necessarily easier, at least in the current state of swsusp.

As you were mentioning, the processes information needs
to be summarised and saved in such a way that the new kernel can pick up
and construct its own queues of processes independent on the differences
between the kernels being swapped.

Well, this does touch the idea of having migrating processes from one
machine to others in a network. In fact, I dont understand why is it so
hard to reparent a process. If it can be reparented within a machine, then
it can migrate to other machines as well, no?

Rob, I am going to the Newark campus FYI, and have interests in some AI
stuff.
Thanks again,

Adi

From: Rob Landley
Subject: Re: kernel upgrade on the fly
Date: 	Fri, 21 Jun 2002 09:42:44 -0400

> Thanks for the responses especially Rob. I was trying to find previous
> threads about this and could not find them. Agreed, swsusp is a step
> further to that goal; the way that memory is saved though may not make it
> necessarily easier, at least in the current state of swsusp.

Several people have mentioned process migration in clusters.  Jesse Pollard 
says he expects to see checkpointing of arbitrary user processes working this 
fall, and then Nick LeRoy replied to him about the condor project, which 
apparently does something similar in user space...

http://www.uwsg.iu.edu/hypermail/linux/kernel/0206.2/1017.html

http://www.cs.wisc.edu/condor/

You might also want to look at the crash dump code (and the multithreaded 
crash dump patch floating around in the 2.5 to-do list) as another starting 
point, since A) it's flushing user info for a single process into a file in a 
well-known format, B) such a file can already be loaded back in and at least 
somewhat resumed by the Gnu Debugger (gdb).

> As you were mentioning, the processes information needs
> to be summarised and saved in such a way that the new kernel can pick up
> and construct its own queues of processes independent on the differences
> between the kernels being swapped.

Which isn't impossible, I remember migrating WWIV message base files from 
version to version a dozen years ago.  Good old brute force did the job: 
new->field=old->field;  There's almost certainly a more elegant way to do it, 
but brute force has the advantage that we know it could be made to work...
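
The brute-force approach can be sketched with two made-up record layouts: an "old" one, and a "new" one where a field was widened and another added.  Neither layout corresponds to any real kernel structure; they only show the new->field=old->field pattern.

```python
import struct

# Hypothetical layouts, little-endian, no padding.
OLD_FMT = "<ih16s"    # pid: int32, prio: int16, comm: 16 bytes
NEW_FMT = "<iiI16s"   # pid: int32, prio: int32 (widened), flags: uint32 (new), comm


def convert_record(old_bytes: bytes) -> bytes:
    # Copy each field across by hand; fields that did not exist in the
    # old layout get an explicit default.
    pid, prio, comm = struct.unpack(OLD_FMT, old_bytes)
    flags = 0
    return struct.pack(NEW_FMT, pid, prio, flags, comm)
```

Every layout change needs a human to decide what the default for a new field should be, which is the part no tool can work out from the structures alone.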

As for maintaining a "convert 2.4.36->2.4.37" executable goes, (to be 
released with each kernel version,) the fact there's a patch file to take the 
kernel's source from version to version should help a LOT with figuring out 
what structures got touched and what exactly needs to be converted.  Still 
needs a human maintainer, though.  It's also bound to lag the kernel releases 
a bit, but that's not such a bad thing...

> Well, this does touch the idea of having migrating processes from one
> machine to others in a network. In fact, I dont understand why is it so
> hard to reparent a process. If it can be reparented within a machine, then
> it can migrate to other machines as well, no?

A process can touch zillions of arbitrary resources, which may not BE there 
on the other machine.  If you have an mmap into 
"/usr/thingy/rutabega/arbitrary/database/filename.fred" and on the remote 
machine fred is there, the contents are identical, but the directory 
"arbitrary" is owned by the wrong user so you don't have permission to 
descend into it (or the /etc/passwd file gives the same username a different 
uid/assigns that uid to a different username...)

Or how about fifos: are they all there on the resume?  Fifos are kind of 
brain damaged so it's hard to re-use them, so "create, two connects, delete" 
is a pretty common strategy.  The program has the initial setup and 
negotiation code, but not And can the processes at each end be restored, in 
pairs, such that they still communicate with each other properly?  What about 
a process talking to a one-to-many server like X11 or apache or some such?  
Freezing the server to go with your client is kind of overkill, eh?  Gotta 
draw a line somewhere if you're going to cut out a running process and stick 
it in an envelope...

The easy answer is have the restore fail easily and verbosely, and have 
attempt 0.1 only able to freeze and restore a fairly small subset of 
processes (like the distributed.net client and equivalents that sit in their 
corner twiddling their thumbs really fast), and then add on as you need more. 
 The wonderful world of shared library version skew is not something 
checkpointing code should really HAVE to deal with, just fail if the 
environment isn't pretty darn spotless and hand these problems over to the 
"migration" utility.

If you're restoring back on top of the same set of mounted filesystems, and 
you're only doing so once (freeze processes, reboot to new kernel, thaw 
processes, discard checkpoint files), your problem gets much simpler.  Still, 
did your reboot wipe out stuff in /tmp that running processes need?  (Hey, if 
it's on shmfs and you didn't save it...)

Also, restoring one of these frozen processes has a certain amount of 
security implications, doesn't it?  All well and good to say "well the 
process's file belongs to user 'barbie', and the saved uid matches, so load 
it back in", except that what if it was originally an suid executable so it 
could bind to some resource and then drop privileges?  How do you know some 
user trying to attack the system didn't edit a frozen process file?  You 
pretty much have to cryptographically sign the files to allow non-root users 
to load them back in (public key cryptography, not md5sum.  Gotta be a secret 
key or a user, with your source code, could replicate the process of creating 
one of these suckers with arbitrary contents in userspace...)
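
The tamper check can be sketched with a keyed hash.  Note that this swaps in HMAC-SHA256 with a secret key held by root, not the public key signature suggested above, simply because it fits in a few lines of standard library code; the function names are illustrative.

```python
import hashlib
import hmac


def sign_image(image: bytes, key: bytes) -> bytes:
    # Without the key a user cannot produce a valid tag, which is what a
    # plain md5sum of the file fails to guarantee.
    return hmac.new(key, image, hashlib.sha256).digest()


def verify_image(image: bytes, tag: bytes, key: bytes) -> bool:
    # compare_digest avoids leaking how many leading bytes matched.
    return hmac.compare_digest(sign_image(image, key), tag)
```

The restore tool would refuse to load any image whose tag does not verify, so the key has to stay readable by root only.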

Again, less of a problem in a "trusted" environment, but this is unix we're 
talking about, and unless you're making an embedded system to put in a toaster 
it will probably be attached to the internet.  And another easy answer is 
"don't do that then", or "only allow root to restore the suckers" (that last 
one probably has to be the case anyway, make an suid executable to verify the 
save files via a gpg signature if you REALLY want users to be able to do 
this, I.E. shove this problem into user space... :)

> Rob, I am going to the Newark campus FYI, and have interests in some AI
> stuff.
> Thanks again,

I'm just trying to give you some idea how much work you're in for.  Then 
again, Linus is on record as saying that if he knew how much work the kernel 
would turn out to be, he probably never would have started it... :)

> Adi

Rob

hmmm... I think I get it...

Anonymous
on
June 23, 2002 - 6:57pm

Well I'm kinda busy at the moment so I only read the first few comments... what I'm wondering is: is this possible? I often use a signal that I'm not quite sure of, because I only use it in keystroke form (ctrl-z), which stops a program in its tracks so you can then (usually) type "bg" to background the process, or kill it because it is about to fsck (and no, I don't mean the program) up your system. Is it possible for a script to send all processes this halting signal and then remove the kernel? Since no process besides the one moving the kernels would depend on the kernel (and the moving one should be extremely low-level, as if you didn't know already :) ), nothing can break, because everything else is halted. Then, once the new kernel is in place, a "resume" signal (I don't know the real name; I just found out ctrl-z is STOP) is sent to all the processes that were sent the STOP signal, and the system resumes normal procedure. The big problem with this is: what if the kernel is broken? Sound nasty?? Don't you love fscking in the morning??

What Ctrl-Z does...

alex
on
June 24, 2002 - 12:43am

All that happens is you suspend the tasks in the current kernel. They are still there, taking up resources and, more importantly, having associated page tables, shared libraries and process states. Saving and restoring that lot is do-able (see swsusp), but upgrade a kernel and change a byte in the kernel process table and you are in a world of problems.

Alex
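
For reference, the keystroke makes the terminal send SIGTSTP; SIGSTOP is the variant a process cannot catch, and SIGCONT is the "resume" signal.  The sketch below stops and resumes a child process (the helper names are made up) and shows that a stopped process is still a live task in the running kernel.

```python
import os
import signal
import subprocess
import sys


def stop_and_resume(pid: int) -> bool:
    # Works only for a child of the caller, because waitpid() is used to
    # confirm the stop.
    os.kill(pid, signal.SIGSTOP)
    _, status = os.waitpid(pid, os.WUNTRACED)
    stopped = os.WIFSTOPPED(status)
    os.kill(pid, signal.SIGCONT)   # the process carries on where it was
    return stopped


def demo() -> bool:
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    try:
        return stop_and_resume(child.pid)
    finally:
        child.kill()
        child.wait()
```

Nothing about the child is saved anywhere while it is stopped; its state exists only in the kernel that stopped it.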

Let's limit ourselves a bit...

tnl
on
June 24, 2002 - 8:06am

The whole process can be simplified if you restrict yourself to hot-swapping between only some kernel versions, and not between every possible kernel version.

In theory, stable kernel versions should be binary compatible with each other, such that a module compiled for a 2.4.x kernel can be loaded into any other 2.4.x kernel (if it's compiled with similar options... which is a big IF). In practice it doesn't work out that way, but it's a starting point.

We can say that our goal is that for any two kernel versions which are binary compatible for purpose of module loading, we can perhaps reasonably hot-swap our kernels.
After all, if two kernels can load each other's modules, that means that internal data-structures are equal. (I might be wrong. But to the best of my knowledge, that's how it is.)
If internal data-structures of two kernels are the same, then no migration/mapping will need to be done of field oldvers->a(int16) to newvers->a(int32) etc etc.

So you have a lot, lot less mess to deal with.

The kernel being unloaded can check for binary compatibility with the kernel it is going to load in, and disallow the operation if it will not work.

Since you will be hot-swapping kernels, and you know kernels to be binary-compatible, a lot of swsusp-code should be fit for the job already.
There is no need to keep frozen versions of programs around, and you won't need to worry that much about missing files, etc.
Not more than for swsusp, and perhaps even less.
You won't go thru a full boot, so your /tmp should be intact, even if it's tmpfs.

Now for cases when you *do* need to upgrade to a binary incompatible version, you will still need a cold reboot. But presumably, many of the bugfix-releases will be binary compatible. And you shouldn't want to hot-swap between development kernels. You should assume them to be broken and thrash internal state, and you shouldn't want to hotswap into another kernel which then tries to do something to that already corrupted internal state.

It would help if kernel version numbering would reflect binary compatibility: add a 4th digit to the version number, which would be incremented for all bugfix-releases which maintain binary compatibility.

So for instance, 2.4.19 won't be 2.4.19 but 2.4.19.0. The next release will be 2.4.19.1, etc, until there is a necessity to break binary compatibility and then we'll have release 2.4.20.0.

This will give users an easy clue as to which kernel-versions can and cannot be hot-swapped IF you don't change your compile-options.
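
Under this proposed numbering (which is a suggestion in this comment, not how kernels are actually numbered), the compatibility check reduces to comparing the first three components. A sketch with hypothetical function names:

```python
def parse_version(version: str) -> tuple:
    # "2.4.19.1" -> (2, 4, 19, 1)
    return tuple(int(part) for part in version.split("."))


def hot_swappable(running: str, target: str) -> bool:
    a, b = parse_version(running), parse_version(target)
    if len(a) != 4 or len(b) != 4:
        raise ValueError("expected four-part versions like 2.4.19.0")
    # Only the fourth digit may differ between binary compatible releases.
    return a[:3] == b[:3]
```

This says nothing about compile options, so the check is necessary but not sufficient.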

Of course, I've oversimplified the picture. But does it sound more achievable? Would it still be worth it? Anyone care to comment on this?

Binary compatibility

flugstadp
on
June 24, 2002 - 11:35am

Binary compatibility w.r.t. MODULES is completely different from
binary compatibility w.r.t. run-time data structures. When they say MODULE
binary compatibility, they mean that the API (i.e. functions
and data structures) a device driver will use will stay the
same. This *rarely* touches on things like process tables,
sockets, and other kernel-internal things, which probably
don't get changed a lot, but still change.

Take 2.4.9 -> 2.4.10 switch, when Linus introduced the new VM
system. I bet that a well behaved binary module probably would
run on both, but the internal data structures that the VM system
used to represent things like a process or file or whatever
changed completely.

Module binary compatibility is not sufficient for this.

Well, see, I told you it was

tnl
on
June 25, 2002 - 1:07am

Well, see, I told you it was a simplification, the way I presented things ;-)

I was hoping that module-binary compatibility would be sufficient. But if it's not, hopefully similar mechanisms can be exploited? For those who think hotswapping kernels is the hottest thing on earth? :-)

Hasn't it been done before?

Anonymous
on
June 24, 2002 - 10:49am

I'm no expert on realtime swapping on kernels, kernels, operating systems, cooking, dating or even sleeping, but doesn't BeOS perform a kernel swap, when you upgrade from BeOS 5 Personal Edition to Professional Edition?

Like I said I'm no expert or even knowledgeable in these areas, but if I'm not mistaken (obviously), BeOS uses a micro-kernel - would that make such a swap easier?

Disclaimer:
I am not an expert on realtime swapping on kernels, kernels, operating systems, cooking, dating or sleeping. Should the world fall apart because you tried using any of my ideas, go sue some one else.

hotswap kernel concept

cicero (not verified)
on
December 27, 2004 - 1:07am

Perhaps the runtime state change issue could be resolved by writing programs that are "aware of", "non-dependent" on, or "selectively dependent" on runtime state. What I'm getting at: if the programs, modules and drivers of the Linux system had knowledge of a "kernel change" event, perhaps they could react accordingly? Perhaps this could merit an entirely new system runlevel?
