KernelTrap has spoken with Linux kernel hacker Andrew Morton. His contributions cover a wide range of kernel components, including ext3 on 2.4 and the low-latency patch. Currently he works for moxi.com.
Please share a little about yourself and your background...
I'm a 42 year-old English-born Aussie. Electrical Engineering graduate
from the University of New South Wales. Married, three kids who are
still so young that I can beat them at Half-Life. We're now in Palo
Alto, California, doing the obligatory stretch in Silicon Valley.
What is your current involvement with moxi.com?
Moxi Digital Inc. is a startup developing an integrated home
entertainment platform. The company was officially launched at the
start of this year. With some style - we were slashdotted and then
took out best of show and best of category at CES 2002. Not bad for a
little Linux box :) We were code-complete two weeks beforehand, which
was pretty amazing.
I am a member of the System Software group - kernel, performance,
hardware bringup, 802.11a device driver, etc. H.J. Lu (Linux binutils
maintainer) works here. Also Zack Brown (the Kernel Traffic guy), San
Mehat, Jeremy Fitzhardinge and a few other googleable names.
When did you get started with Linux?
Well, I've always been that-way inclined. Back in '86 I developed a
build-it-yourself 68000-based computer. Both the hardware and its
unix-like operating system. We sold about 400 of them. We licensed
Minix from Macmillan and my great friend Colin McCormack ported it - I
think this may have been the only non-PC port of Minix. The Applix 1616
project was fun, and a lot of people learned a lot of things.
I spent nine years at Nortel Networks R&D, a lot of it in management.
Nortel has a very strong management culture, and it was a great
experience being responsible for delivering complex five-nines products
into the hands of very demanding customers. But I felt that I wasn't
learning any more, and it was time to cease being a PHB and go back to
development. Hence Moxi.
I've been a Linux user since about '94. Came a day in March 2000 when
I decided to give this 2.3.47 thing a whirl, only to discover that Alan
had marked my ninety-cent-NIC as obsolete! We couldn't have that, so I
sent Linus a 2,500 line patch. It was rather satisfying doing things
which others found useful, so I kept on doing it.
We spoke with Robert Love last October, learning about his preempt patch. How
does your low-latency patch differ from his?
Ingo Molnar broke the ground here with his 2.2.12 patch which demonstrated
that Linux could fairly easily yield task activation delays which are one
to two orders of magnitude better than any competing operating system.
The approach taken by these patches is basically cooperative multitasking.
The developer identifies sections of long-running kernel code and changes
them so that they will yield the CPU to another task if the scheduler
says that it's time to do that. Most of the complexity here is in
being able to back out of any locking before yielding, and in cleanly
reacquiring locking state when the interrupted task resumes.
With an internally preemptible kernel the explicit task yielding is not
necessary, because the context switch is performed in the interrupt
return path and via open-coded yields which are hidden in the unlock
code. But you cannot preempt an in-kernel process while it holds
locks, so all the unlock, relock and fixup code is needed in either
case.
The low-latency patch yields worst-case latencies of around 1.5
milliseconds at present. The preempt patch is around 80 milliseconds,
but with the locking changes it should also yield 1-2 millisecond latencies.
In other words, the low-latency patch should have a much more noticeable effect
than the preempt patch?
I'd expect low-latency to be a little more noticeable, for normal
desktop use. The problem areas which affect desktop use are very few,
and both patches address them. There is a very small and simple patch
which is sort of ping-ponging between Andrea and myself which will
sufficiently address day-to-day interactivity. We need to get that merged.
But even the stock kernel has quite good interactivity - it is
extremely rare for the kernel to hold the CPU for longer than a monitor
refresh interval. I suspect that there's something psychological at
play here. Very low latency is really quite a specialised requirement;
it's the more demanding high-end audio, video and game applications
which need the harder sub-two-millisecond responsiveness.
Have you worked with Robert to combine efforts?
Not to any great extent. Robert has a "lock-break" patch which is based
on the low-latency patch's lock-mangling code. It would be quite simple
to change the low-latency patch to take advantage of the preempt patch,
so the low-latency patch basically *is* the lock-break patch.
Is the low-latency patch compatible with Ingo's new O(1) scheduler in 2.5,
as well as with the patch for 2.4?
Yes, some people are using this combination, though I am not maintaining such a patch.
Do you have plans to get your low-latency patch into the 2.5 kernel?
Not in its present form. First I'd like to see a general consensus
that Linux should become a low-latency OS. With that, we can then
merge the preempt patch and start work on carefully and cleanly
addressing the various long-held locks. Simply dropping locks and then
cleaning up the mess is not a very satisfactory solution - better would
be to change the local locking design, or to speed the locked code up
by a factor of one hundred. The latter usually is not possible.
I do not plan on spending much time on 2.5 for a while. I'd rather
invest effort in things which are more immediately needed by people.
There are still many correctness, performance and
quality-of-implementation issues to be addressed in 2.4.x. The 2.4.x
core has only stabilized very recently and there remains quite a bit of
tuning and mop-up work.
What are some of the most outstanding issues that still need to be addressed?
One big one is (surprise) the VM. It is still not working adequately.
Andrea has a patch which I'm sure will improve things. But this patch
is big enough to stun an elephant and needs to be split up, cleaned up
and fed into the tree in a way in which we can monitor its effects.
By all accounts it is working well for SuSE customers and those who have
tested it, but until it's fully integrated, more widely tested and
everyone is reasonably happy with it, we have a VM problem.
Apart from that, various reports of machines mysteriously locking up
under load. The ia32 APIC handling still seems to be wonky. Very bad
disk read latencies when there is a heavy write load. Reports of
disappointing filesystem throughput - Andrea's patches may solve most
of these, but of course we don't know yet. Andre's big IDE patch needs
to be merged in and settled down.
We're getting a number of reports of corruption of some long pointer
chains in the kernel. Some of these will be due to bad memory, but not
all, I suspect. Something somewhere is stomping on memory. Possibly
the problem was introduced around the 2.4.13 timeframe. I'm collecting
these reports, trying to discern a pattern.
The "Kernel of Pain" thread at Slashdot was very, very interesting - I
think it shows that we just ain't there yet.
Do you have any predictions as to when 2.4 will reach a stable state?
Not really. Six months, perhaps?
How will we know when it's there?
When I say it is :-)
I guess when the number of reports of repeatable and obviously bad
behaviour has tapered off to a very small level.
How did you get involved with the ext3 filesystem?
At that time, filesystems and VM were a gap in my knowledge. It seemed
that ext3 was languishing somewhat, and that this was a way to learn
new parts of the kernel while doing something useful. Later, ext3
became a Moxi product requirement - owners do not expect household
appliances to spend fifteen minutes running fsck.
Another compelling reason: I looked at the code. Stephen Tweedie is a
true artisan. The "journalling block device" layer of ext3 is easily
the most complex part of the kernel with which I have had experience.
Yet I was able to sufficiently understand it in one or two months; this
is almost entirely due to his skillful commenting.
Peter Braam had done the initial 2.4 port and I mainly concentrated on
the grungy testing, test tools, instrumentation, profiling, tuning,
locking, VM interactions, performance issues and generally getting
things into shape for a Linus merge. Andreas Dilger helped out with
quite a few of the architectural issues. And of course, there is the
e2fsprogs package maintained by Ted Ts'o. ext3 wouldn't be worth squat
without those tools.
After a while, when he saw I actually had something which worked, and
after he had recovered from the birth of his first child, Stephen came
on-line. He did a lot of work and fixed some fairly deep problems, one of
which I'd probably still be scratching my head over.
What was the problem?
Don't ask - this was deep ext3-fu. There are some flags associated
with journallable disk blocks which indicate that the block has been
"revoked" - not to be replayed on recovery. But these flags were
getting confused by buffer cache aliases of the same blocks. It was
triggering one of the 150 assertion checks.
How does the performance compare between ext3 on 2.2, and on 2.4?
I have not tested for that. I'd expect the CPU load to be a lot less,
and the disk throughput to be somewhat higher.
I've been happily using ext3 without problems since it was first introduced
into the -ac kernels. How stable do you consider this filesystem to be? How
much does the code base differ from ext2?
Ext3 is very stable. I just test it to death - since the release I've
found many more bugs than the user population has, and I trust it. The
indications are that ext3 has had a very high adoption rate. It may
already be the most-used Linux 2.4 filesystem; it's very hard to tell.
The filesystem is absolutely paranoid about looking after your data -
it has 150 internal consistency checks, any one of which will
deliberately crash your machine if it is violated. This protects your
data from both software and hardware bugs, and ensures that the
developers get to hear about any problems. That is just good,
hard-nosed engineering practice.
There are a number of somewhat ugly things (my doing) which do need to
be cleaned up. But the ext3 port started very late - the first usable
version was for kernel 2.4.5. And it works. Stephen and I are very
much of the same pragmatic mindset here - if it works, don't futz with
it unless there's a damn good reason.
There is a throughput glitch which can be experienced under some loads
- stalls on large writeouts. These can be tuned away, as described in
Daniel Robbins' second article (linked to from
http://www.zip.com.au/~akpm/linux/ext3/) but we do need a native fix to
the ext3 write scheduling for this.
Ext3 has a tendency to start the disk up frequently, which is
irritating for laptop users. I have a personal fix for that, but
Stephen would shoot me if I published it :)
Longer-term, I'd like to address synchronous operations - for
mailspools and such. ext3's performance is already exceptional with
these, but can be improved. Also on-line defragmentation, which will
enable a redesign of the inode allocation policy, which will yield huge
speedups. It is quite alarming how much ext2 and ext3 can be sped up
via this. But it's a hard and potentially risky problem to solve.
The architecture of ext3 in 2.4 is basically unchanged from 2.2. But
the implementation is significantly different. The 2.2 version's many
changes to the core kernel had to be hoisted out. The changed VFS
locking rules, the changed VM design and the quest for performance and
robustness necessitated many changes.
In Linux, filesystems normally do not perform I/O for user data.
Instead, they provide the core kernel with a mapping between user data
and disk blocks and the core kernel handles the I/O. But for
data-journaling filesystems (ext3 being the only one we have), much of
this was quite wrong. It took some time to sort this out without
copying great chunks of the core kernel into ext3.
I found only three or four bugs in the 2.2 filesystem, and only one of
those was serious.
What about ReiserFS? XFS? JFS? It seems there are several journaling
alternatives to ext3, though none of the others have the convenience of
compatibility with ext2.
They all have their strengths and weaknesses.
ext3 is the only one which takes care of user data as well as disk
metadata - with the other filesystems, a crash+recovery can leave old,
stale disk block contents inside files. This is a security issue, but
the value of ext3's behavior can be overstated - even though the ext3
file will contain the correct data, it could be shorter than you
expect, which makes it still useless.
Reiserfs has better small-file performance. XFS should have better
large-file performance. I'm not sure about JFS; the code looks really
clean and nice - it uses all the standard kernel infrastructure in the
designed manner. I'd expect JFS to perform well with large amounts of data.
The ext2-compatibility is a bit of an albatross around ext3's neck, in
a way. It makes people assume that ext3 is just ext2 with a journal
kludged onto it. ext3 is a physical, block-level journalled
filesystem; this is a perfectly legitimate design approach and it makes
the addition of full data journaling quite simple. And given that you
have decided to implement journaling this way, it's only sensible to
make the on-disk format compatible with ext2. And not just to ease the
conversion - we mustn't forget the excellent support tools in
e2fsprogs; ext3 gets the support of e2fsprogs because of the on-disk compatibility.
What is it about ext3 that causes it to start the disk up frequently?
If any application generates a write, ext3 will send that write to disk
within five seconds. And even on a fairly idle machine, there's always
something which is generating writes. With ext2, one can tune the
dirty data writeout interval to be very large, so the data simply
remains in memory for a long time. ext3 ignores that tunable.
A lot of miscellaneous write activity can be caused by file access time
updates. Laptops should always use the `noatime' mount option against
all partitions to prevent this.
How does your fix solve the problem?
It sets the journal commit interval to infinity, so the writeout
interval is governed by the bdflush tunable. That's OK. But it also
disables the fsync() function completely, because there are various
applications which think that you want them to sync their data to disk.
What's wrong with your solution, that you won't publish it?
Nobbling fsync() is not a thing we do in polite company. It's there
for a reason - to commit changes to non-volatile storage. Maybe making
it a mount option would be OK, but it's not really the "ext3 way".
What comparison can you make between the current 2.4.x VM, and Rik van Riel's VM?
Rik's was, I believe, an ambitious attempt to provide quite advanced
page replacement algorithms. But I do not believe that the code in the
mainstream kernel ever implemented those algorithms correctly. The
version in the -ac kernel stream got close, and works pretty well.
The current page replacement code is simpler, more straightforward - it
will be less prone to weird interactions and corner cases.
But a lot of people understood the old VM - the new one is rather a
closed book, and until the current off-stream patch is integrated
there's not a lot of point in investing effort in understanding the VM.
Are you happy with Linus' choice with the new VM?
I often wish that Linus would reject patches due to
inadequate commentary, and like much of the kernel the VM remains
poorly documented, and it is also _important_. But apart from that,
Andrea is a freakishly good programmer, and I expect the VM problems
will be settled soon.
What other contributions have you made to Linux?
I'm fairly unique in that I don't really "own" anything in the kernel.
I've spent almost two years with my nose stuck in other people's stuff,
fixing problems. I think in all that time I've added just two teeny
files to the tree.
The bugfixing and performance tuning has led me into most of the
subsystems in the kernel as well as the core. And I simply must say:
if the rest of the kernel was as skilfully commented as ext3, I would
be able to improve Linux at a much higher rate. Make of this what you
will. I get fairly worked up about this problem.
What main tools do you use when developing?
I use a powerful laptop as workstation, CVS and NFS server. And an SMP
machine for testing and kernel builds across NFS (and for whupping the
kids at Half-Life).
With this setup, the laptop has an image of the currently-running
kernel and source tree, which is just what you need for running kgdb
(GDB for the Linux kernel) across a serial cable.
I use kgdb extensively. Couldn't live without it, and my development
practices are built around it. My eyes just roll when I see other
developers struggling with crashes and livelocks, when I know they
could simply hit ^C and go for a source-level poke into the live kernel
internals. It's rather weird.
My editor of choice is Dde. It started out in life as the inbuilt
editor in the Applix 1616 EPROM. The code is horrid but I can do some
pretty whizzy things with it. Plus it is WordStar-compatible!
Linus has refused to include a debugger with the standard Linux kernel. What
do you think of this?
It is a costly mistake.
The distributed nature of kernel development and testing means that the
developers are often faced with remote email-based problem diagnosis.
We do not get to make on-site customer visits. These debug sessions are
slow and often end in failure - it's a huge problem. We need to put
the best possible tools in the hands of the testers and users so that
we can diagnose problems which the developers cannot reproduce.
Do you use other operating systems besides Linux?
I really can't be bothered fiddling with 3d support on Linux, so I
switch to the Dark Side for whupping the kids. Nothing else.
Have you met Linus? Alan? Other kernel hackers?
Yes, I met most everybody at the Ottawa Linux Symposium in 2000 and
at the 2.5 summit last year. They were both great opportunities.
BTW: If you ever get offered a blind date with a kernel developer, ask
for Rusty Russell. He's a clever and really witty person.
Do you care to elaborate more on this humorous comment?
He's just a really sharp and witty guy - anyone who has attended one of
his sessions at a conference will attest to that!
What do you enjoy doing when you're not hacking on or otherwise using your computer?
Oh dear. I guess I'm rather dull. I took up Australian football at
the ripe age of thirty-five, and spent a few years getting biffed by
guys half my age. But then I blew my knee and got fat.
But I've always been a privateer kernel developer, and fitting that
around a real job and a family doesn't leave a lot of time.
What advice can you offer to people just learning to become kernel hackers?
There is a great amount to learn, and you don't learn it by reading the
code, or by reading the mailing list. You have to actually get in
there and change something. It's only when you try to do the hands-on
things that you realize how much you don't know.
And to get hands-on, you need a *reason*.
Profile the kernel, try to fix a performance bottleneck.
Pick a neglected bug report off the kernel list, work it with the
originator, see if you can fix it.
Pick a neglected device driver or filesystem, try to improve it.
There are new block and net driver interfaces coming through in 2.5 -
pick a driver for which you have hardware and take care of it.
Don't spout off opinions on the kernel mailing list.
Ignore what the kernel developers say about C++.
If you need to write, say, a net driver then don't! First, take an
existing one and make it beautiful, or more correct, or twice as fast,
or add a feature. Then take that knowledge to the new driver.
One hot tip: if you spot a bug which is being ignored, send a
completely botched fix to the mailing list. This causes thousands of
kernel developers to rally to the cause. Nobody knows why this
happens. (I really have deliberately done this several times. It works).
Write a really good filesystem and VM performance measurement suite
which models real-world workloads. The existing ones are almost
useless for this, and this gap in the toolset makes it much harder to
tune filesystems, and to discover performance improvements and regressions.
There are many things to do, and Linux doesn't really need thousands
of people adding even more code.
Can you offer any examples of botched fixes you've intentionally proposed?
Anything which contains the text "I don't know diddly about..." :)
There was a TCP throughput regression in 2.2.17-preX, some module
refcounting problems in the 1394 stack, and a recent lockup with the FAT
filesystem comes to mind.
Is there anything else you'd like to add?
The rate of kernel development has always struck me as being quite
slow. The 2.4 series has been in "feature-freeze" for almost two
years, and even now I'd say that it is in a late beta state.
Unfortunately, the remaining problems are *hard* ones. And it is often
the case that the developers simply cannot reproduce the reporters'
problems. A reproducible bug is a fixed bug, and an unreproducible bug
remains a mystery.
I would like to see the kernel code reflect this reality - that users
and developers are far apart, that they often speak different
languages, that the user has limited time, patience and skill for
testing patches and for running tests.
We would be able to harden and tune the kernel a lot faster if it
contained decent internal diagnostic, logging and telemetry features.
But it has none of these, and those developers who have not moved over
into the 2.5 stream are left with a big problem.
Unfortunately, we are not likely to get this.
Also, there has been quite a lot of talk lately about kernel
development processes, patches getting dropped, etc. I think it's all
terribly overblown. The people who aren't being heard (and who aren't
even bothering to comment) are the _users_ of that system - the
developers. We're all just rolling our eyes and waiting for it to
stop. The current system could be more efficient, but it mostly works
OK; it is very unlikely to change and anything like a kernel fork is
hugely improbable, even if Linus gets bored of it all and decides to do something else.
Thank you for all your time! Your Linux contributions are much appreciated by
myself and many others!
About the interviewer:
Jeremy Andrews was born and raised in Southeast Alaska. Currently he lives and works in South Florida. He maintains KernelTrap as a hobby.