KernelTrap has spoken with Peter Chubb, who currently works for the Gelato Project. His efforts are presently focused on 64-bit support for large disks and partitions. Regarding the project's focus on improving Linux support for the Itanium 64-bit processor, Peter says, "Back in the days when the VAX was king, there was a general assumption amongst some programmers that `all the world's a vax'. In the Linux world, there's a similar assumption: `all the world's a pentium'."
Peter lives in New South Wales, Australia, with his wife, Lucy, also a kernel hacker, and two daughters. He earned a PhD under the late John Lions, author of the Lions Book. His UNIX kernel hacking experience spans an impressively large number of kernels.
Jeremy Andrews: Please share a little about yourself and your background...
Peter Chubb: Hi,
I work at and live near the University of NSW (http://www.unsw.edu.au), which is *the* hotbed of Operating Systems research in Australia. It also has a very good church (http://www.matthias.org.au) that meets on campus. I'm married to another kernel hacker, Lucy Chubb, and we have three daughters, the two youngest (ages 1 and 2.5) still living.
I have undergraduate degrees in electrical engineering and physics from UNSW, and a PhD that I did under the late A/Prof John Lions (whom many of your older readers will remember as the author of the Lions Book: http://www.salon.com/tech/feature/1999/11/30/lions and http://www.peer-to-peer.com/catalogs/opsrc/lions.html).
JA: Beyond the Gelato project, what other sorts of exciting things are happening at the University of NSW?
Peter Chubb: I don't know everything that's going on -- only the things near to the Gelato project.
As one of the major performance bottlenecks on a modern architecture is the small size of the TLB compared with the size of the physical address space, anything that improves TLB usage is good.
If you can arrange for shared memory regions (e.g., shared library text) to be mapped at the same virtual address in all processes that use the region, then a single TLB entry can be used for all processes --- at least in theory. Doing this in a general-purpose operating system such as Linux is harder (although SGI did something similar in Irix).
Likewise, increasing page size means fewer TLB entries to cover the same virtual address space --- which means fewer TLB misses which means improved performance.
Other non-operating-system work that's exciting is the robotics and artificial intelligence synergy that led to the UNSW team winning the robot soccer competition (see http://www.cse.unsw.edu.au/~robocup).
JA: How and when did you get started working with computers? On UNIX kernels?
Peter Chubb: My first use of a computer was at a church camp, `CampTech', where one of the features was a Micro-SC/MP computer that one could program by switches on the front panel. Two years later I started at UNSW (in 1979), where a locally-modified 7th edition Unix on a pdp-11/70 with 128k of memory supported 40 or so simultaneous users. The following year I taught myself C using the on-line `learn' system and a first edition K&R. However, I didn't get stuck into the Unix (or any other) kernel until I started on my PhD in 1985. My wife was at that stage involved in porting 7th edition UNIX to the NS16032, and I helped to review her code.
JA: Is this how you and your wife first met?
Peter Chubb: No, we met in the residential college we were both at, and through the Christian group there.
JA: What are some of her other kernel involvements?
Peter Chubb: She's working with me on the Gelato project --- she's currently working on super-page support for IA64 Linux. She also worked at Aurema, as I did, though usually on different things.
JA: By the on-line 'learn' system, do you mean actually writing code and experimenting?
Peter Chubb: Back in 7th edition UNIX there was an on-line teaching program called `learn'. It gave you exercises to do, then checked that the answers worked. So yes, writing code and experimenting, but in a structured manner.
JA: On your homepage you mention having worked for Aurema Pty Ltd for over 10 years. What did your work there involve?
Peter Chubb: Lots of things:
JA: Can you explain a little about how checkpoint-restart works?
Peter Chubb: Checkpoint-restart (CPR) and resource management are related but not directly.
The Hibernator project (as it became) was to allow arbitrary processes and process groups to be stopped, saved to disc, and continued at some later time. The idea was that, on a supercomputer that needed regular downtime for maintenance, you could save all the important work before bringing the machine down, then restart it later. (One machine I worked on needed its cooling system cleaned once a month, so it was down for half a day on the first Monday of each month. After Hibernator was installed, any job that took more than a month to run could be checkpointed before the scheduled downtime and restarted afterward.)
You could use CPR for resource management in conjunction with NQS or similar batch scheduling systems. Usually when using NQS you specify a CPU-time limit per queue; if a job takes more than that it's killed. With CPR, you can checkpoint the job instead, and at some later time restart it.
JA: What UNIX flavors have you done kernel work on?
Peter Chubb: Lots... 7th Edition, SVr4, SVr4ES, UXP/M, UXP/DS, IRIX 6.2 and 6.5, Solaris, Reliant, Tru64, Linux, FreeBSD, SCO UnixWare, and maybe some others that I don't remember.
JA: Of these different UNIX flavors, do any stand out as having a superior overall design?
Peter Chubb: 7th edition. Many of the others look internally more like road-crashes between different previous versions.
JA: What is it about the 7th Edition that made it superior?
Peter Chubb: Simple and clean --- you could carry the model of how it worked around in your head, and there were few or no exceptions to that model. Most of the others started out clean but were filled with accretions and excrescences wherever another system had a feature that was then bolted onto the current one. One particular system (unfortunately I'm still under NDA or I'd tell you which) had great swathes of repeated functionality, done in slightly different ways in order to present slightly different interfaces to different kernel subsystems, some implementing USG functionality, others UCB. (That's the UNIX Systems Group, source of the System-V variant of UNIX, and the University of California, Berkeley, source of the so-called BSD strain of UNIX.)
JA: How did you get involved with Linux?
Peter Chubb: I bought a 486DX50 machine and installed the Manchester Computing Centre Linux distribution on it. I bought the machine primarily because I was interested in serving text documents locally, so it hosted a then-complete copy of the Gutenberg archive and some text indexing and compression tools. Unfortunately, work got really busy, so I never completed the project.
JA: Now that you've used Linux a while, what are your impressions, especially compared to the many other UNIX kernels you've worked with?
Peter Chubb: Because Linux does not have the same lineage as the others (all of which can trace their roots back to 6th-edition and 7th edition UNIX), I find it a little harder to find my way around than I did the others. However, the same things have to be done, they're just done in different places or using different algorithms.
So far, Linux is reasonably clean, and it looks to me as if Linus is trying to keep it that way.
JA: What are some of the major differences you've noticed?
Peter Chubb: The biggest difference is superficial --- in 7th-edition-derived trees, there are parallel per-architecture and generic trees. One builds in the per-architecture tree; it pulls stuff in as required from the generic tree. The biggest advantage of that is that one doesn't have to have everything on one's disc -- just the architectures one's working on. (Actually, you could do the same with Linux, but because the architecture-dependent trees are subdirectories of the main Linux tree it's a tiny bit harder to separate them out. And as discs are getting bigger all the time, why bother?)
JA: When did you work with FreeBSD? What was your impression of the FreeBSD kernel?
Peter Chubb: It was just after FreeBSD came out, and I can't remember much about it.
JA: I've read a little about your current involvement with the Gelato project at the University of New South Wales. The project's home page mentions that you're currently working to clean up the system interfaces to be 64-bit ready. What exactly is involved in this effort?
Peter Chubb: Back in the days when the VAX was king, there was a general assumption amongst some programmers that `all the world's a vax'. In the Linux world, there's a similar assumption: `all the world's a pentium'. In particular, many programmers assume that a function pointer is the same size as a data pointer, that a long is the same size as an int (and is 32 bits), and that a pointer is the same size as a long.
These assumptions are not valid in all cases. In particular, on the Itanium, a function pointer is 128 bits, a data pointer is 64 bits, a long is 64 bits and an int is 32 bits. Part of what I'm doing is gradually going through and looking for assignments and arithmetic that assumes things about the size of an object, and fixing them.
One of the first things I noticed was that the number of blocks on a disc was measured sometimes as an int, and sometimes as a long, hence...
JA: You recently submitted a patch to the Linux kernel mailing list to clean up support for large disks and large partitions. Have you received much feedback on this patch? Any from Linus?
Peter Chubb: The response has been overwhelming to me. I've had an awful lot of email to answer --- but none as yet from Linus. However, I'm hopeful that as the feedback I've received from other kernel maintainers is worked into the patch, eventually it'll become part of the mainline kernel. Even without the configuration option for 64-bit block offsets on 32-bit platforms, the patch cleans up the interfaces so that a 64-bit platform can use enormous discs with no performance penalty.
JA: What are some of the improvements that have been suggested?
Peter Chubb: Just little things -- moving stuff into different headers, making the config option architecture dependent, etc.
JA: How complete is the patch? Does it allow a 64-bit platform running Linux to fully utilize 64-bits when accessing filesystems?
Peter Chubb: That's a complicated question. The answer is `I don't know', but I'm trying to find out. The issues are:
As far as I know (and I'm trying to dig up access to the hardware to confirm this) IA64 with my patch and a qlogic1280 or similar high-end controller will be able to talk to as large a disc system as is readily obtainable today.
JA: So the actual limits are determined by current technology. What are the theoretical limits?
Peter Chubb: On a 64-bit system, you'll be limited by the file system layout. File systems that use 64-bit offsets can use block devices up to 2^64 bytes (because you're still limited by the range of an off64_t; with a larger off_t you could go up to 2^64 blocks, but I can't see that happening)
JA: I read that you've installed lockmetering onto an SMP machine. What is lockmetering?
Peter Chubb: Lockmetering comes from the SGI world --- it's a way of seeing which locks are `hot' -- i.e., contested. After the kernel patch is installed, every time a lock is taken the kernel measures how long a thread has to wait for the lock to become available, and how long it then holds it. Every 24 hours or so I look at the statistics to see where there are problem areas.
If a lock is contested highly, it's a sign that there may be performance problems associated with the code around the lock.
The Gelato project is about making IA64 work really well. Although there's been quite a lot of work in reducing latency for other platforms, there's been little done on IA64 yet --- hence the lockmetering.
At present, I'm just gathering data.
JA: What sorts of problems have you tracked down with this method?
Peter Chubb: The big kernel lock being held for over a second!
JA: What is the 'big kernel lock'?
Peter Chubb: When Linux was first ported to an SMP machine, all accesses to the kernel were protected by a single lock, the `Big Kernel Lock'. Access to the BKL is a bottleneck. So recent work has been to replace the BKL with finer grain locking, to reduce contention. However, there are still some places where it's used. Many people are working on removing it, and either introducing algorithms that don't need locking, or using locks that protect access to much smaller pieces of data.
JA: What other kernel related projects are you currently involved in?
Peter Chubb: Not a lot just now. I'm hoping to finalize the big-discs patch soon, and start adding preemption support to the IA64 kernel; then maybe other things. I'm very interested in kernel threads, and may do some work there later on.
JA: What do you intend to do with kernel threads?
Peter Chubb: It's a bit early to say yet... On IA64, the L4 kernel team has got context switching between threads in the same address space down to around 200 instructions. My measurements on IA64 Linux show thread context switch times an order of magnitude bigger than this (around 1.2 microseconds on an 800MHz machine). Introducing true light-weight threads (as opposed to the current model using clone, where a thread is just a process that happens to share some state with its parent) may go some way toward reducing thread overhead.
JA: Is kernel threading handled differently than user-space threading?
Peter Chubb: Yes. And no. Traditionally, POSIX threading libraries have worked on one of two models: either each user-mode thread is bound to a single kernel thread, or many user-mode threads are multiplexed onto fewer kernel threads. The Linux threading library does the former, although there's work going on at IBM (http://oss.software.ibm.com/developerworks/projects/pthreads) to implement the latter. Either way, user-mode threads are bound to a kernel thread while running.
The kernel threading model is not the POSIX-threading model used in userland; the locking and synchronization primitives are quite different.
Within the kernel, there's little distinction between threads and processes. Context switching between threads (which *ought* to be cheaper than between processes) takes about the same length of time as between processes.
JA: What do you enjoy doing when you're not working on your computer?
Peter Chubb: Playing with my daughters, playing music (clarinet, recorders, piano, & guitar), reading (my wife and I have a library of around 5500 books, almost all catalogued), and watching the fish in my aquarium. Our family does not have a television.
JA: With both of their parents being kernel hackers, would you say a similar future is in store for your daughters?
Peter Chubb: Well... it's certain that they'll understand more of what a computer really does, rather than seeing it as just a games machine.
JA: Do you and your wife both play music?
Peter Chubb: Yes. Lucy plays percussion --- doumbek mostly --- and we play together in the Church music team.
JA: What sorts of books are in your library?
Peter Chubb: Oh, lots of different kinds. Around 400 cookery books (both Lucy and I are keen cooks); a large number of fiction books (especially science fiction, fantasy, and detective fiction, but also a complete collection of Georgette Heyer, an almost complete collection of Biggles, all the Oz books, etc.). I collect books on the heroic age of Antarctic exploration; Lucy collects books on Chinese and SE-Asian art (she's a painter as well as everything else); and there's the usual rag-bag of reference books that almost anyone has (including the 20-volume OED).
JA: What advice can you offer to those of us only beginning to become involved in kernel hacking?
Peter Chubb: Read the code. One of the difficulties I had (and still have) is that so much is assumed and not documented. A good way to get into things is to start documenting some bit of undocumented code.
JA: Would the Lions book you mentioned earlier still be useful to people working in UNIX? How about people starting to learn the Linux kernel?
Peter Chubb: Yes. The Lions book describes a real, small operating system (and you can now sign up for a free licence for that system) that you can hold in your head in its entirety. The problems it solves are the same as the problems we have today --- arbitrating access to scarce resources between competing processes belonging to different users. Many of the solutions it adopts are still there (albeit sometimes in different forms) in all modern Unices.
JA: Thank you very much for talking with me! I look forward to the day I have a drive large enough to utilize your 64-bit patch.