Kqueue(2), the scalable facility for event notification in the FreeBSD kernel, has experienced various performance and stability issues over the last several months. This has largely been fallout from the ongoing introduction of fine-grained locking into kernel subsystems and kqueue's reliance on GIANT, further complicated by an abundance of entangled callback routines.
There has recently been some discussion on the freebsd-arch list concerning kqueue’s shortcomings and potential areas for redesign. In mid-April, Brian Fundakowski Feldman submitted controversial proof-of-concept code that relieves kqueue of some of its less utilized features and adds an internal global lock into the kqueue subsystem. He has reported positive results:
"I believe I have come up with a good solution to the kqueue woes in 5.X, and I'd like to get some feedback on work that so far is letting me (on uniprocessor, at least) run make -j8 buildworld, with USE_KQUEUE in make(1), with no ill effect :) The locking thus far is one global kqueue lock, and I firmly believe we should use MUTEX_PROFILING to determine if we should lock it down any further at this point."
For a brief introduction to kqueue, its present issues, and details on testing this code, please read on.
Written by Jonathan Lemon and introduced in FreeBSD-5.0 and FreeBSD-4.1, kqueue addresses performance limitations inherent in select() and poll(), the traditional mechanisms for Unix process event notification. Kqueue offers significant improvements over these methods by providing linear response time as the number of file descriptors increase as well as generalizing notification activities to handle socket events, vnode operations, file system events, signal delivery and process creation/exit. Since its introduction in 2002, kqueue has been ported to Linux, OpenBSD, NetBSD [story], and MacOSX (Darwin).
The largest issue surrounding FreeBSD's kqueue implementation is that it's not multiprocessor-safe, a problem that was brought to light in 5.1-current and has since been identified as a major focus area for testing in the 5.3R todo-list. As described by Stefan Farfeleder (who has recently been granted commit bits) in Problem Report kern/57945:
"The current kqueue implementation does not seem to be MP-safe. The kqueue facility does not having its own locks, I believe it was intended to rely on Giant. As Giant gets locked down, kqueue was apparently forgotten. The KNOTE() macro is spread over the whole kernel, sometimes called with Giant hold, sometimes not. The function knote_enqueue() (called via knote() and KNOTE_ACTIVATE()) inserts a node into a TAILQ, this operation isn't atomic. If another processor executes KNOTE() or kevent() at the same time, the queue might get corrupted."
A very educational freebsd-arch discussion in mid-April concerning mutex lock recursion provides a sufficient outline of the complexities of kqueue's recursion. A followup thread from Brian Fundakowski Feldman highlights his solution which he then elaborates on through the process of several addendums. Due to the subject matter being spread among multiple threads, below are Brian Feldman's successive announcements. Links to full threads are provided in the reference section at the end of this article, along with Jonathan Lemon's original whitepaper on kqueue.
From: Brian Fundakowski Feldman [email blocked] To: freebsd-arch Subject: kqueue giant-locking (&kq_Giant, locking) Date: 2004-04-17 2:12:42 I believe I have come up with a good solution to the kqueue woes in 5.X, and I'd like to get some feedback on work that so far is letting me (on uniprocessor, at least) run make -j8 buildworld, with USE_KQUEUE in make(1), with no ill effect :) The locking thus far is one global kqueue lock, and I firmly believe we should use MUTEX_PROFILING to determine if we should lock it down any further at this point. There are several major differences so far (of course, fixing that stack-paged-out-kernel-crash-bug is one of them) and several major things still to be fixed. 1. The recursion has been removed from kqueue. This means kqueues cannot be added to other kqueues for EVFILT_READ -- yes, that ability has been around since r1.1 of kern_event.c, but it is utterly pointless and if you take a look at my previous patch, severely complicates many things. Of course, I'm sure someone will notice and complain, but there isn't any documentation that suggests you should kevent() another kqueue(). 2. Because of this, KNOTE() can't end up calling another KNOTE() unless the consumer does something stupid (call KNOTE() from filter::event()). 3. Kqueue does the locking for you when it comes to the non-object lists. All of the filter::attach() and filter::detach() routines need to lock their object lists, but they don't touch kqueue or knote other than setting their own knote's fields. Both of those routines are called without any locks held on kqueue's part. 4. The filter::event() routines are called with internal kqueue locking held. You can lock anything else you need to, but you may not sleep; it is essentially like an interrupt handler. You must not call into KNOTE() with locks held, but you should reference your object. I've fixed what appears to be the most egregious offender, sys_pipe.c 5. If KNOTE() as an interrupt does not work for you, you may call KNOTE() with any locks you like except the ones it uses internally (mainly filedesc and file), but the only information you can give your filter::event() is the hint argument. Examples of #4 are bpf and pipe; they do not need to pass any information in the filter::event() hint, and as every handler that works on the object instead of on hints needs to do, they verify for certain whether or not the KNOTE() should have actually fired and ignore falses. The biggest example of #5 is process events. There are many different process-type locks that may be held when KNOTE() is called, but the implementation of filter::event() is mostly correct in locking nothing. In kern_fork.c, KNOTE() is called outside of the proc lock (p1->p_klist not locked as it should be) because it has to be special-cased somehow. This is the most disgusting thing EVAR. (NB: See http://green.homeunix.org/~green/kqueue-locking.1.patch for that.) Current patch at: http://green.homeunix.org/~green/kqueue-giant-locking.0.patch -- Brian Fundakowski Feldman
From: Brian Fundakowski Feldman [email blocked] To: freebsd-arch Subject: stable kqueue locking up and running on SMP Date: 2004-04-21 4:39:52 Please test out and provide feedback for the latest iteration: Things are now working in exactly the way I described I was going to implement them -- there is one kqueue lock, and if, as I suspect, it will _not_ be a point of contention, there is no reason at all to complicate things further. To do finer-grained locking, klists will have to be overhauled to become an actual, protected, object instead of hanging along with the object they represent. More complication in the form of read-locking (or holds) will also be necessary for that. For now, though, they belong to _kqueue_, and not struct selinfo or the object that contains them normally. A more complicated scheme for kqueues, knotes, klist(-alike)s would benefit greatly from divorce of kqueue from struct filedesc, if anyone is ever to tackle that (if there is ever a return on investment; MUTEX_PROFILING should help us figure that out.) The only known issue on my SMP box is this warning at boot time; I am unable to track down why witness(4) believes there to be a lock order reversal, but I welcome someone else to help identify this one: lock order reversal 1st 0xc0661500 global kqueue lock (global kqueue lock) @ kern/kern_event.c:413 2nd 0xc45bc438 filedesc structure (filedesc structure) @ kern/kern_event.c:422 Stack backtrace: backtrace(c0616b80,c45bc438,c060fd6e,c060fd6e,c0610941) at backtrace+0x17 witness_checkorder(c45bc438,9,c0610941,1a6,c455d594) at witness_checkorder+0x6f6 _mtx_lock_flags(c45bc438,0,c0610938,1a6,c455d594) at _mtx_lock_flags+0x9a kqueue(c45b73f0,db138d14,c19970ec,8133000,0) at kqueue+0x120 syscall(2f,2f,2f,8133000,2d) at syscall+0x272 Xint0x80_syscall() at Xint0x80_syscall+0x1f --- syscall (362), eip = 0x282a7b8f, esp = 0xbfbfbcac, ebp = 0xbfbfc368 --- The major stress testers I know about are using make -j with -DUSE_KQUEUE compilation flags on make(1) and src/tools/regression/gaithstress. I notice no problems, but I haven't also tested to make sure nested kqueues are doing what is expected. The only missing feature is the ill-conceived NOTE_TRACK. I do not think that returning for that one note type EINVAL is a deal-breaker when trying to make kqueue not suck for 5.3; it is far more useful as a novelty than utility, especially considering I've never seen mention of it in actual software that uses kqueue. -- Brian Fundakowski Feldman
From: Brian Fundakowski Feldman To: freebsd-arch Subject: no more WITNESS errors (was: stable kqueue locking up and running) Date: 2004-04-21 18:43:08 I know everyone's tired of seeing all the e-mail on the subject, so I'll keep it brief ;). The kqueue fields that just happened to be stored in the struct filedesc have always been owned by kqueue, and now they are locked by kqueue, not filedesc, to become more semantically correct. Just like all of the rest of the things (klist, knote, kqueue, fdp->fd_kn*), the lock right now is just a shared, global one, and should remain so unless profiling proves it unnecessary. I would like to encourage more widespread testing. The unused and nearly-unimplementable-EVFILT_PROC+NOTE_TRACK is gone, but the ability to nest kqueues remains (and will certainly impart difficulties if locking is extended to separate the individual klists). I don't think there are any remaining issues for kqueue unless I've missed replacing some of the SLIST_INSERT_HEAD()/SLIST_REMOVE() calls with KLIST_INSERT()/KLIST_REMOVE(). "Final" patch, ripe for testing/MUTEX_PROFILING, at: -- Brian Fundakowski Feldman
DFBSD
Didn't Dragonfly BSD say that this was going to be one of the problems when migrating to finely grained locks? Just curious.
It is interesting that such a problem has come without any earlier transition away from large lock requirements. (I do not follow any BSD mailing lists so I may be totally wrong on this.)
Re: DFBSD
I believe this *was* anticipated as being a major challenge in earlier discussions. I assume many folks had this in mind - many developers have been aware of the recursion issues and kqueue's broken behavior, but other work took precedence and the solution isn't obvious. Like many things in BSD, once an issue is targeted in the release engineering process, it gets solved with careful planning (and a little knudging, in this case ;).
Linux?
"Since its introduction in 2002, kqueue has been ported to Linux, OpenBSD, NetBSD [story], and MacOSX (Darwin)."
Does Linux also have the same problems?
Re: Linux?
It appears as though the ports of kqueue to Linux (a direct port and a kqueue-style facility called sysevq) have largely gone ignored, so chances are that it hasn't experienced adequate stress on this platform to say one way or the other. A blessed adoption of kqueue on Linux seems to be a sticking point between the two communities.
Re: Linux?
Apparently the Linux solution does not suffer the problems found in FreeBSD's kqueues. The Linux camp has implemented sys_epoll to solve the problems with select and poll.
In the mailinglist references associated with this article some FreeBSD developers claim that FreeBSD 5.x might one day scale to up
8 CPU's (with the scalability limitations having nothing to do with kqueues).
So, one single kqueue subsystem lock is believed to be "good enough" in FreeBSD 5.x. These same references point out that the solution proposed
in this article have architectual issues which must be resolved before
the proposed solution can loose the subsystem lock.
Global and subsystem locks have been pretty much removed from Linux nowadays so where scalability is concerned Linux is in a different league (scaling up to 512 CPU's in some single imageservers available from SGI).
>where scalability is concern
>where scalability is concerned Linux is in a different league
>(scaling up to 512 CPU's in some single imageservers available
>from SGI).
That's not your stock linux kernel.
>where scalability is concern
> That's not your stock linux kernel.
True, but your Redhat, Suse, Debian etc kernels are also patched
in some way. I know of no distribution running an unpatched kernel.
The point is that the patches make it into the mainline kernel when there is concencus that they are of benefit in a broad scope.
Additionally IBM guys regularly test the stock 2.6.x kernel on 32 CPU boxes so
these kernels perform well in systems with this number of CPU's.
Anyway, to get back to the point of this thread; I think it is fair to say
that the Linux alternative to kqueues does not suffer the issues found
in FreeBSD kqueues. For if I am to believe what I have read on the FreeBSD lists
kqueues don't even work reliably on 2 CPU systems.
kernel
They use 2.4 based kernels, however as far as I know all kernel modifications are opensource licensed and many are backports of stuff from 2.6, like the scheduler for example.
Many core subsystems of 2.6 like the MM, VFS, block device layer, process/thread handling, are much more scalable than 2.4. It will be interesting to see what 2.6 will do after SGI tunes it for their large systems (which they are in the process of doing now).
Apparently one last big problem in 2.6 was the block and scsi layers performing worse than SGI's XSCSI stuff, but not long ago the last global lock in the block layer was removed. 2.6's block device layer now performs well on these large systems too.
I think they are aiming for 1024 CPUs.
where scalability is concern
SGI has both heavily patched 2.4 and very lightly patched 2.6 tree (mostly adding features SGI wants like new device drivers, but only
verty little and non-intrusive core changes, most of which are merged
in later versions) running on machines with 512cpus.