Kqueue(2), the scalable facility for event notification in the FreeBSD kernel, has experienced various performance and stability issues over the last several months. This has largely been fallout from the ongoing introduction of fine-grained locking into kernel subsystems and kqueue's reliance on GIANT, further complicated by an abundance of entangled callback routines.
There has recently been some discussion on the freebsd-arch list concerning kqueue’s shortcomings and potential areas for redesign. In mid-April, Brian Fundakowski Feldman submitted controversial proof-of-concept code that relieves kqueue of some of its less utilized features and adds an internal global lock into the kqueue subsystem. He has reported positive results:
"I believe I have come up with a good solution to the kqueue woes in 5.X, and I'd like to get some feedback on work that so far is letting me (on uniprocessor, at least) run make -j8 buildworld, with USE_KQUEUE in make(1), with no ill effect :) The locking thus far is one global kqueue lock, and I firmly believe we should use MUTEX_PROFILING to determine if we should lock it down any further at this point."
For a brief introduction to kqueue, its present issues, and details on testing this code, please read on.
Written by Jonathan Lemon and introduced in FreeBSD-5.0 and FreeBSD-4.1, kqueue addresses performance limitations inherent in select() and poll(), the traditional mechanisms for Unix process event notification. Kqueue offers significant improvements over these methods by providing linear response time as the number of file descriptors increase as well as generalizing notification activities to handle socket events, vnode operations, file system events, signal delivery and process creation/exit. Since its introduction in 2002, kqueue has been ported to Linux, OpenBSD, NetBSD [story], and MacOSX (Darwin).
The largest issue surrounding FreeBSD's kqueue implementation is that it's not multiprocessor-safe, a problem that was brought to light in 5.1-current and has since been identified as a major focus area for testing in the 5.3R todo-list. As described by Stefan Farfeleder (who has recently been granted commit bits) in Problem Report kern/57945:
"The current kqueue implementation does not seem to be MP-safe. The kqueue facility does not having its own locks, I believe it was intended to rely on Giant. As Giant gets locked down, kqueue was apparently forgotten. The KNOTE() macro is spread over the whole kernel, sometimes called with Giant hold, sometimes not. The function knote_enqueue() (called via knote() and KNOTE_ACTIVATE()) inserts a node into a TAILQ, this operation isn't atomic. If another processor executes KNOTE() or kevent() at the same time, the queue might get corrupted."
A very educational freebsd-arch discussion in mid-April concerning mutex lock recursion provides a sufficient outline of the complexities of kqueue's recursion. A followup thread from Brian Fundakowski Feldman highlights his solution which he then elaborates on through the process of several addendums. Due to the subject matter being spread among multiple threads, below are Brian Feldman's successive announcements. Links to full threads are provided in the reference section at the end of this article, along with Jonathan Lemon's original whitepaper on kqueue.
From: Brian Fundakowski Feldman [email blocked] To: freebsd-arch Subject: kqueue giant-locking (&kq_Giant, locking) Date: 2004-04-17 2:12:42 I believe I have come up with a good solution to the kqueue woes in 5.X, and I'd like to get some feedback on work that so far is letting me (on uniprocessor, at least) run make -j8 buildworld, with USE_KQUEUE in make(1), with no ill effect :) The locking thus far is one global kqueue lock, and I firmly believe we should use MUTEX_PROFILING to determine if we should lock it down any further at this point. There are several major differences so far (of course, fixing that stack-paged-out-kernel-crash-bug is one of them) and several major things still to be fixed. 1. The recursion has been removed from kqueue. This means kqueues cannot be added to other kqueues for EVFILT_READ -- yes, that ability has been around since r1.1 of kern_event.c, but it is utterly pointless and if you take a look at my previous patch, severely complicates many things. Of course, I'm sure someone will notice and complain, but there isn't any documentation that suggests you should kevent() another kqueue(). 2. Because of this, KNOTE() can't end up calling another KNOTE() unless the consumer does something stupid (call KNOTE() from filter::event()). 3. Kqueue does the locking for you when it comes to the non-object lists. All of the filter::attach() and filter::detach() routines need to lock their object lists, but they don't touch kqueue or knote other than setting their own knote's fields. Both of those routines are called without any locks held on kqueue's part. 4. The filter::event() routines are called with internal kqueue locking held. You can lock anything else you need to, but you may not sleep; it is essentially like an interrupt handler. You must not call into KNOTE() with locks held, but you should reference your object. I've fixed what appears to be the most egregious offender, sys_pipe.c 5. If KNOTE() as an interrupt does not work for you, you may call KNOTE() with any locks you like except the ones it uses internally (mainly filedesc and file), but the only information you can give your filter::event() is the hint argument. Examples of #4 are bpf and pipe; they do not need to pass any information in the filter::event() hint, and as every handler that works on the object instead of on hints needs to do, they verify for certain whether or not the KNOTE() should have actually fired and ignore falses. The biggest example of #5 is process events. There are many different process-type locks that may be held when KNOTE() is called, but the implementation of filter::event() is mostly correct in locking nothing. In kern_fork.c, KNOTE() is called outside of the proc lock (p1->p_klist not locked as it should be) because it has to be special-cased somehow. This is the most disgusting thing EVAR. (NB: See http://green.homeunix.org/~green/kqueue-locking.1.patch for that.) Current patch at: http://green.homeunix.org/~green/kqueue-giant-locking.0.patch -- Brian Fundakowski Feldman
From: Brian Fundakowski Feldman [email blocked] To: freebsd-arch Subject: stable kqueue locking up and running on SMP Date: 2004-04-21 4:39:52 Please test out and provide feedback for the latest iteration: Things are now working in exactly the way I described I was going to implement them -- there is one kqueue lock, and if, as I suspect, it will _not_ be a point of contention, there is no reason at all to complicate things further. To do finer-grained locking, klists will have to be overhauled to become an actual, protected, object instead of hanging along with the object they represent. More complication in the form of read-locking (or holds) will also be necessary for that. For now, though, they belong to _kqueue_, and not struct selinfo or the object that contains them normally. A more complicated scheme for kqueues, knotes, klist(-alike)s would benefit greatly from divorce of kqueue from struct filedesc, if anyone is ever to tackle that (if there is ever a return on investment; MUTEX_PROFILING should help us figure that out.) The only known issue on my SMP box is this warning at boot time; I am unable to track down why witness(4) believes there to be a lock order reversal, but I welcome someone else to help identify this one: lock order reversal 1st 0xc0661500 global kqueue lock (global kqueue lock) @ kern/kern_event.c:413 2nd 0xc45bc438 filedesc structure (filedesc structure) @ kern/kern_event.c:422 Stack backtrace: backtrace(c0616b80,c45bc438,c060fd6e,c060fd6e,c0610941) at backtrace+0x17 witness_checkorder(c45bc438,9,c0610941,1a6,c455d594) at witness_checkorder+0x6f6 _mtx_lock_flags(c45bc438,0,c0610938,1a6,c455d594) at _mtx_lock_flags+0x9a kqueue(c45b73f0,db138d14,c19970ec,8133000,0) at kqueue+0x120 syscall(2f,2f,2f,8133000,2d) at syscall+0x272 Xint0x80_syscall() at Xint0x80_syscall+0x1f --- syscall (362), eip = 0x282a7b8f, esp = 0xbfbfbcac, ebp = 0xbfbfc368 --- The major stress testers I know about are using make -j with -DUSE_KQUEUE compilation flags on make(1) and src/tools/regression/gaithstress. I notice no problems, but I haven't also tested to make sure nested kqueues are doing what is expected. The only missing feature is the ill-conceived NOTE_TRACK. I do not think that returning for that one note type EINVAL is a deal-breaker when trying to make kqueue not suck for 5.3; it is far more useful as a novelty than utility, especially considering I've never seen mention of it in actual software that uses kqueue. -- Brian Fundakowski Feldman
From: Brian Fundakowski Feldman To: freebsd-arch Subject: no more WITNESS errors (was: stable kqueue locking up and running) Date: 2004-04-21 18:43:08 I know everyone's tired of seeing all the e-mail on the subject, so I'll keep it brief ;). The kqueue fields that just happened to be stored in the struct filedesc have always been owned by kqueue, and now they are locked by kqueue, not filedesc, to become more semantically correct. Just like all of the rest of the things (klist, knote, kqueue, fdp->fd_kn*), the lock right now is just a shared, global one, and should remain so unless profiling proves it unnecessary. I would like to encourage more widespread testing. The unused and nearly-unimplementable-EVFILT_PROC+NOTE_TRACK is gone, but the ability to nest kqueues remains (and will certainly impart difficulties if locking is extended to separate the individual klists). I don't think there are any remaining issues for kqueue unless I've missed replacing some of the SLIST_INSERT_HEAD()/SLIST_REMOVE() calls with KLIST_INSERT()/KLIST_REMOVE(). "Final" patch, ripe for testing/MUTEX_PROFILING, at: -- Brian Fundakowski Feldman