Re: Interrupt, interrupt threads, continuations, and kernel lwps

To: <tech-kern@...>
Date: Wednesday, February 21, 2007 - 4:08 am

After a great deal of pondering, I've concluded that interrupt threads are
an extremely bad idea.

I think that hard interrupts should simply invoke the handler (at the
appropriate IPL).  The handler will disable the cause of the
interrupt; optionally it may use a spin mutex to gain control over a
shared section, do something to the device, and release the mutex, or
it may just schedule a continuation to run (either via software
interrupt or via a kernel lwp workqueue; the former can't sleep/block,
the latter can).

While IPL will need to continue to exist for talking to hardware, I  
think software priority levels will eventually disappear as they  
currently exist.  They will transition to actual real-time priorities  
used by the scheduler to run kernel lwps (< SPL_SCHED) or become
mutexes (>= SPL_SCHED).
To: Matt Thomas <matt@...>
Cc: <tech-kern@...>
Date: Wednesday, February 21, 2007 - 2:01 pm

Matt and I discussed this at length over lunch yesterday, and I am in  
full agreement with him on this.

The way I see it, interrupt threads are an attempt at solving the  
synchronization problem from the wrong direction.  Instead, I think  
the right approach is to fully split every driver into a "top  
half" (runs with thread context, possibly as a high-priority kernel  
thread) and "bottom half" (runs in interrupt context).

Furthermore, I believe that the bottom half should manipulate state  
that is local only to the instance of the driver associated with the  
device that is interrupting (and have that state be spin-mutex  
protected).
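
A minimal user-space sketch of that rule, under stated assumptions: `foo_softc`, `foo_intr` and `foo_top_half` are hypothetical names, and a C11 `atomic_flag` stands in for a kernel spin mutex. A real spin mutex would also block the interrupt on the local CPU while held, which this sketch does not model.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-device state; the bottom half touches only this. */
struct foo_softc {
    atomic_flag lock;        /* stand-in for a spin mutex */
    unsigned    pending;     /* events recorded by the bottom half */
    bool        intr_masked; /* is the cause of the interrupt disabled? */
};

static void foo_init(struct foo_softc *sc)
{
    atomic_flag_clear(&sc->lock);
    sc->pending = 0;
    sc->intr_masked = false;
}

static void spin_enter(atomic_flag *l)
{
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ;  /* busy-wait: never sleeps */
}

static void spin_exit(atomic_flag *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}

/* Bottom half: runs in (simulated) interrupt context. */
static void foo_intr(struct foo_softc *sc)
{
    spin_enter(&sc->lock);
    sc->intr_masked = true;  /* quiet the device */
    sc->pending++;           /* record work for the top half */
    spin_exit(&sc->lock);
}

/* Top half: runs with thread context and collects the recorded work. */
static unsigned foo_top_half(struct foo_softc *sc)
{
    unsigned n;

    spin_enter(&sc->lock);
    n = sc->pending;
    sc->pending = 0;
    sc->intr_masked = false; /* re-enable the interrupt source */
    spin_exit(&sc->lock);
    return n;
}
```

The point of the design is that nothing outside `struct foo_softc` is shared between the two halves, so the lock is held only for a few stores.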

This approach would have a few advantages:

1- Simplicity.  Interrupt dispatch would be largely as it is today.

2- Speed.  Because the low-level interrupt dispatch code could be  
simple and avoid magic, it could be very fast, which would help  
devices that are particularly sensitive to interrupt latency.

3- Portability.  Because the low-level interrupt dispatch would work  
basically like it does today, we know it will work on all of our  
extant platforms.  I think there are some particularly nasty "gotchas"  
with interrupt-dispatch-as-continuation-or-thread that can be hard to  
fix on platforms like VAX (especially) or even SPARC and m68k.

4- Consistency.  We already have this sort of model "sort of" today;  
consider serial drivers that have hard-interrupt handlers that run at  
extremely high priority to read data into a local ring buffer and then  
schedule a soft interrupt to do the TTY processing.  What I'm looking  
for here is to formalize this for EVERY kind of device.  This will  
make it easier to write drivers for NetBSD.  We could even provide  
some API to help people write drivers that conform to the model.
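
The serial-driver pattern in point 4 can be sketched in user space as follows. This is an illustration only: `com_ring`, `com_hardintr` and `com_softintr` are hypothetical names, the ring size is arbitrary, and the "soft interrupt" is just a flag that the caller polls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 16  /* hypothetical; must be a power of two */

struct com_ring {
    unsigned char buf[RING_SIZE];
    unsigned head;            /* next slot the hard handler writes */
    unsigned tail;            /* next slot the soft handler reads */
    bool     softint_pending; /* deferred processing requested? */
    unsigned overruns;        /* bytes dropped because the ring was full */
};

/* Hard interrupt: store the byte, request deferred processing, return. */
static void com_hardintr(struct com_ring *r, unsigned char c)
{
    if (r->head - r->tail == RING_SIZE) {
        r->overruns++;        /* ring full: drop the byte */
        return;
    }
    r->buf[r->head++ % RING_SIZE] = c;
    r->softint_pending = true;
}

/* Soft interrupt: drain the ring into out[], standing in for TTY input. */
static size_t com_softintr(struct com_ring *r, unsigned char *out, size_t max)
{
    size_t n = 0;

    while (r->tail != r->head && n < max)
        out[n++] = r->buf[r->tail++ % RING_SIZE];
    /* Stay pending if the caller's buffer filled before the ring emptied. */
    r->softint_pending = (r->tail != r->head);
    return n;
}
```

The hard handler does a bounded, tiny amount of work; everything that might take long is left to the deferred half.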

-- thorpej
To: Jason Thorpe <thorpej@...>
Cc: Matt Thomas <matt@...>, <tech-kern@...>
Date: Wednesday, February 21, 2007 - 6:51 pm

Hi Jason,


Interrupt dispatch will be simpler yes. Device drivers would become more

If you have a device that's particularly sensitive to interrupt latency,
then the two level model is a good one, I agree completely. There is
nothing to prevent that model from being used where it's a good fit.

For the general case, I don't agree though. On recent processors with long
instruction pipelines, serializing control operations are really expensive.
x86 chips from 5 years ago are faster in this regard than the current Intel
offerings - in real time, not clock cycles. I'm keen to avoid that kind of
"funneling" of work unless it's really necessary - meaning, its existence
actually has a justifiable benefit. So I have been trying to eliminate these
kinds of operations where they are unnecessary, e.g. during lock release,
during splx(), during syscalls, in the locking scheme devised for the
scheduler and so on.

What I want to do will, on x86, add 29 arithmetic instructions to the
interrupt path (as of now). That means in the common, non-blocking case,
it's ~free. I'd be pretty bummed if we undo some of that and the reasoning


Vax and m68k we can deal with. The interrupt LWP scheme probably doesn't

Mmm.. No sale on the consistency ticket, sorry. :-)

Cheers,
Andrew
To: Andrew Doran <ad@...>
Cc: Jason Thorpe <thorpej@...>, Matt Thomas <matt@...>, <tech-kern@...>
Date: Wednesday, February 21, 2007 - 7:42 pm

Ok, that sounds good. How do we make this work on SPARC, m68k, and VAX?
Right now, interrupts-as-threads sounds like a total loss for them. If we
can turn it into something that's "about as good" then I think you will
just get a green light. :-)

Take care,

Bill
To: Andrew Doran <ad@...>, Bill Studenmund <wrstuden@...>
Cc: Jason Thorpe <thorpej@...>, Matt Thomas <matt@...>, <tech-kern@...>
Date: Wednesday, February 21, 2007 - 9:57 pm

In message <20070221234224.GF26468@netbsd.org>,

Andrew, just what will trigger the blocking (expensive) case?
And how  expensive is it?

Can I offer the following example? Suppose for the sake of discussion,
that I'm running IPsec traffic with FAST_IPSEC, and handling a modest
50,000 packets per second.  Suppose further that there are also always
userland processes ready to run.

My first guess (perhaps overly naive) on how thread-per-interrupt
would work is something like this:

	(...new IPsec'd packet comes in, asserts NIC interrupt)
1. switch from user to NIC interrupt thread
	(...  NIC calls ether_input which demuxes packet, enqueues on
	      protocol input routine, requests softint processing, blocks)
2. switch from NIC hardware thread to softint thread
	(... OCF submits job, blocks ...)
3. switch from softint to user
	(... crypto hardware finishes, requests interrupt...)
4. switch from user to crypto-interrupt thread
	(... crypto driver calls OCF which wakes up softint processing...)
5. switch to softint thread, process cleartext packet
	(... done with local kernel packet processing, softint thread ...)
6. switch back to user.

[[Let's ignore, for now, whether there'd be a single softint thread,
or a separate thread for softnet, or multiple softnet threads.]]

Is the above roughly right? If it is, then (to steal a comment I made
to Sam Leffler early in FAST_IPSEC development): 300,000 context
switches/sec are _not_ your friend.  And even if the NIC->softint and
crypto->softint are a same-VM-context switch (and thus cheap, or at
least optimizable as kthread-to-kthread), the cost of rescheduling is
high relative to the work done per packet.  And remember, the cost is
not just instructions, but the additional cache-footprint (and thus
evictions) of all these switches.  Ditto for TLB state; doubly so for
arches with software-filled TLBs.
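
The 300,000 figure follows from the six numbered switches per packet at 50,000 packets per second. A small sketch of that arithmetic; the function name is hypothetical, and the third parameter is an added assumption that models interrupt coalescing (1 means one packet per interrupt, as in the scenario above).

```c
#include <assert.h>

/* Context switches per second for a packet path.
 * packets_per_interrupt models coalescing; 1 means none. */
static unsigned long switches_per_sec(unsigned long packets_per_sec,
                                      unsigned long switches_per_interrupt,
                                      unsigned long packets_per_interrupt)
{
    assert(packets_per_interrupt > 0);
    return packets_per_sec * switches_per_interrupt / packets_per_interrupt;
}
```

With the numbers from the example, `switches_per_sec(50000, 6, 1)` gives 300000.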


Bill, you mean "about as good as NetBSD is now", right?
To: <jonathan@...>
Cc: Bill Studenmund <wrstuden@...>, Jason Thorpe <thorpej@...>, Matt Thomas <matt@...>, <tech-kern@...>
Date: Thursday, February 22, 2007 - 12:16 pm

Hi Jonathan,


Yes - you're describing roughly how packet processing works now. My changes
make this neither worse nor better.

Andrew
To: Andrew Doran <ad@...>
Cc: <tech-kern@...>
Date: Thursday, February 22, 2007 - 1:13 pm

Hi Andrew,

I truly don't know what to say to that. 
Here's my example case again:

        (...new IPsec'd packet comes in, asserts NIC interrupt)
1. switch from user to NIC interrupt thread
        (...  NIC calls ether_input which demuxes packet, enqueues on
              protocol input routine, requests softint processing, blocks)
2. switch from NIC hardware thread to softint thread
        (... OCF submits job, blocks ...)
3. switch from softint to user  
        (... crypto hardware finishes, requests interrupt...)
4.  switch from user to crypto-interrupt thread
        (... crypto driver calls OCF which wakes up softint processing...)
5. switch to softint thread, process cleartext packet
        (... done with local  kernel packet processing, softint thread  ...)
6. switch back to user. 

And here's an equivalent monolithic-kernel+biglock scenario,
recognizable from 4.3BSD(-Tahoe) to NetBSD-3:

        (...new IPsec'd packet comes in, asserts NIC interrupt)
1. kernel takes interrupt, calls into NIC device interrupt handler
   in currently active context.  Note no context switch.  [1]
        (...  NIC calls ether_input which demuxes packet, enqueues on
              protocol input routine, requests softint processing, returns)

2. After return from hardware interrupt handler, but before returning
   to the pre-interrupt state, the kernel checks for pending software
   interrupts.  Here, we run softints (assuming they weren't active
   at the time we took the interrupt).
   
        (... IP calls to FAST_IPSEC, to OCF, OCF submits job, returns ...)

3. continue returning from   softint to user.  Note no context switch.
        (... crypto hardware finishes, requests interrupt...)

4.  Kernel takes hardware interrupt. Note, no context switch.
    (... crypto driver calls OCF,  which calls FAST_IPSEC 
    continuation, which requests further softint processing via
    schednetisr() ...)

5.  On return from hardware interrupt, the kernel checks for
    pending sof...
To: <tech-kern@...>
Date: Thursday, February 22, 2007 - 5:59 pm

http://www.tancsa.com/blast.html
might give some insight. Especially interesting is the comparison of
FreeBSD 4 and DragonFly, as the latter uses an interrupt thread model as
well.

Joerg
To: <jonathan@...>
Cc: <tech-kern@...>
Date: Thursday, February 22, 2007 - 5:24 pm

That drop in performance is what I want to avoid. I said this can just be
dropped in. What that means is until you start making use of locking
primitives in your interrupt handlers, there are no additional context
switches. And when you do, you had better be benchmarking it and using
vmstat, lockstat, etc. to make sure that the assumptions you have made
are good ones.

Andrew
To: <jonathan@...>
Cc: Andrew Doran <ad@...>, <tech-kern@...>
Date: Thursday, February 22, 2007 - 1:46 pm

I think the problem is you've assumed an implementation, and specifically
you've assumed one other than what Andy was suggesting.

My understanding is that Andy has figured out a way to have, at least on
x86, the interrupt handler borrow the context of the interrupted thread.
So the interrupt context switch is also the context switch to the thread.

Let's try it and see! Rather than discuss hypotheticals, let's get #s.
Either the numbers will show little change, which will make ALL of us
happy, or the numbers will show a marked decrease, which will displease
ALL of us. :-)

Take care,

Bill
To: Bill Studenmund <wrstuden@...>
Cc: Andrew Doran <ad@...>, <tech-kern@...>
Date: Thursday, February 22, 2007 - 8:49 pm

In message <20070222174642.GB24922@netbsd.org>, Bill Studenmund writes:


I came to that conclusion after re-reading Andrew's reply.  I think
the problem is terminology.  I'm not used to seeing taking an
interrupt called a "context switch" instead of, well, taking an

That's an unfortunate choice of terminology. It's inviting confusion
to say that a new approach does just what we did before, when it

Numbers would be interesting.  But what kind of numbers?  Single-CPU?
Multi-CPU?  Or CPUs with (gack) virtually-indexed caches and TLBs
without address-space-IDs?  I don't see how Andrew's scheme would
really work there, but maybe I'm missing something.

The other issue I see is that Andrew is assuming that blocking (in his
sense, meaning doing the full work of the deferred context switch) is
a rare case.  I don't buy that, not for networking.  Look at the
resources put into making FreeBSD a fine-grained kernel, and notice
that FreeBSD-6 still has one big lock around the network stack and
socket-layer code.
To: <jonathan@...>
Cc: Bill Studenmund <wrstuden@...>, Andrew Doran <ad@...>, <tech-kern@...>
Date: Friday, February 23, 2007 - 12:27 am

Well, it is. :-) It is one form of context switch, and all of the

True.

For the common case, we will have almost the same behavior as now, except
that we can take mutexes in an interrupt handler.

As I understand it, the difficulty comes in when the mutex is held by a
thread that is not running. In that case, the interrupt handler blocks,
and the interrupt has to be disabled/ignored until serviced. For systems
with a PIC, we disable the interrupt and cope.
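
A toy model of "disable the interrupt and cope" on a system with a PIC. This is a sketch under assumptions, not any real controller's interface: `struct pic` and the three functions are hypothetical, and lines are numbered 0 to 31.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy interrupt controller with up to 32 lines. */
struct pic {
    uint32_t masked;   /* lines currently disabled */
    uint32_t pending;  /* lines asserted while masked */
};

/* Device asserts a line; returns true if the CPU should take it now. */
static bool pic_assert_line(struct pic *p, unsigned line)
{
    uint32_t bit = UINT32_C(1) << line;

    if (p->masked & bit) {
        p->pending |= bit;  /* held until the line is unmasked */
        return false;
    }
    return true;
}

/* The handler blocked on a mutex: silence its line until it is serviced. */
static void pic_mask(struct pic *p, unsigned line)
{
    p->masked |= UINT32_C(1) << line;
}

/* The handler finished; returns true if a held interrupt must be replayed. */
static bool pic_unmask(struct pic *p, unsigned line)
{
    uint32_t bit = UINT32_C(1) << line;
    bool replay = (p->pending & bit) != 0;

    p->masked &= ~bit;
    p->pending &= ~bit;
    return replay;
}
```

Other lines keep being delivered while one line is masked, which is what makes the blocked handler tolerable on this kind of hardware.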

For systems without a PIC, like many m68k systems (and VAX, as I understand
Matt), we would need to disable said interrupt in the CPU (set the chip's
interrupt level) until the service routine completes.

As noted, this will hopefully be a rare occurrence, where a thread that
holds an interrupt mutex is no longer on a processor. We can structure our
code so that this is very unlikely, if not impossible.

Note that in the case of the mutex being locked but the thread being on
another processor, we will just spin-wait. That's a feature of the locking
we copied from Solaris. I expect that this _will_ happen, but we then get

Single and multi-CPU to be sure. But as for the other stuff, it's mostly
what we have now. So I don't see how we will have radically different
results.

We should test stuff over time to make sure we don't do something stupid,

I don't think that will matter for interrupt handlers. Yes, I think there's
a lot of work to do for the networking stack. But I think it'll be
different work.

As above, trying to get the lock while a thread running on another CPU has
it turns into a spin wait. That's not the slow case. So as long as we
don't hold a mutex we take in interrupt context while we do something else
that can block, we will NOT trigger the slow path in an interrupt handler.
We don't sleep while holding SPL now (or we aren't supposed to), so...
To: <jonathan@...>, Andrew Doran <ad@...>, <tech-kern@...>
Date: Friday, February 23, 2007 - 5:18 pm

Why don't we disable context switches while a mutex (that may be acquired
by an ISR) is held?
After all, non-adaptive mutexes are not allowed to be held across functions
that can sleep.
This would mean that the only code that might have to be executed is
the kernel code of the interrupted process.

	David

-- 
David Laight: david@l8s.co.uk
To: <jonathan@...>
Cc: Andrew Doran <ad@...>, <tech-kern@...>
Date: Friday, February 23, 2007 - 12:37 am

I don't understand Andrew's design well enough to comment directly on
it, but "interrupt handler blocks" in interrupt context is something
that leads to much difficulty in OS design. 

On the other hand, the top/bottom model in which bottom halves run in
interrupt context with interrupts disabled but aren't allowed to block
and top halves run in kernel thread context with interrupts enabled
(suitably adjusted for interrupt priority levels on appropriate
architectures) has a long history, scales nicely, and is well
understood. It can also use lock-free communication between the top
and bottom halves when properly implemented.

It may be more work to get there, but it's an incremental change from
the current NetBSD model, and drivers could be converted as needed,
rather than trying to get everything right in one fell swoop.
To: Bucky Katz <bucky@...>
Cc: <jonathan@...>, Andrew Doran <ad@...>, <tech-kern@...>
Date: Friday, February 23, 2007 - 1:01 am

The design we're talking about is very strongly inspired by the Solaris
model.

Yes, I agree that an interrupt handler blocking is a bad thing. We should

Given that the desire is to run the interrupt "thread" by borrowing the
context of the interrupted thread, we will have mostly the same thing as

The paragraph above actually is one selling point of what Andy's suggested
so far. :-) For architectures (platforms) with a PIC, my understanding is
the x86 change should be a model that can be cookie-cuttered around. There
are issues with modal architectures (ones with a separate interrupt
stack), so we aren't done yet. But we're getting there.

A big part of the win is that we can then make interrupt handlers not need
biglock. So we can then start fine-graining the kernel.

Oh, I like the idea of keeping the top/bottom split. I don't want to be
doing tons of work in an interrupt thread!

Take care,

Bill
To: <tech-kern@...>
Date: Friday, February 23, 2007 - 2:33 am

As I understand it, that model is highly optimized for bus-based
interrupt controllers and not particularly suitable for scaling to

It's the part that varies from "mostly" that concerns me. If it turns
out that the assumption of very low lock contention is wrong, the
cascading and convoying effect that results from contention will be

I think, as I said above, there are also issues with non-bus-like

A wise man once told me that "if you're doing more work in interrupt
context than you can comfortably code in assembler in an afternoon,
you're doing too much work in interrupt context." ;)
To: Bucky Katz <bucky@...>
Cc: <tech-kern@...>
Date: Friday, February 23, 2007 - 5:09 pm

However, you are going to have to execute the CPU cycles at some point;
if the interrupt isn't going to schedule a process then you probably
want to avoid the cost of scheduling the deferred code.

Remembers the ISR code to drive a stepper motor graph plotter under RSX/11M
(which expected the h/w ISR to just remove the IRQ and leave the rest of
the code to some os work queue...)

	David

-- 
David Laight: david@l8s.co.uk
To: Bucky Katz <bucky@...>
Cc: <tech-kern@...>
Date: Friday, February 23, 2007 - 5:43 pm

Not all of them. If you do a top/bottom design with lockless
synchronization then you don't ever have to yield in interrupt context

Great. I've been suppressing RSX memories for decades. Thanks for the
image ;)
To: Bucky Katz <bucky@...>
Cc: <tech-kern@...>
Date: Friday, February 23, 2007 - 6:23 pm

Coming up with good, machine-independent (MI) atomic primitives for
non-blocking synchronization is a real challenge. I recall debugging
a multiple-68040-based SMP machine, where we used the '040s to get
CAS2.  

I also invented LLP/SCP for MIPS (LL/LLP/SCP/SC lets you synthesize
CAS2 or atomic-double-queue operations), though SGI (then-owner of
MIPS) never quite got the r20k out :-/.

Point being, designing NBS algorithms for an MI OS kernel means we
need to choose MI atomic operations, and that's difficult.
(Bershad's Restartable Atomic Sequences (RAS) are sometimes offered
here, but that's a feature for userspace code, not kernel-space code.)

NBS is also a challenge for many otherwise-accomplished kernel
hackers (a problem shared by continuation-passing style).  Though NBS,
combined with type-stable memory, or at least a type-stable prefix
in every object which is subject to NBS, can go a long way.

(I've seen comments in FreeBSD-6 headers which appear to do that,

TKB to you, too.
To: <tech-kern@...>
Date: Friday, February 23, 2007 - 8:00 pm

This is true. There's always Lamport's algorithm, but that's hardly

You can do it without atomic operations, but it requires being able to
code some fairly sophisticated non-atomic stuff based on Lamport's

Yes.

The trick is to use non-atomic but fully ordered operations[1] in a
handshaking protocol. If done right, you make it a set of macros that
only have to be implemented once for each architecture and then you
rely on developers using those for communicating between the halves.


[1] "fully ordered" in the weak sense that if processor A writes two
memory locations, first X and then Y, then processor B never sees the
new value of Y before it sees the new value of X.  You can do it with
weaker requirements, but the algorithms get more complicated.
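
One way to picture the ordering in footnote [1] is a one-slot mailbox between the halves: the payload plays the role of X and the flag the role of Y. This is a sketch, not Lamport's algorithm and not the macros described above. It assumes C11 release/acquire loads and stores to obtain the ordering, uses no compare-and-swap or other read-modify-write operation, and assumes exactly one producer and one consumer; all names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* One-slot mailbox from bottom half (producer) to top half (consumer). */
struct mailbox {
    int         payload;  /* "X": written first */
    atomic_bool full;     /* "Y": written second */
};

/* Bottom half: returns false if the previous message is still unread. */
static bool mbox_post(struct mailbox *m, int value)
{
    if (atomic_load_explicit(&m->full, memory_order_acquire))
        return false;
    m->payload = value;                               /* write X ... */
    atomic_store_explicit(&m->full, true,
                          memory_order_release);      /* ... then Y */
    return true;
}

/* Top half: returns false if there is nothing to read. */
static bool mbox_take(struct mailbox *m, int *value)
{
    if (!atomic_load_explicit(&m->full, memory_order_acquire))
        return false;             /* new Y not seen, so do not touch X */
    *value = m->payload;
    atomic_store_explicit(&m->full, false, memory_order_release);
    return true;
}
```

Neither side ever waits for the other: a full or empty slot is reported to the caller, which is what lets the bottom half avoid yielding.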
To: <tech-kern@...>
Date: Thursday, February 22, 2007 - 9:30 pm

Is there, by any chance, a write up that describes the current
proposal?

A lot of the terminology here has been confusing me, but I thought
that was just because I've only recently come to this discussion. Now
I'm not so sure.
To: <jonathan@...>
Cc: Bill Studenmund <wrstuden@...>, Andrew Doran <ad@...>, <tech-kern@...>
Date: Thursday, February 22, 2007 - 9:11 pm

Since the kernel is always mapped into the same location in every VM  
context, it doesn't matter.  He's not talking about switching VM  
context.

-- thorpej
To: Bill Studenmund <wrstuden@...>
Cc: <jonathan@...>, <tech-kern@...>
Date: Thursday, February 22, 2007 - 4:54 pm

Exactly. We already context switch for interrupts, but it is not the same as
mi_switch. What I want to do is give the interrupt handler enough context
(curlwp, stack) that it can block briefly and be restarted later. There are
two outcomes: the interrupt runs to completion, or the handler blocks. In
both of those cases, we return back to the interrupted LWP just as we do
now.
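
A sketch of the bookkeeping this implies for the non-blocking path. It is only an illustration: the `interrupted` field and the function names are hypothetical, the stack switch is not modelled, and the blocking path is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

struct lwp {
    const char *name;
    struct lwp *interrupted;  /* LWP the CPU was borrowed from, if any */
};

struct cpu {
    struct lwp *curlwp;
};

/* Entry: point curlwp at the interrupt LWP and remember who was running.
 * A real implementation would also switch stacks here. */
static void intr_enter(struct cpu *ci, struct lwp *ilwp)
{
    ilwp->interrupted = ci->curlwp;
    ci->curlwp = ilwp;
}

/* Exit: hand the CPU back to the interrupted LWP. */
static void intr_exit(struct cpu *ci)
{
    struct lwp *ilwp = ci->curlwp;

    assert(ilwp->interrupted != NULL);
    ci->curlwp = ilwp->interrupted;
    ilwp->interrupted = NULL;
}
```

No run queue or scheduler is involved on this path, which is the difference from dispatching the interrupt through mi_switch().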

The handlers would be permitted to block only in order to acquire a mutex or
RW lock. Calling cv_wait(), or lockmgr() or pool_get(, PR_WAITOK) etc. from
the handler's context would panic the machine.

When the lock a handler is waiting on is released, we end up in sleepq_wake.
The interrupt handler gets marked runnable and put onto a per-CPU run queue.
Just before returning, sleepq_wake notices that there is a high priority LWP
(above system or kernel priority) waiting to run and calls preempt. The
interrupt handler is the highest priority item in the run queue, so it gets
picked and put back on the CPU.

What Jonathan is describing is roughly how FreeBSD works, I think. When the
interrupt comes in, mi_switch() is called to dispatch it. The thread that
was running when the interrupt came in gets kicked off the CPU.

Andrew
To: <ad@...>
Cc: <wrstuden@...>, <jonathan@...>, <tech-kern@...>
Date: Friday, February 23, 2007 - 2:33 pm

In message: <20070222205430.GA25309@hairylemon.org>

At least until this morning... :-)

Filters were just committed to the tree that divide the ISR into two
parts.  A fast running part (the filter) that turns off the interrupt
source in the device or otherwise services the interrupt, and a
scheduled part (the ithread) that runs if the filter says to run it.
By default, ISRs with no filter always have their ithread run.  This
also eliminates the FAST_INTERRUPT special case because now all
filters are fast interrupt handlers (in the 5.x and later sense, not
in the older 4.x and prior sense).
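
The filter/ithread split described here can be sketched as a dispatch decision. The types, result codes and function names below are hypothetical and are not FreeBSD's actual definitions; the sketch only decides whether the scheduled part should run.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result codes; not FreeBSD's actual constants. */
enum filter_result { FILT_HANDLED, FILT_SCHEDULE_THREAD };

typedef enum filter_result (*filter_fn)(void *arg);
typedef void (*ithread_fn)(void *arg);

struct intr_handler {
    filter_fn  filter;   /* fast part; may be NULL */
    ithread_fn ithread;  /* scheduled part; may be NULL */
    void      *arg;
};

/* Returns true if the ithread needs to run for this interrupt. */
static bool intr_dispatch(const struct intr_handler *ih)
{
    if (ih->filter == NULL)
        return ih->ithread != NULL;  /* no filter: always run the ithread */
    return ih->filter(ih->arg) == FILT_SCHEDULE_THREAD
        && ih->ithread != NULL;
}

/* Example filters and a do-nothing ithread for illustration. */
static enum filter_result filt_done(void *arg)
{
    (void)arg;
    return FILT_HANDLED;          /* serviced entirely in the filter */
}

static enum filter_result filt_defer(void *arg)
{
    (void)arg;
    return FILT_SCHEDULE_THREAD;  /* source quieted; ithread finishes up */
}

static void noop_ithread(void *arg)
{
    (void)arg;
}
```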

Warner
To: Andrew Doran <ad@...>
Cc: Bill Studenmund <wrstuden@...>, <jonathan@...>, <tech-kern@...>
Date: Thursday, February 22, 2007 - 7:45 pm

This is how Solaris handles interrupts as threads: it just switches the
stack and sets curlwp to get the interrupt thread running, essentially
delaying parts of the context switch to the rare case when the interrupt
thread has to sleep.

A difference worth mentioning is that in Solaris there is one interrupt
thread for each IPL on each CPU, while in FreeBSD there is one for each
"interrupt" (like irq1 or irq3 on i386, which is not necessarily the same
as an IPL).

The overhead of going through mi_switch() for interrupt handling is
probably quite high, so it will be a lot better to use this "lazy context
switching" interrupt approach. In the normal case, where the interrupt
does not have to block, the overhead is quite small (judging by the 29
instructions mentioned and the size of the ISR), and this is very similar
to conventional interrupt handling.

I don't agree with "In both of those cases, we return back to the
interrupted LWP just as we do now", as I think this is only true for the
non-blocking case. In the blocking case, after the interrupt thread has
completed its work, a new runnable thread will be chosen from the run
queues. Or do you have a subtle difference in mind?

--

Best regards,
Lars Heidieker

lars@heidieker.de
http://paradoxon.info

------------------------------------

Mystical explanations.
Mystical explanations are considered deep;
the truth is that they are not even superficial.
      -- Friedrich Nietzsche



To: <jonathan@...>
Cc: Andrew Doran <ad@...>, Bill Studenmund <wrstuden@...>, Jason Thorpe <thorpej@...>, Matt Thomas <matt@...>, <tech-kern@...>
Date: Thursday, February 22, 2007 - 1:48 am

I'm not Andy, but as I understand it (and I mention this so I'll be
corrected if I'm wrong), the blocking case would be where another CPU
holds the mutex you need to access the hardware. i.e. CPU 3 is configuring

I don't think that's a function of interrupts-as-threads. I think our
current code will do the same thing now, as will anything where one packet
== one interrupt.

I think this is a reason for interrupt mitigation in NICs and, if
possible, in crypto cards. And/or for doing something completely different

Exactly.

Take care,

Bill
To: <jonathan@...>, Jason Thorpe <thorpej@...>, Matt Thomas <matt@...>, <tech-kern@...>
Date: Thursday, February 22, 2007 - 12:44 pm

In that case, CPU2 would busy-wait until it gets the mutex. Blocking
unconditionally would be expensive. In order to be able to release a mutex
without using atomic instructions, we actually can't easily block while a
running LWP holds it. Waiters block in two cases:

- the owner is not running on a CPU anywhere in the system
- the owner is spinning on the kernel_lock; we yield to prevent deadlock
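
The two cases above amount to a small decision for a waiter. A sketch under assumptions: `owner_state` and `mutex_wait_action` are hypothetical names, and the two flags are taken as given rather than read from real scheduler state.

```c
#include <assert.h>
#include <stdbool.h>

enum wait_action { WAIT_SPIN, WAIT_BLOCK };

/* What the waiter knows about the LWP that holds the mutex. */
struct owner_state {
    bool running_on_cpu;           /* owner is on some CPU right now */
    bool spinning_on_kernel_lock;  /* owner is busy-waiting for kernel_lock */
};

static enum wait_action mutex_wait_action(const struct owner_state *o)
{
    if (!o->running_on_cpu)
        return WAIT_BLOCK;  /* owner cannot release the mutex soon */
    if (o->spinning_on_kernel_lock)
        return WAIT_BLOCK;  /* yield to prevent deadlock */
    return WAIT_SPIN;       /* owner is making progress: busy-wait */
}
```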

In most cases blocking should be the exception to the rule, but there are
situations where that kind of approach would kill us with context switching,
so in the interim we still need the ability to do:

	s = splfoo();
	mutex_enter(&foo_mutex);
	... do something ...
	mutex_exit(&foo_mutex);
	splx(s);

Longer term we need to deal with those kinds of concurrency problems on a
case-by-case basis.

The reader / writer locks always use blocking to synchronize, so they're not
really useful for synchronization from an interrupt handler. At a minimum,
you don't want to try and grab a write hold on the lock from an ISR; there
could be lots of readers to drain out before it's acquired.

Andrew
To: <jonathan@...>
Cc: Andrew Doran <ad@...>, Bill Studenmund <wrstuden@...>, Matt Thomas <matt@...>, <tech-kern@...>
Date: Thursday, February 22, 2007 - 1:24 am

I think eventually we won't have this push-up-the-stack's-throat model  
anymore.  A better solution is for the NIC interrupt to simply  
schedule a puller-thread to grab the packet and process it to  

-- thorpej
To: <jonathan@...>
Cc: Andrew Doran <ad@...>, Bill Studenmund <wrstuden@...>, Jason Thorpe <thorpej@...>, Matt Thomas <matt@...>, <tech-kern@...>
Date: Wednesday, February 21, 2007 - 10:21 pm

You raise valid points.  Let me toss a countervailing point into the
mix: if the more expensive processing is done by a thread, it
becomes possible to have multiple (software) IPsec threads -- one per
core -- for parallel decryptions, multiple ipinput/tcpinput threads,
etc.
To: <tech-kern@...>
Date: Wednesday, February 21, 2007 - 10:14 pm

[ trimmed cc ]


Your modest 50,000 packets/sec would presumably be on hardware that
can do interrupt coalescing, right?  Even if your outline is correct,
I would hope that you're handling more than a single packet in each
interrupt / context switch, which should reduce the number of
context switches somewhat.

-allen

-- 
Allen Briggs  |  http://www.ninthwonder.com/~briggs/  |  briggs@ninthwonder.com
To: Allen Briggs <briggs@...>
Cc: <tech-kern@...>
Date: Wednesday, February 21, 2007 - 10:39 pm

In message <20070222021434.GJ19732@canolog.ninthwonder.com>,

Yes, but I was assuming modest load, and thus little interrupt
mitigation.  With interrupt-mitigation in effect, that hardware can
handle more like 400,000 *non*-IPsec packets.  I don't have saturation
IPsec numbers at my fingertips; the accelerator I was then using was

If (dim) memory serves, the numbers I tossed out were from a setup
with a couple of Windows hosts with 100Mbit NICs using NIC-onboard
IPsec crypto offload.  I don't have access to any of that hardware
anymore, but (again) if memory serves, packets:interrupt ratio was
around 2:1.

If so, interrupt coalescing gets you only a factor of about 2;
and 150,000 context-switches a second are _still_ not your friend.

Heck, even 15,000 are painful, from where we're sitting now.

Sigh, I _really_ need to dig out a redistributable version of that
non-redistributable OCF userland test tool. (OTOH, it needs something
like a bcm5723 to hit the rates I'm tossing around...)

--Jonathan
Cc: <tech-kern@...>
Date: Wednesday, February 21, 2007 - 7:59 pm

AFAIK Solaris uses kernel threads for interrupts. We can look at the
OpenSolaris sources to see how they solved this issue on the SPARC CPU
architecture.

Regards
---------------------------------------------------------------
Adam Hamsik
ICQ 249727910
jabber haad@jabber.org
---------------------------------------------------------------
There are 10 kinds of people in the world. Those who understand
binary numbers, and those who don't.
				
To: haad <haaaad@...>
Cc: <tech-kern@...>
Date: Thursday, February 22, 2007 - 12:51 pm

As a rule, I think we're better off not looking at CDDL/GPL licensed code
when implementing similar functionality. If you find some intractable
problem or you want to see what another system does when it comes to
standards, it can be handy to get hints. Otherwise it seems to me like
asking for copyright problems.

Cheers,
Andrew
To: Matt Thomas <matt@...>
Cc: <tech-kern@...>
Date: Wednesday, February 21, 2007 - 6:09 am

Hi Matt,



I gave this a lot of thought too. As a general solution, I really don't like
it because it is unnecessarily expensive, both in terms of execution time
and (perhaps more importantly) the effort involved in converting all of our
drivers to work this way. Conversely, the changes I have to handle
interrupts using LWPs add 29 instructions to a typical interrupt chain on
x86, to swap stack and curlwp. It works, and it's a solution that can just
be "dropped in".

To reiterate, there are two reasons I want to use LWPs to handle interrupts:
significantly cheaper locking primitives on MP systems, and the ability to
eliminate the nasty deadlocks associated with interrupts/MP and interrupt
priority levels. The intent is *not* to rely heavily on blocking as the main
synchronization mechanism between the top and bottom halves. That's why in
the near term I want to preserve the SPL system for places where it really
does matter. I did a lot of profiling to see where we would need to do this,

To add a bit of information about what I am proposing: the interrupt
priority levels continue to exist, and map onto scheduling priorities.
Interrupts at or below IPL_VM are provided with LWP context on execution,
above that level things would work (for the most part) as they do now,
with spinlocks being used to provide MP atomicity where needed.

To contrast a bit with the FreeBSD implementation, since it has been said we
are going to endure the same kind of performance degradation they have seen:
interrupts involve a lengthy path through C code, after which (from my
reading) a thread is scheduled to handle the interrupt, and the preempted
thread yields the CPU via mi_switch(). That involves taking multiple locks
along the way, touching lots of additional cache lines, making multiple
context switches and so on.

Cheers,
Andrew
To: Andrew Doran <ad@...>
Cc: Matt Thomas <matt@...>, <tech-kern@...>
Date: Wednesday, February 21, 2007 - 2:05 pm

The problem is I don't think it CAN be "dropped in" on a platform like  
VAX, m68k, or SPARC.  It's easy on x86 because of how simple its  
interrupt model is.  But I don't think that's true on platforms that  
have processor-defined interrupt levels and auto-vectored interrupts.

-- thorpej
To: Jason Thorpe <thorpej@...>
Cc: Matt Thomas <matt@...>, <tech-kern@...>
Date: Wednesday, February 21, 2007 - 6:59 pm

On sparc, the priority level is controlled by writing to a register on the
processor, and it's well defined - piece of cake. On x86 there is no way to
do that (at least, not one that will work on PCs in general). We end up
emulating the priority levels in software, because you can have different
sources, different priorities sharing lines. It's a pain! :-)
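
One common way to emulate priority levels in software can be sketched as
follows. This is a simplified model, not the actual x86 code: the names
(cur_level, pending, intr_arrive, spl_raise, spl_restore) are hypothetical,
and a counter stands in for the real handlers. An interrupt that arrives
while the emulated level masks it is recorded as pending and replayed when
the level drops:

```c
#include <assert.h>
#include <stdint.h>

#define NLEVELS 8

static int cur_level;         /* current emulated priority level */
static uint32_t pending;      /* bit n set: a level-n interrupt is deferred */
static int handled[NLEVELS];  /* counters standing in for real handlers */

/* Run the (stand-in) handler with the emulated level raised to 'level'. */
static void run_handler(int level)
{
    int saved = cur_level;
    cur_level = level;
    handled[level]++;
    cur_level = saved;
}

/* Entry point for an interrupt of the given level. */
static void intr_arrive(int level)
{
    assert(level > 0 && level < NLEVELS);
    if (level <= cur_level) {
        pending |= 1u << level;  /* masked in software: remember it */
        return;
    }
    run_handler(level);
}

/* Raise the emulated level; returns the previous level for spl_restore(). */
static int spl_raise(int level)
{
    int old = cur_level;
    if (level > cur_level)
        cur_level = level;
    return old;
}

/* Lower the emulated level and replay deferred interrupts that are no
 * longer masked, highest level first. */
static void spl_restore(int level)
{
    cur_level = level;
    for (int l = NLEVELS - 1; l > level; l--) {
        if (pending & (1u << l)) {
            pending &= ~(1u << l);
            run_handler(l);
        }
    }
}
```

No hardware register is touched when the level changes; the cost is the
bookkeeping and the replay loop.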

Andrew
To: Andrew Doran <ad@...>
Cc: Matt Thomas <matt@...>, <tech-kern@...>
Date: Wednesday, February 21, 2007 - 7:01 pm

On SPARC, it's really actually quite similar to m68k.  But more to the
point, the priorities at which hardware interrupts the CPU are fixed
(on some systems) and you actually have to enter the driver to
de-assert the interrupt.

The x86 is "simple" precisely because of the PIC (you can effectively  
shut up the source by turning off that line at the PIC).

-- thorpej
To: Jason Thorpe <thorpej@...>
Cc: Matt Thomas <matt@...>, <tech-kern@...>
Date: Wednesday, February 21, 2007 - 7:44 pm

Right, but that is not what I am talking about doing.. Crossed wires? :)

Andrew
To: Andrew Doran <ad@...>
Cc: Matt Thomas <matt@...>, <tech-kern@...>
Date: Wednesday, February 21, 2007 - 1:50 pm

Funny, I was talking w/ Jason and Matt about this yesterday at lunch.

There are two problems. Vax has both of them, and m68k has at least one of
them.

One problem is that some systems, like vax, are modal. There's a
difference between running something in interrupt handling context and not.
Matt noted that the vax has separate interrupt stacks. So interrupt code is
more than just code running quickly (low latency) at high priority.

The other problem, which I know mac68k has too, is that you have to make
the hardware shut up as part of the interrupt handling. Otherwise once you
exit the interrupt, you'll just reenter it. So you have to have interrupts
remain disabled until this interrupt handling thread completes. That's not


I think a good model would be something like how the z8530tty driver works
but dusted off. There is a hard interrupt handler that reads the chip. On
receive, it stuffs characters into a ring buffer and then triggers a soft
interrupt. On transmit, it stuffs characters into the chip.

Either way, the hard interrupt handler is small and just does pseudodma. A
software interrupt handling routine then comes along and does the heavy
lifting.

I really like the idea of the latter routine being a thread & using
mutexes. The former, though, I think should remain a fast little routine.
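
The receive side of that model can be sketched roughly like this. It is an
illustration under assumptions, not the z8530tty code: struct rx_ring,
hard_rx(), soft_rx() and the soft_pending flag (standing in for scheduling
a soft interrupt) are all hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 256  /* power of two so indices wrap with a mask */

struct rx_ring {
    uint8_t buf[RING_SIZE];
    volatile unsigned head;      /* advanced only by the hard handler */
    volatile unsigned tail;      /* advanced only by the soft handler */
    volatile bool soft_pending;  /* stand-in for a scheduled soft interrupt */
    unsigned overruns;           /* bytes dropped because the ring was full */
};

/* Hard interrupt half: store one byte read from the chip and request the
 * soft interrupt. No sleeping, no heavy work. */
static void hard_rx(struct rx_ring *r, uint8_t c)
{
    assert(r != NULL);
    unsigned next = (r->head + 1) & (RING_SIZE - 1);
    if (next == r->tail) {       /* ring full: count the loss and return */
        r->overruns++;
        return;
    }
    r->buf[r->head] = c;
    r->head = next;
    r->soft_pending = true;
}

/* Soft half: drain up to 'max' bytes for the heavy lifting (line
 * discipline, wakeups). Returns the number of bytes copied out. */
static size_t soft_rx(struct rx_ring *r, uint8_t *out, size_t max)
{
    assert(r != NULL && out != NULL);
    size_t n = 0;
    while (r->tail != r->head && n < max) {
        out[n++] = r->buf[r->tail];
        r->tail = (r->tail + 1) & (RING_SIZE - 1);
    }
    /* Stay pending if the caller's buffer filled before the ring emptied. */
    r->soft_pending = (r->tail != r->head);
    return n;
}
```

Because each index has a single writer, the hard handler never has to wait
for the soft half; a real driver would still need the appropriate memory
barriers or IPL protection around these accesses.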

Take care,

Bill
To: Bill Studenmund <wrstuden@...>
Cc: Matt Thomas <matt@...>, <tech-kern@...>
Date: Wednesday, February 21, 2007 - 5:12 pm

Hi Bill,


From what I understand, with the vax it's an implementation detail that can
be overcome, albeit a troublesome one. I'm sure there is a way to signal an
external agent to handle the interrupt proper. My understanding is that we
can continue to control the interrupt priority level while that is
happening. Well, we must be able to on some level, otherwise the spl

Ok, well we discussed this briefly offline, and my understanding is that the
problem is some m68k systems don't have a PIC that we can use to control
interrupt sources. That's not a problem; as long as there is some limited
control over the interrupt priority level on the CPU itself, then we have a

Partly as an implementation convenience, and to avoid inversion, the
priority level must remain high. (I described this briefly back in December,
on tech-kern@). It is not something that should present a problem. In order
to complete servicing the interrupt when the handler blocks, we need to will
the ISR's scheduling priority to the LWP that is blocking it, and get
control of the CPU to that LWP in short order. Care needs to be taken to

The same way. While each architecture has its own peculiarities, they should
follow essentially the same pattern. Vax seems pretty unusual in this

For PDMA or networking I like that, where you have interrupts occurring at a
very high rate or where there are chokepoints as you funnel work down. As a
general solution I don't like it, and I can't think of any other modern Unix
that works exclusively that way. As I mentioned, it introduces an additional
burden both on the CPU and on developers.

Andrew