Re: [GIT PULL tip/genirq] Please pull from lost-spurious-irq

Previous message: [thread] [date] [author]
Next message: [thread] [date] [author]
From: Tejun Heo
Date: Monday, August 2, 2010 - 8:28 am

Hello, Thomas.

On 08/02/2010 04:07 PM, Thomas Gleixner wrote:

Oh, I'll explain why I muliplexed timers this way below.


Heh, fair enough.  Let's talk about it.


Do you mean spurious irq polling here?


I suppose you're talking about using per-desc timer.  I first thought
about using a single common timer too but decided against it because
timer events can be merged from timer code (depending on the interval
and slack) and I wanted to use different polling frequencies for
different cases.  We can of course implement logic to put polled irqs
on a list in chronological order but that's basically redoing the work
timer code already does.


Because in many cases IRQ storms are transient and spurious IRQ
polling worsens performance a lot, it's worthwhile to be able to
recover from such conditions, so the extra time period is there to
trigger reenabling of the offending IRQ to see whether the storm is
over now.  Please note that many of this type of IRQ storms are
extremely obscure (for example, happens during resume from STR once in
a blue moon due to bad interaction between legacy ATA state machine in
the controller and the drive's certain behavior) and some are very
difficult to avoid from the driver in any reasonable way.


Or maybe you were talking about something else?


In simulated tests with hosed IRQ routing the behavior was already
quite acceptable, but yeah if peeking at interrupt state can be done
easily that would definitely be an improvement.


Eh, I thought I did pretty good on documenting each mechanism.
Obviously, not enough at all.  :-)

It basically allows drivers to tell the irq code that IRQs are likely
to happen.  In response, IRQ code polls the desc for a while and sees
whether poll detects IRQ handling w/o actual IRQ deliveries.  If that
seems to happen more than usual, IRQ code can determine that IRQ is
not being delivered and enables full blown IRQ polling.

The WARY state is inbetween state between good and bad.  Dropping this
will simplify the logic a bit.  It's basically to avoid false
positives as it would suck to enable polling for a working IRQ line.

Anyways, the basic heuristics is it watches certain number of IRQ
events and count how many are good (via IRQ) and bad (via poll).  If
bad goes over certain threshold, polling is enabled.


The whole thing is executed iff unlikely(desc->status &
IRQ_CHECK_WATCHES) which the processor will be predicting correctly
most of the time.  Do you think it would make any noticeable
difference?


The reason why poll_irq() looks complicated is because different types
of timers on the same desc shares the timer.  The reason why it's
shared this way instead of across different descs of the same type is
to avoid locking overhead in maintaining the timer and linked lists of
its targets.  By sharing the timer in the same desc, everything can be
protected by desc->lock but if we use timers across different descs,
we need another layer of locking.


Delays caused by lost interrupt and by other failures can differ a lot
in their magnitude and, for example, libata does check IRQ delivery
failure in its timeout handler but at that point it doesn't really
matter why the timeout has occurred for recovery of that specific
command.  Too much time has passed anyway.

We definitely can implement such quick polling timeouts which check
for IRQ misdeliveries in drivers so that it can detect such events and
maybe add an interface in the IRQ polling code which the driver can
call to enable polling on the IRQ.  But then again the pattern of such
failures and handling of them would be very similar across different
drivers and we already of most of polling machinary implemented in IRQ
code, so I think it makes sense to have a generic implementation which
drivers can use.  Moreover, given the difference in test/review
coverage and general complexity of doing it right, I think it is far
better to do it in the core code and make it easy to use for drivers.


In usual operation conditions, the timer interval will quickly become
3 seconds w/ 1 sec slack.  I doubt it can cause much problem.  Right?


The same thing as above.  If IRQ delivery is working, it will be
polling every 3 seconds w/ 1 sec slack.  It wouldn't really matter.


That's to avoid the overhead of constantly manipulating the timer for
high frequency commands.  expect/unexpect fast paths don't do much, no
locking, no timer manipulations.  No matter how busy the IRQ is, the
timer will only activate and rearmed occassionally.


The main problem here is that expect/unexpect_irq() needs to be low
overhead so that drivers can easily call in w/o worrying too much
about performance overhead and I would definitely want to avoid
locking in fast paths.


To me, the more interesting question is where to put the complexity.
In many cases, the most we can do is pushing complexity around, and if
we can make things easier for drivers by making the core IRQ code
somewhat more complex, I'll definitely go for that every time.

That's also the main theme of this IRQ spurious/missing patchset.  The
implementation might be a bit complex but using it from the driver
side is almost no brainer.  The drivers just need to give hints to the
IRQ polling code and the IRQ polling code will take care of everything
else, and the low overhead for irq_expect/unexpect() is important in
that aspect because drivers can only use it without worrying iff its
overhead is sufficiently low.

That said, I definitely think poll_irq() can be better factored.  Its
current form is mainly due to its transition from spurious poller to
the current multiplexed form.  Factoring out different parts and
possibly killing watch handling should simplify it quite a bit.

Thank you.

-- 
tejun
--
Previous message: [thread] [date] [author]
Next message: [thread] [date] [author]

Messages in current thread:
Re: [GIT PULL tip/genirq] Please pull from lost-spurious-irq, Thomas Gleixner, (Thu Jul 29, 1:44 am)
Re: [GIT PULL tip/genirq] Please pull from lost-spurious-irq, Tejun Heo, (Mon Aug 2, 8:28 am)
Re: [GIT PULL tip/genirq] Please pull from lost-spurious-irq, Thomas Gleixner, (Mon Aug 2, 10:10 am)
Re: [GIT PULL tip/genirq] Please pull from lost-spurious-irq, Thomas Gleixner, (Mon Aug 2, 11:52 am)