Re: [PATCH HACK] powerpc: quick hack to get a functional eHEA with hardirq preemption, eHCA is close

Previous thread: [PATCH] powerpc: ignore generated vmlinux.lds by Sebastien Dugue on Monday, September 15, 2008 - 12:50 am. (1 message)

Next thread: [PATCH] introduce boot_printk() by Yinghai Lu on Monday, September 15, 2008 - 1:05 am. (22 messages)
From: Sebastien Dugue
Date: Monday, September 15, 2008 - 1:04 am

WARNING: HACK - HACK - HACK

  Under the RT kernel (with hardirq preemption) the eHEA driver hangs right
after booting. Fiddling with the hardirqs and softirqs priorities allows to
run a bit longer but as soon as the network gets under load, the hang
returns. After investigating, it appears that the driver is loosing interrupts.

  To make a long story short, looking at the code, it appears that the XICS
maps all its interrupts to level sensitive interrupts (I don't know if it's the
reality or if it's due to an incomplete implementation - no datasheets
available to check) and use the fasteoi processing flow.

  When entering the low level handler, level sensitive interrupts are masked,
then eio'd in interrupt context and then unmasked at the end of hardirq
processing.
That's fine as any interrupt comming in-between will still be processed since
the kernel replays those pending interrupts.

  However, it appears that the eHEA interrupts are behaving as edge sensitive
interrupts and are routed through the XICS which process those as level
sensitive using the fasteoi handler __OR__ the XICS loses interrupts when they
are masked.

  Therefore the masking done in the handler causes any interrupt happening while
in the handler to be lost.

  So this patch maps the interrupts being requested through
ibmebus_request_irq() as edge sensitive interrupts (this concerns both the eHEA
and the eHCA - only users of ibmebus_request_irq()) and changes the way edge
interrupts are processed by the fasteoi handler.

  It works for the eHEA, dunno for the eHCA.

  So, unless all the designers of the XICS & eHEA have been shot to keep it
a secret, could someone knowledgeable shed some light on this issue.

  Thanks,

  Sebastien.

Not-Signed-off-by: Sebastien Dugue <sebastien.dugue@bull.net>
---
 arch/powerpc/kernel/ibmebus.c |   11 ++++++++++-
 kernel/irq/chip.c             |    5 +++--
 kernel/irq/manage.c           |    9 ++++++---
 3 files changed, 19 insertions(+), 6 ...
From: Thomas Klein
Date: Monday, September 15, 2008 - 5:35 am

Hi,

we are a bit worried about putting this into the mainstream part of non real
time linux. There interrupts work perfectly fine, and it was a bit of a
challenge to get there for all cases / configurations / machines.

Could you try to enable these changes only for RT-Linux via a real-time
kconfig switch? This way we make sure we don't break the scheme for
eHEA / eHCA.

Regards,
Jan-Bernd, Christoph



--

From: Sebastien Dugue
Date: Monday, September 15, 2008 - 6:13 am

Hi Thomas, Jan-Bernd, Christoph,


  Heck, I sure do not want this to be applied mainstream nor into any tree.
The sole purpose of this patch was to trigger some reaction from the people who

  Agreed, but the fact that it fails with hardirq preemption leads me to
believe (without any more knowledge about the harware) that there might be
something amiss with this driver (or the code concerning the XICS)

  Nope, this is just a quick hack that allows me to have a functional eHEA under
the rt kernel. I want to understand what the problem is:

  - Is the eHEA really delivering level interrupts to the XICS?

  - Is the XICS loosing interrupts when they are masked?


  Sure, I do not want to break anything, quite the opposite in fact ;-)


  Thanks,

--

From: Anton Vorontsov
Date: Tuesday, September 16, 2008 - 4:59 am

On Mon, Sep 15, 2008 at 03:13:32PM +0200, Sebastien Dugue wrote:

There is a known bug in the -rt kernels, the bug causes handlers
to lose edge interrupts.

See this patch:


-- 
Anton Vorontsov
email: cbouatmailru@gmail.com
irc://irc.freenode.net/bd2
--

From: Sebastien Dugue
Date: Tuesday, September 16, 2008 - 5:22 am

Hi Anton,


  Yes, I've been following that thread back then and my hack is based on your
patch. So yes, it seems to be the same problem and it lies in the way -rt handles
the fasteoi flow.

  But looking at the comments from the XICS code, it seems that its wired for
level only interrupts. Therefore without any more specs, it's still not clear to
me that there's not a bug with the way the xics handles eHEA interrupts.

  Are the eHEA interrupts really level interrupts? If so why do they get lost
when masked?

  I just found that not masking that particular interrupt in the fasteoi path
makes the thing work!

  Thanks,

--

From: Christoph Raisch
Date: Thursday, September 18, 2008 - 12:53 am

This looks a bit like a set_irq_type call.
According to the specs at some point in the system the HEA IRQs have a edge
characteristic.
But since PCI-E edge and level can both be forwarded through a message
interface
(HEA is not PCI-E, it's only connected to the same internal bus, where the
PHB resides)

Anybody from the xics experts want to comment on this?



Gruss / Regards
Christoph R.  & Jan-Bernd

--

From: Sebastien Dugue
Date: Thursday, September 18, 2008 - 2:27 am

Good to know, the problem here seems to be that the xics is using the
fasteoi flow handler, which under unconditionally masks the irq. Will have to
dig in the genirq stuff a bit more to understand what the differences are

  It would be really interresting to know if the eHCA exhibits the same
problem under -rt as it's the only other user of the ibmebus.
Unfortunately I don't have the hardware to test.

  Thanks,

--

From: Christoph Raisch
Date: Thursday, September 18, 2008 - 3:42 am

eHCA is very close from the interrupt generation and handling perspective,
so yes, could be an issue there as well.


Gruss / Regards
Christoph Raisch


--

From: Sebastien Dugue
Date: Thursday, September 18, 2008 - 5:31 am

That's what I was speculating.

  Thanks,

--

From: Jan-Bernd Themann
Date: Tuesday, September 23, 2008 - 8:43 am

Hi,

I think these are the "functional" changes that need to be included in
the ibmebus driver. We'll add a RT flag in the final version to enable
these changes only for RT-Linux for now. 
Ben, can you / your team look into the implementation
of the set_irq_type functionality needed for XICS?

Regards,
Jan-Bernd & Christoph

diff -Nurp b/arch/powerpc/kernel/ibmebus.c a/arch/powerpc/kernel/ibmebus.c
--- b/arch/powerpc/kernel/ibmebus.c	2008-09-22 00:29:55.000000000 +0200
+++ a/arch/powerpc/kernel/ibmebus.c	2008-09-23 12:04:53.000000000 +0200
@@ -216,12 +216,16 @@ int ibmebus_request_irq(u32 ist, irq_han
 			unsigned long irq_flags, const char *devname,
 			void *dev_id)
 {
+	int ret;
 	unsigned int irq = irq_create_mapping(NULL, ist);
 
 	if (irq == NO_IRQ)
 		return -EINVAL;
 
-	return request_irq(irq, handler, irq_flags, devname, dev_id);
+	ret = request_irq(irq, handler, irq_flags, devname, dev_id);
+	set_irq_type(irq, IRQ_TYPE_EDGE_RISING);
+
+	return ret;
 }
 EXPORT_SYMBOL(ibmebus_request_irq);
 


--

From: Milton Miller
Date: Wednesday, September 24, 2008 - 2:58 am

I'm not volunteering to look at or implement any changes for how xics
works with generic irq, but I'm trying to understand what the rt kernel
is trying to accomplish with this statement:


Is this to generate some kind of software managed nesting and priority
of the hardware level interrupts?

The reason I ask is the xics controller can do unlimited nesting
of hardware interrupts.  In fact, the hardware has 255 levels of
priority, of which 16 or so are reserved by the hypervisor, leaving
over 200 for the os to manage.  Higher numbers are lower in priority,
and the hardware will only dispatch an interrupt to a given cpu if
it is currenty at a lower priority.  If it is at a higher priority
and the interrupt is not bound to a specific cpu it will look for
another cpu to dispatch it.  The hardware will not re-present an
irq until the it is EOId (managed by a small state machine per
interrupt at the source, which also handles no cpu available try
again later), but software can return its cpu priority to the
previous level to recieve other interrupt sources at the same level.
The hardware also supports lazy update of the cpu priority register
when an interrupt is presented; as long as the cpu is hard-irq
enabled it can take the irq then write is real priority and let the
hw decide if the irq is still pending or it must defer or try another
cpu in the rejection scenerio.  The only restriction is that the
EOI can not cause an interrupt reject by raising the priority while
sending the EOI command.

The per-interrupt mask and unmask calls have to go through RTAS, a
single-threaded global context, which in addition to increasing
path length will really limit scalability.  The interrupt controller
poll and reject facilities are accessed through hypervisor calls
which are comparable to a fast syscall, and parallel to all cpus.

We used to lower the priority to allow other interrupts in, but we
realized that in addition to the questionable latency in doing so,
it only caused unlimited ...
From: Benjamin Herrenschmidt
Date: Wednesday, September 24, 2008 - 3:17 am

Note also that the XICS code thus assumes, iirc, as does the cell IIC
code, that eoi is called on the -same- cpu that fetched the interrupt
initially. That assumption can be broken with IRQ threads no ?

Ben.


--

From: Milton Miller
Date: Wednesday, September 24, 2008 - 4:02 am

There may be some implicit assumption in that we expect the cpu 
priority to be returned to normal by the EOI, but there is nothing in 
the hardware that requires the EOI to come from the same cpu as 
accepted the interrupt for processing, with the exception of the IPI 
which is per-cpu (and the only interrupt that is per-cpu).

It would probably mean adding the concept of the current cpu priority 
vs interrupts and making sure we write it to hardware at irq_exit() 
time when deferring the actual irq handlers.

The MPIC hardware, on the other hand, maintains a queue of pending 
interrupts (It has been about a decade but the number 4-5 comes to 
mind), and the hardware must receive the EOI on the cpu that took it.  
I am guessing that the handling described (take level irq, mask it, eoi 
it, dispatch the thread, then unmask it after processing) is a result 
to work within those limitations.  Do you know the cell IIC to know if 
its mpic or xics in this regard?

The other unknown is the (very few) platforms that present as xics but 
are really firmware on mpic.  If they do a full virtual layer and don't 
take shortcuts but do eoi/mask like described here they should work, 
but I would not be surprised that does not hold true :-(.

milton

--

From: Benjamin Herrenschmidt
Date: Wednesday, September 24, 2008 - 2:14 pm

Well, there is one fundamental one: The XIRR register we access is
per-CPU, so if we are to return the right processor priority, we must
make sure we write the right XIRR.

Same with Cell, MPIC, actually and a few others. In general I'd say most

I think we need something like a special -rt variant of the fast_eoi
handler that masks & eoi's in ack() before the thread is spun off, and
unmasks instead of eoi() when the irq processing is complete.

Ben.


--

From: Sebastien Dugue
Date: Thursday, September 25, 2008 - 12:31 am

That's already the case as the irq fetch (xx_xirr_info_get()) and
eoi (xx_xirr_info_set()) are both done in interrupt context, therefore on

  This is what is already done in the threaded case:

    - fetch + mask + eoi  in interrupt context

    - unmask in the thread when processing is complete.


  Sebastien.

--

From: Sebastien Dugue
Date: Wednesday, September 24, 2008 - 5:35 am

Hi Ben,


  No, the fetch and the eoi are both done in interrupt context before
the hardirq thread is woken up.

  On the other hand, the mask+eoi and the unmask may well happen
on different cpus as there's only one hardirq thread per irq on
the system. Don't know if this is a problem with the XICS though.

  Thanks,

--

From: Benjamin Herrenschmidt
Date: Wednesday, September 24, 2008 - 2:15 pm

Ok, that's the right approach then. It should work. I don't know what
the specific problems with HEA are at this stage. It doesn't seem to
make sense to implement a set_irq_type(), what would it do ? The
XICS doesn't expose any concept of interrupt type...

Ben.

--

From: Sebastien Dugue
Date: Thursday, September 25, 2008 - 12:18 am

Yep, except as it behaves in way that the current -rt fasteoi flow

  That's what I gathered from looking at the sources.

  Thanks,

  Sebastien.

  

  

   
  
--

From: Benjamin Herrenschmidt
Date: Thursday, September 25, 2008 - 12:22 am

We probably need to make a special xics flow handler for -rt that does
what Milton suggested, ie, bring down the CPU priority right away and
only EOI later or something like that, instead of masking/unmasking.

I don't know what are the other potential issues with the HEA though.

Ben.


--

From: Sebastien Dugue
Date: Thursday, September 25, 2008 - 12:42 am

Do you mean creating a custom fasteoi handler into xics.c? Also, it's
not clear to me from looking at the code how you go about changing the

  Don't know either, but that I can test.

  Thanks,

  Sebastien.
--

From: Benjamin Herrenschmidt
Date: Thursday, September 25, 2008 - 1:36 am

Yup. I think the priority is the CPPR.. Milton can give you more
details, if not, I'll pick it up tomorrow when at the office.

Ben.

--

From: Sebastien Dugue
Date: Thursday, September 25, 2008 - 1:39 am

Thanks Ben, will look into this.

  Nite

  Sebastien.

--

From: Sebastien Dugue
Date: Wednesday, September 24, 2008 - 5:30 am

Hi Milton,


  No, not really. This is only to be sure to not miss interrupts coming
from the same source that were received during threaded hardirq processing.
Some instrumentation showed that it never seems to happen in the eHEA
interrupt case, so I think we can forget this aspect.

  Also, the problem only manifests with the eHEA RX interrupt. For example,
the IBM Power Raid (ipr) SCSI exhibits absolutely no problem under an RT
kernel. From this I conclude that:

  IPR - PCI - XICS is OK
  eHEA - IBMEBUS - XICS is broken with hardirq preemption.

  I also checked that forcing the eHEA interrupt to take the non threaded
path does work.


  Here is a side by side comparison of the fasteoi flow with and without hardirq
threading (sorry it's a bit wide)


					fasteoi flow:
					------------

	Non threaded hardirq			|			threaded hardirq
						|
   interrupt context				|	   interrupt context		hardirq thread
   -----------------				|	   -----------------		--------------
						|
						|
clear IRQ_REPLAY and IRQ_WAITING		|	clear IRQ_REPLAY and IRQ_WAITING
						|
increment percpu interrupt count		|	increment percpu interrupt count
						|
if no action or IRQ_INPROGRESS or IRQ_DISABLED	|	if no action or IRQ_INPROGRESS or IRQ_DISABLED
						|
  set IRQ_PENDING				|	  set IRQ_PENDING
						|
  mask						|	  mask
						|
  eoi						|	  eoi
						|
  done						|	  done
						|
set IRQ_INPROGRESS				|	set IRQ_INPROGRESS
						|
						|
						|	wakeup IRQ thread
						|
						|	mask
						|
						|	eoi
						|
						|	done		     --
						|			       \
						|				\
						|				 \
						|				  --> loop
						|
clear IRQ_PENDING				|				        clear IRQ_PENDING
						|
call handle_IRQ_event				|				        call handle_IRQ_event
						|
						|					check for prempt
						|
						|				      until IRQ_PENDING cleared
						|
						|
clear IRQ_INPROGRESS				|				      clear IRQ_INPROGRESS
						|
if not IRQ_DISABLED				|				      if not ...
From: Milton Miller
Date: Wednesday, September 24, 2008 - 9:42 am

I don't trust "the interrupt can never happen during hea hardirq", 
because I think there will be a race between their rearming the next 
interrupt and the unmask being called.

I was trying to understand why the mask and early eoi, but I guess its 
to handle other more limited interrupt controllers where the interrupts 

For a long period of time, XICS dealt only with level interrupts.   
First Micro Channel, and later PCI buses.  The IPI is made level by 
software conventions.  Recently, EHCA, EHEA, and MSI interrupts were 
added which by their nature are edge based.  The logic that converts 
those interrupts to the XICS layer is responsible for the resend when 

Hmm, I guess I'm confused.  You are saying the irq does not appear if 
it occurs while it is masked?  Well, in that case, I would guess that 
the hypervisor is checking if the irq is previously pending while it 
was masked and resetting it as part of the unmask.   It can't do it on 
level, but can on the true edge sources.  I would further say the 
justification for this might be the hardware might make it pending from 
some previous stale event that might result in the false interrupt on 


(I wrote this next paragraph before parsing the "remove mask and it 
works" / I'm confused paragraph above, so it may not be a problem).

These sources are truly edge.  Once you do an EOI you are taking 
responsibility to do the replay yourself.  In your threaded case, you 
EOI and therefore the hardware will arm for the next event.  When you 
add the mask, the delivery is deferred until it is unmasked at the end 
of your EOI loop.  When you do not, the new interrupt may come in but 
you just EOI it but do not tell the running thread that it happened, 
then you are dropping the irq event.   Since the source is truly edge, 
there is no hardware replay and the interrupt is lost.

(I think the pci express gigabit is one of the few msi interrupt 

While the hypervisor adds a bit of path length (an hcall vs a single 
mmio ...
From: Benjamin Herrenschmidt
Date: Wednesday, September 24, 2008 - 2:16 pm

No Milton, we must do it that way, because the EOI must be done on the
right CPU even on XICS, or we won't get the CPU priority back properly.

Ben.

--

From: Milton Miller
Date: Wednesday, September 24, 2008 - 8:56 pm

Ben and I had a online chat, and he pointed out I needed to be more 


cpu takes interrupt, checks soft disabled
if so,
	set hard disabled
else
	call get_irq
	if threaded
		write cppr to restore this cpu irq dispatch state to non-interrupt
		mark irq thread as irq pending
	else
		handle interrupt
		eoi (cppr = base)

irq thread will
	handle interrupt
	eoi
	wait for marked pending again

The part Ben did not follow was that the cppr write to base priority is 
done by the interrupted cpu (like the mask and eoi in the current flow) 
and only the final eoi (where the mask is in the existing flow) is done 
on which ever cpu happens to run the irq thread.


(optional) As I was discussing with Paul, when taking an irq when 
soft-disabled but still hard enabled, it is possible to write the cppr 
such that it would reject the pending irq and have it be considered for 
dispatch to another cpu.   But it would increase pathlength on both the 
go-to-hard-disabled and return-from-hard-disabled and the hardware will 
have some latency as it will likely send it back to the io source until 
it retrys, so we would only want to do this if the hard-disable period 
is sufficiently long.

milton

--

From: Sebastien Dugue
Date: Thursday, September 25, 2008 - 1:45 am

So do I, it was just to make sure I was not hit by another interrupt while
handling the previous one and thus reduce the number of hypothesis.

  I sure do not say that it cannot happen, just that that path is not taken


  Whoops, my bad, in the non threaded case, there's no mask at all, only an

  Looks like it is, but I cannot say for sure, the only observable effect


  That may be, but I'm only looking at the code (read no specifications at hand)

  Now, that is quite interesting then. Those mask() and unmask() should then
be called shutdown() and startup() and not at each interrupt or am I

  This confirms, then, that the mask and unmask methods should be empty





        ^



  Thanks a lot Milton for those explanations,


  Sebastien.







--

Previous thread: [PATCH] powerpc: ignore generated vmlinux.lds by Sebastien Dugue on Monday, September 15, 2008 - 12:50 am. (1 message)

Next thread: [PATCH] introduce boot_printk() by Yinghai Lu on Monday, September 15, 2008 - 1:05 am. (22 messages)