[PATCH] x86_64: Dynamically allocate arch specific system vectors

Previous thread: Heavy suspend and io problems in 2.6.27-rc1-00156-g94ad374 by Nico Schottelius on Thursday, July 31, 2008 - 2:05 pm. (11 messages)

Next thread: [git pull] x86 fixes by Ingo Molnar on Thursday, July 31, 2008 - 2:42 pm. (1 message)
From: Alan Mayer
Date: Thursday, July 31, 2008 - 2:22 pm

Subject: [PATCH] x86_64:  Dynamically allocate arch specific system vectors

From: Alan Mayer <ajm@sgi.com>

On some systems (e. g., UV) it is necessary to use an interrupt vector
as a "system" vector, that is, it is generated by system hardware, not an
IO device.  This patch dynamically allocates them from the pool of interrupt
vectors below the fixed system vectors.  This may include stealing some from
the device interrupt vector pool, so they are allocated dynamically so that
other archs don't have to pay the price.  In UV, examples of these hardware
and software systems that need dynamically allocated vectors are the GRU,
the BAU, and XPM/XPC.

Signed-off-by: Alan Mayer <ajm@sgi.com>

Reviewed by:  Robin Holt <holt@sgi.com> Dean Nelson <dcn@sgi.com> Cliff 
Wickman <cpw@sgi.com>

---
Index: temp/arch/x86/kernel/io_apic_64.c
===================================================================
--- temp.orig/arch/x86/kernel/io_apic_64.c	2008-07-31 11:53:58.000000000 
-0500
+++ temp/arch/x86/kernel/io_apic_64.c	2008-07-31 11:54:27.000000000 -0500
@@ -85,10 +85,6 @@

  static int assign_irq_vector(int irq, cpumask_t mask);

-int first_system_vector = 0xfe;
-
-char system_vectors[NR_VECTORS] = { [0 ... NR_VECTORS-1] = 
SYS_VECTOR_FREE};
-
  #define __apicdebuginit  __init

  int sis_apic_bug; /* not actually supported, dummy for compile */
@@ -783,7 +779,8 @@
  	 * Also, we've got to be careful not to trash gate
  	 * 0x80, because int 0x80 is hm, kind of importantish. ;)
  	 */
-	static int current_vector = FIRST_DEVICE_VECTOR, current_offset = 0;
+	static int current_vector = FIRST_DYNAMIC_VECTOR;
+	static int current_offset = 0;
  	unsigned int old_vector;
  	int cpu;
  	struct irq_cfg *cfg;
@@ -817,10 +814,10 @@
  		offset = current_offset;
  next:
  		vector += 8;
-		if (vector >= first_system_vector) {
+		if (vector > last_dynamic_device_vector) {
  			/* If we run out of vectors on large boxen, must share them. */
  			offset = (offset + 1) % ...
From: Eric W. Biederman
Date: Thursday, July 31, 2008 - 3:10 pm

Could you please explain words why the vector allocator does not work
for you?

This code at first glance looks a lot like duplicating the vector allocator.

Eric
--

From: Cliff Wickman
Date: Friday, August 1, 2008 - 8:51 am

I assume you mean create_irq() and destroy_irq().
They are close to what we need.

However for any given system use:
- we need to request a high priority vector for some irq's, rather than
  one randomly allocated as per __assign_irq_vector().
- we want the irq/vector to be targeted to all cpu's (as specified in a
  mask,
  and can include currently offline cpu's) rather than a single cpu. 

I suppose those abilities could be added to create_irq(), but we didn't
want to intrude into that interface.

A smaller consideration is simplicity of use.  We want any such user to
use
the generic do_IRQ() flow (not alloc_intr_gate()).  But make it easy to
set
up the irq/vector, irq_chip and irq_desc without getting intimate with
the
details.
I suppose some other wrapper for an enhanced create_irq() could be done.

We are going to need such irq/vector pairs for a couple of UV drivers
(drivers/misc/sgi-gru/ and sgi-xp/).  And would prefer it for the UV TLB
shootdown (x86/kernel/tlb.uv.c) rather than using alloc_intr_gate().


-Cliff
-- 
Cliff Wickman
Silicon Graphics, Inc.
cpw@sgi.com
(651) 683-3824
--

From: Eric W. Biederman
Date: Friday, August 1, 2008 - 2:00 pm

Well I was actually thinking mostly about reusing assign_irq_vector.

The high priority vector seems to be the place where we would need



My primary concern is that you are generalizing an interface that is
designed for alloc_intr_gate consumers.  In particular for people
who wish to allocate vectors that can be triggered by software to
do this.

If this is normal irq handling I would really prefer to use the
normal data structures that we use for allocating vectors to irqs,

I also want to ask why do you need an irq that fires on every cpu?

That concept seems to be responsible for one of the nastiest data
structures in the kernel.  Where we track how many times any irq has
occurred on any cpu.  Look at the percpu variable kernel_stat.  It
seems to be one of the nastiest variables I know of.  Of size:
NR_CPUS*NR_CPUS*32  Assuming you have a machine with a reasonable
number of irqs per cpu.

Eric
--

From: Alan Mayer
Date: Friday, August 1, 2008 - 2:51 pm

Okay, I think we have it now.  assign_irq_vector *almost* does what we 
need.  One minor thing is that assign_irq_vector ANDs against 
cpu_online_map.  We would need cpu_possible_map, so we get the vector on 
offline cpus that may come online.  The other thing is that 
assign_irq_vector doesn't allow the specification of interrupt 
priorities.  It would need to be modified to handle returning either a 
high priority vector or a low priority vector.  Would modifying the api 
for assign_irq_vector be the proper approach?

The interrupts don't necessarily fire on all cpus, it's just that they 
*can* fire on any cpu.  For example, the GRU triggers an interrupt (it 
is very IPI'ish) to a particular cpu in the event of a GRU TLB fault. 
That cpu handles the fault and returns.  But the fault can happen on any 
cpu, so all cpus need to be registered for the same vector and irq. 
This is probably splitting hairs; it is certainly no different in 
principal from timer interrupts or processor TLB faults.

As far as kernel_stat is concerned.  I see you're point.  NR_CPUS on our 
machines is going to be big (4K? 8K? something like that).  NR_IRQS is 
also going to big because of that.  It's unfortunate since the actual 
number of interrupt sources is going to be an order of magnitude 
smaller, at least.

		--ajm


-- 
Make me an angel
That flies from Montgomery.
--
Alan J. Mayer
SGI
ajm@sgi.com
WORK: 651-683-3131
HOME: 651-407-0134
--
--

From: Eric W. Biederman
Date: Friday, August 1, 2008 - 3:14 pm

I don't know if it makes sense to modify assign_irq_vector or to 
have a companion function that uses the same data structures.

I think I would work on the companion function and if the code

Reasonable.  As long as you don't need to read a status register to figure
out what to do that sounds reasonable.  This does sound very much like
splitting hairs on a very platform specific capability.

If we can generalize the mechanism to things like per cpu timer
interrupts and such so that we reduced the total amount of code we

The number of interrupts sources is going to be smaller only because
SGI machines have or at least appear to have poor I/O compared to most
of the rest of machines in existence.  NR_CPUS*16 is a fairly
reasonable estimate on most machines in existence.  In the short term
it is going to get worse in the presence of MSI-X.  I was talking to a
developer at Intel last week about 256 irqs for one card.  I keep
having dreams about finding a way to just keep stats for a few cpus
but alas I don't think that is going to happen.  Silly us.

Eric
--

From: Alan Mayer
Date: Monday, August 4, 2008 - 12:37 pm

Okay, If I understand you, here's what we can do.  We currently have 
this function that does pretty much what the combination of create_irq() 
and __assign_irq_vector() do.  We can accomplish the same thing that our
routine does using create_irq() and __assign_irq_vector() do if we make 
the following changes:

__assign_irq_vector(int irq, cpumask_t mask) ==>
	__assign_irq_vector(int irq, cpumask_t mask, int priority);

priority has three values:  priority_none, priority_low, priority_high
priority_none means do everything the way it is done now.
priority_low means do everything the way its is done now, except use 
cpu_possible_map rather than cpu_online_map.
priority_high means search the interrupt vectors from the top down, 
rather than from the bottom up and use cpu_possible_map rather than 
cpu_online_map.

create_irq(void) ==> create_irq(int priority, cpumask_t *mask)
priority_none, means do everything the way it is done now, passing in 
TARGET_CPUS as the mask, but also sending the priority arg. into 
__assign_irq_vector().
priority_low and priority_high means use create_irq()'s mask arg. as the
mask passed to __assign_irq_vector).

We would add an additional small routine on top of create_irq() to do 
any massaging of the irq_desc, etc. that we need for these system vectors.

Is that what you were thinking about?


-- 
It's getting to the point
Where I'm no fun anymore.
--
Alan J. Mayer
SGI
ajm@sgi.com
WORK: 651-683-3131
HOME: 651-407-0134
--
--

From: Mike Travis
Date: Monday, August 4, 2008 - 1:39 pm

Checking to insure that at least one of the cpus is online seems to be
prudent, as well as what happens when the last online cpu in the group
goes offline?

Thanks,
--

Previous thread: Heavy suspend and io problems in 2.6.27-rc1-00156-g94ad374 by Nico Schottelius on Thursday, July 31, 2008 - 2:05 pm. (11 messages)

Next thread: [git pull] x86 fixes by Ingo Molnar on Thursday, July 31, 2008 - 2:42 pm. (1 message)