Re: [RFC 0/4] dynamically allocate arch specific system vectors

Previous thread: [PATCH] cgroup.c: Some 'hlist_head' function fixes. by Rakib Mullick on Friday, August 8, 2008 - 8:29 am. (2 messages)

Next thread: [2.6 patch] fix watchdog/shwdt.c compilation by Adrian Bunk on Friday, August 8, 2008 - 8:39 am. (1 message)
From: Alan Mayer
Date: Friday, August 8, 2008 - 8:37 am

Subject: [PATCH] x86_64:  (NEW) Dynamically allocate arch specific 
system vectors

From: Alan Mayer <ajm@sgi.com>

On some systems (e. g., UV) it is necessary to use an interrupt vector
as a "system" vector, that is, it is generated by system hardware, not an
IO device.  This patch dynamically allocates them from the pool of interrupt
vectors below the fixed system vectors.  This may include stealing some from
the device interrupt vector pool, so they are allocated dynamically so that
other archs don't have to pay the price.  In UV, examples of these hardware
and software systems that need dynamically allocated vectors are the GRU,
the BAU, and XPM/XPC.

Signed-off-by: Alan Mayer <ajm@sgi.com>

Reviewed by:  Robin Holt <holt@sgi.com> Dean Nelson <dcn@sgi.com> Cliff 
Wickman <cpw@sgi.com>

---
Index: linuxnext.latest/arch/x86/kernel/io_apic_64.c
===================================================================
--- linuxnext.latest.orig/arch/x86/kernel/io_apic_64.c	2008-08-07 
09:46:37.000000000 -0500
+++ linuxnext.latest/arch/x86/kernel/io_apic_64.c	2008-08-07 
13:26:18.000000000 -0500
@@ -85,10 +85,6 @@

  static int assign_irq_vector(int irq, cpumask_t mask);

-int first_system_vector = 0xfe;
-
-char system_vectors[NR_VECTORS] = { [0 ... NR_VECTORS-1] = 
SYS_VECTOR_FREE};
-
  #define __apicdebuginit  __init

  int sis_apic_bug; /* not actually supported, dummy for compile */
@@ -770,7 +766,7 @@
  	return irq;
  }

-static int __assign_irq_vector(int irq, cpumask_t mask)
+static int __assign_irq_vector(int irq, int priority, cpumask_t *mask)
  {
  	/*
  	 * NOTE! The local APIC isn't very good at handling
@@ -783,63 +779,99 @@
  	 * Also, we've got to be careful not to trash gate
  	 * 0x80, because int 0x80 is hm, kind of importantish. ;)
  	 */
-	static int current_vector = FIRST_DEVICE_VECTOR, current_offset = 0;
+	static int current_device_vector = FIRST_DYNAMIC_VECTOR;
+	static int current_device_offset;	/* initially 0 */
+	int ...
From: Ingo Molnar
Date: Monday, August 11, 2008 - 9:59 am

patch has severe inlined-as-text corruption, please check 
Documentation/email-clients.txt about how to send patches. (or send it 
as an attachment, i can process that)

Also, given the extensive feedback from Eric, it would be nice to have 
his Acked-by line as well to any patch that is resubmitted for 
inclusion, to make sure you meet all the requirements he has outlined.

Thanks,

	Ingo
--

From: Alan Mayer
Date: Monday, August 11, 2008 - 10:14 am

Okay, here it is as an attachment.  I think my email client is munging it.
I haven't been able to fix it, apparently.

I, too, would like to know what Eric thinks.

		--ajm


-- 
I know
it's only rock and roll,
But I like it.
--
Alan J. Mayer
SGI
ajm@sgi.com
WORK: 651-683-3131
HOME: 651-407-0134
--
From: Eric W. Biederman
Date: Monday, August 11, 2008 - 12:39 pm

I think arch/x86 is about to fall over from accidental complexity of
the irq handling.  Looking at your problem and the problem of killing
NR_IRQS I spent way to much time playing with it this weekend then
I should have, but I think I have found a path that works and is
fairly easily verifiable.

The short version is we make vector_irq the one repository of knowledge
about what we are doing with vectors.

We create a common factor of assign_irq_vector that looks something like:

bool __grab_irq_vector(struct irq_desc *desc, unsigned vector, cpumask_t new_domain)
{
        /* Must be called with vector lock */
        struct irq_cfg *cfg;
        bool grabbed = false;
        unsigned int old_vector;
        cpumask_t mask;
        int cpu;

        cfg = get_irqp_cfg(irq);
        old_vector = cfg->vector;
        cpus_and(mask, new_domain, cpu_online_map);

        for_each_cpu_mask_nr(cpu, mask) {
		if (per_cpu(vector_irq, cpu)[vector])
                	goto out;
        }
        /* Available reserve it */
        for_each_cpu_mask_nr(cpu, mask)
  	      per_cpu(vector_irq, cpu)[vector] = desc;
        if (cfg->vector) {
        	cfg->move_in_progress;
                cfg->old_domain = cfg->domain;
        }
        cfg->vector = vector;
        cfg->domain = mask;
        grabbed = true;
        
out:
        return grabbed;
}

Then in your allocator for per cpu irqs you can do:
spin_lock(&vector_lock);
for (vector = FIRST_VECTOR; vector != LAST_VECTOR, vector--) {
	if (__grab_irq_vector(desc, CPU_MASK_ALL))
        	goto found;
}
spin_unlock(&vector_lock);

Although I am not at all convinced that dynamic allocation of
the vector number (instead of statically reserving it makes sense).
The only way I can see to guarantee all of the special is to
statically allocate them with a lot of good comments.  I think
the introduction of system_vectors quite likely defeated the
errata work around we have the lapic timer in a separate priority.

Still if we ...
From: Ingo Molnar
Date: Monday, August 11, 2008 - 12:51 pm

it was in that state for many years already ;-) Unification, cleanups of 
other historic messes and the constant push for new hw support just made 
it stand out more visibly. IRQ and APIC code unification is definitely 


i'm very interested in it, even if it's incomplete and wont build/boot 
at all. So please consider posting your existing incomplete series as an 
RFC right now, maybe we can help finish it sooner than you will find the 
time? We can put it into a new tip/x86/irq-unification branch, without 
merging it into tip/master just yet.

	Ingo
--

From: Jeremy Fitzhardinge
Date: Monday, August 11, 2008 - 12:55 pm

Seconded.  This is the area which most concerns me wrt Xen dom0 
integration, and getting it right should make the Xen-specific parts 
turn out as minor adjuncts.

    J
--

From: Eric W. Biederman
Date: Monday, August 11, 2008 - 1:10 pm

Well I think we are looking at the final couple of straws...

Yes unification (aka porting the infrastructure improvements

Sure. Mostly I just need to allocate a few minutes to post it, later today
or tomorrow hopefully.

Eric


--

From: Alan Mayer
Date: Monday, August 11, 2008 - 1:02 pm

Okay, so we'll wait for Eric to send out his patch and work off that.



Our system requires some extra system vectors.  They are meaningless on
other systems.  So, rather than statically allocate them for everyone

If I can get a sense of where you're headed with your patch and you don't mind,
maybe I can do the last 10%.

		--ajm

-- 
Somebody just stopped callin' you "Angel."
--
Alan J. Mayer
SGI
ajm@sgi.com
WORK: 651-683-3131
HOME: 651-407-0134
--
--

From: Dean Nelson
Date: Thursday, September 11, 2008 - 8:23 am

We (SGI) need somewhere around eight vectors.

There are two kernel modules, sgi-gru and sgi-xp (in drivers/misc), that
each need two vectors. And there's the broadcast assist unit (BAU) that is
involved in tlb shootdown on uv, which currently uses statically reserved
vector 0xf8 (UV_BAU_MESSAGE -- see uv_bau_init()). I know of a debugger that
also uses 0xf8 because it was previously available until UV_BAU_MESSAGE came
along. The BAU would be happy with a dynamically allocated system vector.
We have a couple of other things in the works that also need vectors.

All of these eight or so vectors are only meaningful on SGI uv systems.

We'll go with which ever way you decide on this (dynamically allocated or
statically reserved).

But we really need to get something into 2.6.28. So in order to move forward
I'm submitting the following patchset for comment. This set of four patches
represents my take on Eric's suggested approach to supporting dynamically
allocated system vectors (i.e., __grab_irq_vector()).

Eric, is this what you were thinking of? Or did I miss the mark?

Thanks,
Dean
--

From: Dean Nelson
Date: Thursday, September 11, 2008 - 8:25 am

Change per_cpu variable vector_irq[] from holding an 'int' irq number to
holding a 'struct irq_desc' pointer.

Signed-off-by: Dean Nelson <dcn@sgi.com>

---

Note that this pre-allocates the irq_desc structure before we know whether
a vector will be successfully found. And should it not, the irq_desc structure
is left created with the irq still unallocated. Should someone in the future
attempt to allocate a vector with that same irq and succeed, they will get
the previously allocated irq_desc structure.

Also, I won't claim the changes to arch/x86/xen/irq.c were correctly done, and
I know they weren't tested.

 arch/x86/kernel/io_apic.c     |   32 ++++++++++++++++++--------------
 arch/x86/kernel/irq_32.c      |   14 ++++++--------
 arch/x86/kernel/irq_64.c      |   10 ++++------
 arch/x86/kernel/irqinit_32.c  |   29 +++++++++--------------------
 arch/x86/kernel/irqinit_64.c  |   29 +++++++++--------------------
 arch/x86/kernel/vmiclock_32.c |    2 +-
 arch/x86/xen/irq.c            |    3 ++-
 include/asm-x86/hw_irq.h      |    2 +-
 8 files changed, 50 insertions(+), 71 deletions(-)

Index: linux/arch/x86/kernel/io_apic.c
===================================================================
--- linux.orig/arch/x86/kernel/io_apic.c	2008-09-09 12:57:08.000000000 -0500
+++ linux/arch/x86/kernel/io_apic.c	2008-09-10 09:21:29.000000000 -0500
@@ -1222,6 +1222,7 @@ static int __assign_irq_vector(int irq, 
 	unsigned int old_vector;
 	int cpu;
 	struct irq_cfg *cfg;
+	struct irq_desc *desc;
 
 	cfg = irq_cfg(irq);
 
@@ -1239,6 +1240,8 @@ static int __assign_irq_vector(int irq, 
 			return 0;
 	}
 
+	desc = irq_to_desc_alloc(irq);
+
 	for_each_cpu_mask_nr(cpu, mask) {
 		cpumask_t domain, new_mask;
 		int new_cpu;
@@ -1266,7 +1269,7 @@ next:
 			goto next;
 #endif
 		for_each_cpu_mask_nr(new_cpu, new_mask)
-			if (per_cpu(vector_irq, new_cpu)[vector] != -1)
+			if (per_cpu(vector_irq, new_cpu)[vector] != NULL)
 				goto next;
 		/* Found one! */
 		current_vector ...
From: Dean Nelson
Date: Thursday, September 11, 2008 - 8:27 am

Introduce the dynamic allocation and deallocation of system vectors which
are mapped to irq numbers allowing the use of request_irq()/free_irq().

Signed-off-by: Dean Nelson <dcn@sgi.com>

---

 arch/x86/kernel/apic.c        |    3 
 arch/x86/kernel/io_apic.c     |  264 +++++++++++++++++++++++++++++++++-----
 arch/x86/kernel/irqinit_64.c  |    4 
 include/asm-x86/desc.h        |   13 +
 include/asm-x86/irq_vectors.h |    1 
 include/linux/irq.h           |   13 +
 6 files changed, 258 insertions(+), 40 deletions(-)

Index: linux/arch/x86/kernel/io_apic.c
===================================================================
--- linux.orig/arch/x86/kernel/io_apic.c	2008-09-10 12:08:46.000000000 -0500
+++ linux/arch/x86/kernel/io_apic.c	2008-09-11 07:17:33.000000000 -0500
@@ -1205,7 +1205,34 @@ void unlock_vector_lock(void)
 	spin_unlock(&vector_lock);
 }
 
-static int __assign_irq_vector(int irq, cpumask_t mask)
+bool __grab_irq_vector(struct irq_desc *desc, unsigned int vector,
+		       cpumask_t *new_domain_mask)
+{
+	/* Must be called with vector lock */
+	struct irq_cfg *cfg;
+	int cpu;
+
+	for_each_cpu_mask_nr(cpu, *new_domain_mask) {
+		if (per_cpu(vector_irq, cpu)[vector] != NULL)
+			return false;
+	}
+
+	/* Available reserve it */
+	for_each_cpu_mask_nr(cpu, *new_domain_mask)
+		per_cpu(vector_irq, cpu)[vector] = desc;
+
+	cfg = irq_cfg(desc->irq);
+	if (cfg->vector) {
+		cfg->move_in_progress = 1;
+		cfg->old_domain = cfg->domain;
+	}
+	cfg->vector = vector;
+	cfg->domain = *new_domain_mask;
+
+	return true;
+}
+
+static int __assign_irq_vector(int irq, cpumask_t *mask)
 {
 	/*
 	 * NOTE! The local APIC isn't very good at handling
@@ -1219,42 +1246,40 @@ static int __assign_irq_vector(int irq, 
 	 * 0x80, because int 0x80 is hm, kind of importantish. ;)
 	 */
 	static int current_vector = FIRST_DEVICE_VECTOR, current_offset = 0;
-	unsigned int old_vector;
+	cpumask_t target_cpus_mask;
 	int cpu;
 	struct irq_cfg *cfg;
 	struct irq_desc ...
From: Ingo Molnar
Date: Sunday, September 14, 2008 - 8:39 am

please put it into arch/x86/kernel/irq.c or so.


does not belong into the generic kernel code - it's an x86 property.

	Ingo
--

From: Ingo Molnar
Date: Sunday, September 14, 2008 - 8:46 am

hm. One SGI developer (Mike Travis) is busy removing on-kernel-stack 


... functions.

	Ingo
--

From: Dean Nelson
Date: Thursday, September 11, 2008 - 8:28 am

Replace the current use of system_vectors[] for the allocation of static
system vectors by also using the per_cpu variable vector_irq[].

Signed-off-by: Dean Nelson <dcn@sgi.com>

---

 arch/x86/kernel/apic.c    |    2 --
 arch/x86/kernel/io_apic.c |   22 ++++++++++++----------
 arch/x86/kernel/irq_32.c  |    2 +-
 arch/x86/kernel/irq_64.c  |    2 +-
 include/asm-x86/desc.h    |   21 +++++++++++++--------
 include/asm-x86/irq.h     |    2 ++
 include/linux/irq.h       |    5 +++++
 7 files changed, 34 insertions(+), 22 deletions(-)

Index: linux/arch/x86/kernel/apic.c
===================================================================
--- linux.orig/arch/x86/kernel/apic.c	2008-09-10 12:09:43.000000000 -0500
+++ linux/arch/x86/kernel/apic.c	2008-09-10 12:10:35.000000000 -0500
@@ -119,8 +119,6 @@ EXPORT_SYMBOL_GPL(local_apic_timer_c2_ok
 int first_static_system_vector = 0xfe;
 int last_device_vector = 0xfd;
 
-char system_vectors[NR_VECTORS] = { [0 ... NR_VECTORS-1] = SYS_VECTOR_FREE};
-
 /*
  * Debug level, exported for io_apic.c
  */
Index: linux/include/asm-x86/desc.h
===================================================================
--- linux.orig/include/asm-x86/desc.h	2008-09-10 12:09:43.000000000 -0500
+++ linux/include/asm-x86/desc.h	2008-09-10 12:10:35.000000000 -0500
@@ -6,6 +6,7 @@
 #include <asm/ldt.h>
 #include <asm/mmu.h>
 #include <linux/smp.h>
+#include <linux/irq.h>
 
 static inline void fill_ldt(struct desc_struct *desc,
 			    const struct user_desc *info)
@@ -329,14 +330,18 @@ extern char system_vectors[];
 
 static inline void alloc_static_system_vector(int vector)
 {
-	if (system_vectors[vector] == SYS_VECTOR_FREE) {
-		system_vectors[vector] = SYS_VECTOR_ALLOCED;
-		if (first_static_system_vector > vector)
-			first_static_system_vector = vector;
-		if (last_device_vector > vector - 1)
-			last_device_vector = vector - 1;
-	} else
-		BUG();
+	unsigned long flags;
+	bool ret;
+
+	spin_lock_irqsave(&vector_lock, flags);
+	ret = ...
From: Dean Nelson
Date: Thursday, September 11, 2008 - 8:29 am

Replace the current use of used_vectors[] for the allocation of a non-standard
SYSCALL_VECTOR by also using the per_cpu variable vector_irq[].

Signed-off-by: Dean Nelson <dcn@sgi.com>

---

 arch/x86/kernel/traps_32.c            |   17 +++++++++++------
 drivers/lguest/interrupts_and_traps.c |   28 ++++++++++++++++++++++------
 include/asm-x86/irq.h                 |    3 ---
 3 files changed, 33 insertions(+), 15 deletions(-)

Index: linux/arch/x86/kernel/traps_32.c
===================================================================
--- linux.orig/arch/x86/kernel/traps_32.c	2008-09-10 14:25:23.000000000 -0500
+++ linux/arch/x86/kernel/traps_32.c	2008-09-10 14:25:28.000000000 -0500
@@ -63,9 +63,6 @@
 
 #include "mach_traps.h"
 
-DECLARE_BITMAP(used_vectors, NR_VECTORS);
-EXPORT_SYMBOL_GPL(used_vectors);
-
 asmlinkage int system_call(void);
 
 /* Do we ignore FPU interrupts ? */
@@ -1189,6 +1186,8 @@ asmlinkage void math_emulate(long arg)
 void __init trap_init(void)
 {
 	int i;
+	bool ret;
+	unsigned long flags;
 
 #ifdef CONFIG_EISA
 	void __iomem *p = early_ioremap(0x0FFFD9, 4);
@@ -1236,10 +1235,16 @@ void __init trap_init(void)
 	set_system_gate(SYSCALL_VECTOR, &system_call);
 
 	/* Reserve all the builtin and the syscall vector: */
-	for (i = 0; i < FIRST_EXTERNAL_VECTOR; i++)
-		set_bit(i, used_vectors);
+	spin_lock_irqsave(&vector_lock, flags);
+	for (i = 0; i < FIRST_EXTERNAL_VECTOR; i++) {
+		ret =__grab_irq_vector(NON_IRQ_DESC, i, &cpu_possible_map);
+		BUG_ON(ret == false);
+	}
 
-	set_bit(SYSCALL_VECTOR, used_vectors);
+	ret = __grab_irq_vector(NON_IRQ_DESC, SYSCALL_VECTOR,
+				&cpu_possible_map);
+	BUG_ON(ret == false);
+	spin_unlock_irqrestore(&vector_lock, flags);
 
 	/*
 	 * Should be a barrier for any external CPU state:
Index: linux/include/asm-x86/irq.h
===================================================================
--- linux.orig/include/asm-x86/irq.h	2008-09-10 14:25:23.000000000 -0500
+++ ...
From: Ingo Molnar
Date: Sunday, September 14, 2008 - 8:40 am

break the build if IO_APIC is disabled. (see previous mail)

	Ingo
--

From: Ingo Molnar
Date: Sunday, September 14, 2008 - 8:42 am

please use __grab_irq_vector() to standard return type: -EINVAL (or 
-EBUSY) on failure, vector on success. That will get rid of the 'ret == 
false' ugliness too.

	Ingo
--

From: H. Peter Anvin
Date: Thursday, September 11, 2008 - 1:04 pm

Are these kernel-internal vectors, or exposed to userspace (i.e. the INT 
instruction works in userspace)?  From what I'm gathering, I think this 
is the former.

There are the occational user who wants fixed user-space addressible 
vectors, effectively as secondary system call entry vectors (lguest is 
the only in-kernel user that I know of); those need to be static and 
non-conflicting, which is kind of difficult.  It seems to me that 
anything else probably should be possible to be dynamic.

	-hpa
--

From: Dean Nelson
Date: Friday, September 12, 2008 - 4:46 am

Yeah, these are kernel-internal vectors and are not exposed to userspace.
--

From: Ingo Molnar
Date: Sunday, September 14, 2008 - 8:35 am

while i understand the UV_BAU_MESSAGE case (TLB flushes are special), 
why does sgi-gru and sgi-xp need to go that deep? They are drivers, they 
should be able to make use of an ordinary irq just like the other 2000 

which debugger is this?

	Ingo
--

From: Ingo Molnar
Date: Sunday, September 14, 2008 - 8:48 am

... but all in one, i still like this concept as it's a nice clean-up. I 
just dont think it should be exposed to generic drivers, and i think it 
needs further fixes and cleanups. (see my previous mails)

	Ingo
--

From: Dean Nelson
Date: Monday, September 15, 2008 - 2:50 pm

The sgi-gru driver needs to be able to allocate a single irq/vector pair for
all CPUs even those that are not currently online. The sgi-xp driver has
similar but not as stringent needs.

The current __assign_irq_vector() restricts the allocation of the irq/vector

KDB.
--

From: Ingo Molnar
Date: Tuesday, September 16, 2008 - 1:24 am

why does it need to allocate a single irq/vector pair? Why is a regular 
interrupt not good?

	Ingo
--

From: Dean Nelson
Date: Tuesday, September 16, 2008 - 1:46 pm

When you speak of a 'regular interrupt' I assume you are referring to simply
the irq number, with the knowledge of what vector and CPU(s) it is mapped to
being hidden?


    sgi-gru driver

The GRU is not an actual external device that is connected to an IOAPIC.
The gru is a hardware mechanism that is embedded in the node controller
(UV hub) that directly connects to the cpu socket. Any cpu (with permission)
can do direct loads and stores to the gru. Some of these stores will result
in an interrupt being sent back to the cpu that did the store.

The interrupt vector used for this interrupt is not in an IOAPIC. Instead
it must be loaded into the GRU at boot or driver initialization time.

The OS needs to route these interrupts back to the GRU driver interrupt
handler on the cpu that received the interrupt. Also, this is a performance
critical path. There should be no globally shared cachelines involved in the
routing.

The actual vector associated with the IRQ does not matter as long as it is
a relatively high priority interrupt. The vector does need to be mapped to
all of the possible CPUs in the partition. The GRU driver needs to know
vector's value, so that it can load it into the GRU.

    sgi-xp driver

The sgi-xp driver utilizes the node controller's message queue capability to
send messages from one system partition (a single SSI) to another partition.

A message queue can be configured to have the node controller raise an
interrupt whenever a message is written into it. This configuration is
accomplished by setting up two processor writable MMRs located in the
node controller. The vector number and apicid of the targeted CPU need
to be written into one of these MMRs. There is no IOAPIC associated with
this.

So one thought was that, once insmod'd, sgi-xp would allocate a message queue,
allocate an irq/vector pair for a CPU located on the node where the message
queue resides, and then set the MMRs with the memory address and length of the
message queue and the ...
From: Dimitri Sivanich
Date: Wednesday, September 17, 2008 - 10:30 am

In addition to the above, the high resolution RTC timers in the UV hardware require that a vector be specified in order to send an interrupt to a specific destination when a timer expires.  The MMR's for these timers require a vector to be or'ed in with other values, including the interrupt's destination.  This is therefore done at run-time.

Like the GRU's vector, this vector is not in an IOAPIC.  This vector would be made available to all cpu's within a partition (SSI) and should be coupled with a per-cpu irq.

This is very similiar to what was available in earlier SGI hardware and used in drivers/char/mmtimer.c.
--

From: Eric W. Biederman
Date: Wednesday, September 17, 2008 - 11:59 am

I need to look at these patches some more (apologies I got busy).

We can not assign a vector to all CPUS.

We don't have the fields for vector -> irq mapping for cpus that
are not-online.  So we can only assign to cpus that are online now.
And latter add the other cpus when they come online.

I have had that oops, it sucks, I don't want to go back.

Eric



--

From: Dean Nelson
Date: Thursday, September 18, 2008 - 6:37 am

Thanks for taking a look.

I've got a not-quite-complete new version of the patchset that addresses
most of the issues raised by Ingo. At it's heart is the following:


 arch/x86/kernel/irq.c___(new)_________________________________

#include <linux/irq.h>

static cpumask_t domain_online;

int grab_irq_vector(struct irq_desc *desc, unsigned int vector,
		   cpumask_t *domain)
{
	/* Must be called with vector lock held */
	int cpu;

	cpus_and(domain_online, *domain, cpu_online_map);

	for_each_cpu_mask_nr(cpu, domain_online) {
		if (per_cpu(vector_irq, cpu)[vector] != NULL)
			return -EBUSY;
	}

	/* Available reserve it */
	for_each_cpu_mask_nr(cpu, domain_online)
		per_cpu(vector_irq, cpu)[vector] = desc;

	return vector;
}


 arch/x86/kernel/io_apic.c_____________________________________

static int ioapic_grab_irq_vector(struct irq_desc *desc, unsigned int vector,
				  cpumask_t *domain)
{
	/* Must be called with vector lock held */
	struct irq_cfg *cfg;
	int ret;

	ret = grab_irq_vector(desc, vector, domain);
	if (ret == vector) {
		cfg = irq_cfg(desc->irq);
		if (cfg->vector) {
			cfg->move_in_progress = 1;
			cfg->old_domain = cfg->domain;
		}
		cfg->vector = vector;
		cfg->domain = *domain;
	}
	return ret;
}

I've also restructured the order of the patchset so that the first
three patches switch vector_irq from an irq # to an irq_desc pointer, 
replace system_vectors[] usage by a call to grab_irq_vector(), and
finally replace used_vectors[] usage by a call to grab_irq_vector().
You may find these three patches meaningful in and of themselves
since Ingo seemed to indicate that they cleaned up some of the code.

The last patch adds the dynamic allocate of system irq, which, if I'm
understanding correctly, needs to be reworked so that SGI's UV irq
needs get satisfied through a variant of MSI. The MSI code isn't

I reworked the above functions so that grab_irq_vector() enforces
online CPUS only. (I'm assuming your statements applies to ...
From: H. Peter Anvin
Date: Thursday, September 18, 2008 - 12:18 pm

What would seem most intuitive to me would have been, in this case, to
add this as a new MSI IRQ chip type.  We already have several of those
(PCI vs HT), and this *should* give you all the hooks you need.

	-hpa

--

From: H. Peter Anvin
Date: Wednesday, September 17, 2008 - 12:15 pm

Could you clarify there: is this one vector number per CPU, or are you 
issuing a specific vector number and just varying the CPU number?

	-hpa
--

From: Jack Steiner
Date: Wednesday, September 17, 2008 - 1:21 pm

It is one vector for each cpu.

It is more efficient for software if the vector # is the same for all cpus
but the software/hardware can support a unique vector for each cpu. This
assumes, of course, that the driver can determine the irq->vector mapping for
each cpu.


<probably-more-detail-than-you-want>

Physically, the system contains a large number of blades. Each blade has
several processor sockets plus a UV hub (node controller).  There are 2 GRUs
located in each UV hub.

Each GRU supports multiple users simultaneously using the GRU.
Each user is assigned a context number (0 .. N-1). If an exception occurs,
the GRU uses the context number as an index into an array of [vector-apicid] pairs.
The [vector-apicid] identifies the cpu & vector for the interrupt.

Although supported by hardware, we do not intend to send interrupts
off-blade.

The array of [vector-apicid] pairs is located in each GRU and must be
initialized at boot time or when the driver is loaded. There is a
separate array for each GRU.

When the driver receives the interrupt, the vector number (or IRQ number) is
used by the driver to determine the GRU that sent the interrupt.


The simpliest scheme would be to assign 2 vectors - one for each GRU in the UV hub.
Vector #0 would be loaded into each "vector" of the  [vector-apicid] array for GRU
#0; vector #1 would be loaded into the [vector-apicid] array for GRU #1.

The [vector-apicid] arrays on all nodes would be identical as far as vectors are
concerned. (Apicids would be different and would target blade-local cpus).
Since interrupts are not sent offnode, the driver can use the vector (irq)
to uniquely identify the source of the interrupt.

However, we have a lot of flexibilty here. Any scheme that provides the right
information to the driver is ok. Note that servicing of these interrupts
is likely to be time critical. We need this path to be as efficient as possible.



--- jack




--

From: Eric W. Biederman
Date: Wednesday, September 17, 2008 - 3:15 pm

Why?  Especially in terms of irq counting that would seem to lead to cache

That sounds like you have a non-standard MSI-X vector.  You certainly have all of
the same properties.  At which point create_irq() sounds like what you want.

One irq per cpu, per device.

It is the trend.  Don't worry all of the high performance drivers are doing it.
That is the path that will be optimized.

What you are proposing is some silly side path that will be ignored, and will be
increasingly less well supported over time as no other hardware does that.  Please
join the rest of the world.  Weird formats formats for programming irq information
into the hardware are easier to support than many other weird restrictions.

What function does the GRU perform that makes it more important and more special
than other hardware devices that requires it to have a high priority interrupt?

Eric

--

From: H. Peter Anvin
Date: Wednesday, September 17, 2008 - 6:09 pm

In particular, it's just another interrupt type.  We already have quite 
a few of those, from XT-PIC to the various IOAPIC ones, to MSI and MSI-X.

Just treating these as variants of MSI seems to me to make most sense, too.

	-hpa

--

From: Jack Steiner
Date: Thursday, September 18, 2008 - 12:10 pm

Functionally, it does not matter. However, if the IRQ is not a per-cpu IRQ, a
very large number of IRQs (and vectors) may be needed. The GRU requires 32 interrupt
lines on each blade. A large system can currently support up to 512 blades.

After looking thru the MSI code, we are starting to believe that we should separate
the GRU requirements from the XPC requirements. It looks like XPC can easily use
the MSI infrastructure.  XPC needs a small number of IRQs, and interrupts are typically
targeted to a single cpu. They can also be retargeted using the standard methods.

The GRU, OTOH, is more like a timer interrupt or like a co-processor interrupt.
GRU interrupts can occur on any cpu using the GRU. When interrupts do occur, all that
needs to happen is to call an interrupt handler. I'm thinking of something like
the following:

	- permanently reserve 2 system vectors in include/asm-x86/irq_vectors.h
	- in uv_system_init(), call alloc_intr_gate() to route the
	  interrupts to a function in the file containing uv_system_init().
	- initialize the GRU chipset with the vector, etc, ...
	- if an interrupt occurs and the GRU driver is NOT loaded, print
	  an error message (rate limited or one time)

	- provide a special UV hook for the GRU driver to register/deregister a
--

From: Eric W. Biederman
Date: Thursday, September 18, 2008 - 5:28 pm

Every vendor of high end hardware is saying they intend to provide
1 or 2 queues per cpu and 1 irq per queue.  So the GRU is not special in
that regard.  Also a very large number of IRQs is not a problem as
soon as we start dynamically allocating them, which is currently
in progress.

Once we start dynamically allocating irq_desc structures we can put
them in node-local memory and guarantee there is no data shared between

Alright. 

I would be completely happy if there were interrupts who's affinity we can

That would work.  So far the GRU doesn't sound that special.

For a lot of this I would much rather solve the general case on this
giving us a solution that works for all high end interrupts rather than
one specific solution just for the GRU.  Especially since it looks like
we have most of the infrastructure already present to solve the general
case and we have to develop and review the specific case from scratch.

Eric
--

From: Ingo Molnar
Date: Friday, September 19, 2008 - 1:48 am

ok, great.

Dean, just to make sure the useful bits are not lost now that the 
direction has been changed: could you please repost the patchset but 
without the driver API bits? It's still all a nice and useful 
generalization and cleanup of the x86 vector allocation code, and we can 
check it in -tip how well it works in practice.

	Ingo
--

Previous thread: [PATCH] cgroup.c: Some 'hlist_head' function fixes. by Rakib Mullick on Friday, August 8, 2008 - 8:29 am. (2 messages)

Next thread: [2.6 patch] fix watchdog/shwdt.c compilation by Adrian Bunk on Friday, August 8, 2008 - 8:39 am. (1 message)