Re: [PATCH 2/8] Add interface for queuing work on a specific CPU

Previous thread: Build failure on arch/x86/vdso/built-in.o: In function `init_vdso_vars' by Priit Laes on Thursday, February 7, 2008 - 2:01 am. (1 message)

Next thread: Re: Latest git oopses during boot by Andrew Morton on Thursday, February 7, 2008 - 3:02 am. (4 messages)
From: Jens Axboe
Date: Thursday, February 7, 2008 - 2:18 am

Hi,

Since I'll be on vacation next week, I thought I'd send this out in
case people wanted to play with it. It works here, but I haven't done
any performance numbers at all.

Patches 1-7 are all preparation patches for #8, which contains the
real changes. I'm not particularly happy with the arch implementation
for raising a softirq on another CPU, but it should be fast enough
so suffice for testing.

Anyway, this patchset is mainly meant as a playground for testing IO
affinity. It allows you to set three values per queue, see the files
in the /sys/block/<dev>/queue directory:

completion_affinity
	Only allow completions to happen on the defined CPU mask.
queue_affinity
	Only allow queuing to happen on the defined CPU mask.
rq_affinity
	Always complete a request on the same CPU that queued it.

As you can tell, there's some overlap to allow for experimentation.
rq_affinity will override completion_affinity, so it's possible to
have completions on a CPU that isn't set in that mask. The interface
is currently limited to all CPUs or a specific CPU, but the implementation
is supports (and works with) cpu masks. The logic is in
blk_queue_set_cpumask(), it should be easy enough to change this to
echo a full mask, or allow OR'ing of CPU masks when a new CPU is passed in.
For now, echo a CPU number to set that CPU, or use -1 to set all CPUs.
The default is all CPUs for no change in behaviour.

Patch set is against current git as of this morning. The code is also in
the block git repo, branch is io-cpu-affinity.

git://git.kernel.dk/linux-2.6-block.git io-cpu-affinity

-- 
Jens Axboe



--

From: Jens Axboe
Date: Thursday, February 7, 2008 - 2:18 am

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 include/linux/workqueue.h |    1 +
 kernel/workqueue.c        |   15 +++++++++++++++
 2 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 7f28c32..4c46944 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -179,6 +179,7 @@ __create_workqueue_key(const char *name, int singlethread,
 extern void destroy_workqueue(struct workqueue_struct *wq);
 
 extern int FASTCALL(queue_work(struct workqueue_struct *wq, struct work_struct *work));
+extern int FASTCALL(queue_work_on_cpu(struct workqueue_struct *wq, struct work_struct *work, int cpu));
 extern int FASTCALL(queue_delayed_work(struct workqueue_struct *wq,
 			struct delayed_work *work, unsigned long delay));
 extern int queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 52db48e..551ad7e 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -175,6 +175,21 @@ int fastcall queue_work(struct workqueue_struct *wq, struct work_struct *work)
 }
 EXPORT_SYMBOL_GPL(queue_work);
 
+int fastcall queue_work_on_cpu(struct workqueue_struct *wq,
+			       struct work_struct *work, int cpu)
+{
+	int ret = 0;
+
+	if (!test_and_set_bit(WORK_STRUCT_PENDING, work_data_bits(work))) {
+		BUG_ON(!list_empty(&work->entry));
+		__queue_work(per_cpu_ptr(wq->cpu_wq, cpu), work);
+		ret = 1;
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(queue_work_on_cpu);
+
 void delayed_work_timer_fn(unsigned long __data)
 {
 	struct delayed_work *dwork = (struct delayed_work *)__data;
-- 
1.5.4.22.g7a20

--

From: Andrew Morton
Date: Thursday, February 7, 2008 - 2:45 am

-ETOOWIDE


Might as well change queue_work() to call this?

Oleg cc'ed as queue_work() maintainer ;)
--

From: Jens Axboe
Date: Thursday, February 7, 2008 - 2:49 am

Good point, I'll do that locally at least.

-- 
Jens Axboe

--

From: Harvey Harrison
Date: Thursday, February 7, 2008 - 10:44 am

I'll be making another round-up patch at -rc1 for FASTCALL/fastcall

FYI

Harvey

--

From: Oleg Nesterov
Date: Monday, February 11, 2008 - 3:51 am

Sorry for delay,


This is possible, but in that case queue_work_on_cpu() should use wq_per_cpu(),
not per_cpu_ptr(). (otherwise queue_work(single_threaded_wq) won't work).


A bit off-topic, the comment near queue_work() says

	* We queue the work to the CPU it was submitted, but there is no
	* guarantee that it will be processed by that CPU.

This is wrong. Unless cpu_down() happens, we do guarantee it will be processed
by that CPU. Perhaps it makes sense to fix the comment as well?

Oleg.

--

From: Jens Axboe
Date: Thursday, February 7, 2008 - 2:18 am

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 block/Makefile      |    3 +-
 block/blk-core.c    |   88 -------------------------------------------
 block/blk-softirq.c |  103 +++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 105 insertions(+), 89 deletions(-)
 create mode 100644 block/blk-softirq.c

diff --git a/block/Makefile b/block/Makefile
index 5a43c7d..b064190 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -4,7 +4,8 @@
 
 obj-$(CONFIG_BLOCK) := elevator.o blk-core.o blk-tag.o blk-sysfs.o \
 			blk-barrier.o blk-settings.o blk-ioc.o blk-map.o \
-			blk-exec.o blk-merge.o ioctl.o genhd.o scsi_ioctl.o
+			blk-exec.o blk-merge.o blk-softirq.o ioctl.o genhd.o \
+			scsi_ioctl.o
 
 obj-$(CONFIG_BLK_DEV_BSG)	+= bsg.o
 obj-$(CONFIG_IOSCHED_NOOP)	+= noop-iosched.o
diff --git a/block/blk-core.c b/block/blk-core.c
index 4afb39c..11d79f6 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -26,8 +26,6 @@
 #include <linux/swap.h>
 #include <linux/writeback.h>
 #include <linux/task_io_accounting_ops.h>
-#include <linux/interrupt.h>
-#include <linux/cpu.h>
 #include <linux/blktrace_api.h>
 #include <linux/fault-inject.h>
 
@@ -50,8 +48,6 @@ struct kmem_cache *blk_requestq_cachep;
  */
 static struct workqueue_struct *kblockd_workqueue;
 
-static DEFINE_PER_CPU(struct list_head, blk_cpu_done);
-
 static void drive_stat_acct(struct request *rq, int new_io)
 {
 	int rw = rq_data_dir(rq);
@@ -1605,82 +1601,6 @@ static int __end_that_request_first(struct request *req, int error,
 }
 
 /*
- * splice the completion data to a local structure and hand off to
- * process_completion_queue() to complete the requests
- */
-static void blk_done_softirq(struct softirq_action *h)
-{
-	struct list_head *cpu_list, local_list;
-
-	local_irq_disable();
-	cpu_list = &__get_cpu_var(blk_cpu_done);
-	list_replace_init(cpu_list, &local_list);
-	local_irq_enable();
-
-	while (!list_empty(&local_list)) {
-		struct request *rq;
-
-		rq = ...
From: Jens Axboe
Date: Thursday, February 7, 2008 - 2:19 am

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 arch/x86/kernel/smp_32.c                   |   15 +++++++++++++++
 arch/x86/kernel/smpboot_32.c               |    3 +++
 include/asm-x86/hw_irq_32.h                |    1 +
 include/asm-x86/mach-default/entry_arch.h  |    1 +
 include/asm-x86/mach-default/irq_vectors.h |    1 +
 5 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/smp_32.c b/arch/x86/kernel/smp_32.c
index dc0cde9..668b8a4 100644
--- a/arch/x86/kernel/smp_32.c
+++ b/arch/x86/kernel/smp_32.c
@@ -672,6 +672,21 @@ void smp_call_function_interrupt(struct pt_regs *regs)
 	}
 }
 
+fastcall void smp_raise_block_softirq(struct pt_regs *regs)
+{
+	unsigned long flags;
+
+	ack_APIC_irq();
+	local_irq_save(flags);
+	raise_softirq_irqoff(BLOCK_SOFTIRQ);
+	local_irq_restore(flags);
+}
+
+void arch_raise_softirq_on_cpu(int cpu, unsigned int nr)
+{
+	send_IPI_mask(cpumask_of_cpu(cpu), BLOCK_SOFTIRQ_VECTOR);
+}
+
 static int convert_apicid_to_cpu(int apic_id)
 {
 	int i;
diff --git a/arch/x86/kernel/smpboot_32.c b/arch/x86/kernel/smpboot_32.c
index 579b9b7..ca35697 100644
--- a/arch/x86/kernel/smpboot_32.c
+++ b/arch/x86/kernel/smpboot_32.c
@@ -1304,6 +1304,9 @@ void __init smp_intr_init(void)
 
 	/* IPI for generic function call */
 	set_intr_gate(CALL_FUNCTION_VECTOR, call_function_interrupt);
+
+	/* IPI for scheduling block softirq */
+	set_intr_gate(BLOCK_SOFTIRQ_VECTOR, raise_block_softirq);
 }
 
 /*
diff --git a/include/asm-x86/hw_irq_32.h b/include/asm-x86/hw_irq_32.h
index ea88054..0a053d8 100644
--- a/include/asm-x86/hw_irq_32.h
+++ b/include/asm-x86/hw_irq_32.h
@@ -32,6 +32,7 @@ extern void (*const interrupt[NR_IRQS])(void);
 void reschedule_interrupt(void);
 void invalidate_interrupt(void);
 void call_function_interrupt(void);
+void raise_block_softirq(void);
 #endif
 
 #ifdef CONFIG_X86_LOCAL_APIC
diff --git a/include/asm-x86/mach-default/entry_arch.h b/include/asm-x86/mach-default/entry_arch.h
index ...
From: Ingo Molnar
Date: Thursday, February 7, 2008 - 3:07 am

if then this should be a general facility to trigger any softirq - not 

this wastes another irq vector and is very special-purpose. Why not make 
the smp_call_function() one more scalable instead?

on the more conceptual level, shouldnt we just move to threads instead 
of softirqs? That way you can become affine to any CPU and can do 
cross-CPU wakeups anytime - which will be nice and fast via the 
smp_reschedule_interrupt() facility.

	Ingo
--

From: Jens Axboe
Date: Thursday, February 7, 2008 - 3:17 am

Oh yeah, definitely agree, I wrote that in the intro as well. The

That's definitely a possibility, Nick had something like that. I just
didn't like having to allocate a cookie object to store the function and

That would indeed be nicer and not require any arch changes. I was
afraid it would be more costly than massaging the softirqs a bit though,
perhaps that is unfounded.

-- 
Jens Axboe

--

From: Ingo Molnar
Date: Thursday, February 7, 2008 - 3:25 am

pick up the threaded softirq patches from -rt, those move all softirqs 
processing into kernel threads. I'd suggest to extend those via 
wakeup-from-remote functionality - it fits the construct quite 
naturally. You should also be able to directly observe any performance 
impact of threaded softirq handlers. (and if you find any, let me know 
so that we can make it faster :-)

	Ingo
--

From: Jens Axboe
Date: Thursday, February 7, 2008 - 3:31 am

I was just considering that, since I knew -rt moved the softirqs into
threads. I'll look into it, but may not post anything until after my
vacation.

-- 
Jens Axboe

--

From: Ingo Molnar
Date: Thursday, February 7, 2008 - 3:38 am

we should more seriously investigate kernel thread scalability for 
another reason as well: besides -rt, any generic async IO facility we 
pick up will likely heavily rely on them. Kernel thread scheduling is 
quite a bit lighter than user task scheduling [no TLB flushes, etc.] - 
and if it is still not good enough we could probably accelerate them 
some more. (and everyone will benefit) irq-context softirqs on the other 
hand are quite rigid and bring in many atomicity assumptions so they are 
not as natural to program for.

	Ingo
--

From: Jens Axboe
Date: Thursday, February 7, 2008 - 7:18 am

Here's a patch on top of the other patchset that switches blk-softirq.c
to using kernel threads instead. It's pretty simple. Booted tested,
works for me :-)

diff --git a/block/blk-softirq.c b/block/blk-softirq.c
index 5ae8345..73c71c7 100644
--- a/block/blk-softirq.c
+++ b/block/blk-softirq.c
@@ -7,6 +7,8 @@
 #include <linux/bio.h>
 #include <linux/blkdev.h>
 #include <linux/interrupt.h>
+#include <linux/sched.h>
+#include <linux/kthread.h>
 #include <linux/cpu.h>
 
 #include "blk.h"
@@ -14,41 +16,136 @@
 struct blk_comp {
 	spinlock_t lock;
 	struct list_head list;
-	int dead;
+	struct task_struct *task;
 };
-static DEFINE_PER_CPU(struct blk_comp, blk_cpu_done);
+static DEFINE_PER_CPU(struct blk_comp, blk_irq_data);
+
+static void raise_blk_irq(int cpu)
+{
+	struct blk_comp *bc = &per_cpu(blk_irq_data, cpu);
+
+	wake_up_process(bc->task);
+}
+
+static void blk_done_softirq(struct list_head *list)
+{
+	while (!list_empty(list)) {
+		struct request *rq;
+
+		rq = list_entry(list->next, struct request, donelist);
+		list_del_init(&rq->donelist);
+		rq->q->softirq_done_fn(rq);
+	}
+}
+
+static int blk_irq_thread(void *data)
+{
+	struct blk_comp *bc = data;
+	struct sched_param param = { .sched_priority = MAX_USER_RT_PRIO / 2 };
+
+	sched_setscheduler(current, SCHED_FIFO, &param);
+	set_current_state(TASK_INTERRUPTIBLE);
+
+	while (!kthread_should_stop()) {
+		LIST_HEAD(local_list);
+
+		preempt_disable();
+		smp_mb();
+		if (list_empty(&bc->list)) {
+			preempt_enable_no_resched();
+			schedule();
+			preempt_disable();
+		}
+
+		__set_current_state(TASK_RUNNING);
+		smp_mb();
+		while (!list_empty(&bc->list)) {
+			spin_lock_irq(&bc->lock);
+			list_replace_init(&bc->list, &local_list);
+			spin_unlock_irq(&bc->lock);
+
+
+			blk_done_softirq(&local_list);
+		}
+		preempt_enable();
+		set_current_state(TASK_INTERRUPTIBLE);
+	}
+
+	__set_current_state(TASK_RUNNING);
+	return 0;
+}
+
+static int create_blk_irq_thread(int ...
From: Ingo Molnar
Date: Thursday, February 7, 2008 - 3:49 am

oh, you going on a vacation. I am sitting on a few block layer patches 
you might be interested in :-)

i am playing with Vegard Nossum's kmemcheck on x86 (with much help from 
Pekka Enberg for the SLUB details) and it's getting quite promising. 
Kmemcheck is in essence Valgrind for the native kernel - it is "Da Bomb" 
in terms of kernel object lifetime and access validation debugging 
helpers.

it promptly triggered a few uninitialized accesses in the block layer 
(amongst others), resulting in the following 4 fixes (find them below):

  block: restructure rq_init() to make it safer
  block: fix access to uninitialized field
  block: initialize rq->tag
  block: rq->cmd* initialization

i _think_ the actual uninitialized memory accesses are only latent bugs 
not actual runtime bugs because they relate to SCSI init sequences that 
do not truly need the elevator's attention - but rq_init() sure looked a 
bit dangerous in this regard and the elv_next_request() access to those 
uninitialized fields is not nice.

Do these fixes look good to you? I had them in testing for a few days 
already.

	Ingo

--------------------->
Subject: block: restructure rq_init() to make it safer
From: Ingo Molnar <mingo@elte.hu>

reorder the initializations done in rq_init() to both align them
to memory order, and to document them better. They now match
the field ordering of "struct request" in blkdev.h.

No change in functionality:

      text    data     bss     dec     hex filename
      8426       0      20    8446    20fe blk-core.o.before
      8426       0      20    8446    20fe blk-core.o.after

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 block/blk-core.c |   51 +++++++++++++++++++++++++++++++++++----------------
 1 file changed, 35 insertions(+), 16 deletions(-)

Index: linux/block/blk-core.c
===================================================================
--- linux.orig/block/blk-core.c
+++ linux/block/blk-core.c
@@ -106,24 +106,43 @@ void rq_init(struct ...
From: Linus Torvalds
Date: Thursday, February 7, 2008 - 10:42 am

...

Can we please just stop doing these one-by-one assignments, and just do 
something like

	memset(rq, 0, sizeof(*rq));
	rq->q = q;
	rq->ref_count = 1;
	INIT_HLIST_NODE(&rq->hash);
	RB_CLEAR_NODE(&rq->rb_node);

instead?

The memset() is likely faster and smaller than one-by-one assignments 
anyway, even if the one-by-ones can avoid initializing some field or there 
ends up being a double initialization..

			Linus
--

From: Jens Axboe
Date: Thursday, February 7, 2008 - 10:55 am

I completely agree, the fidgeting with single members quickly loses
appeal. Ingo, I'll merge and integrate your fixes before leaving, I'll
be here all day tomorrow as well.

-- 
Jens Axboe

--

From: Ingo Molnar
Date: Thursday, February 7, 2008 - 12:31 pm

i definitely agree and do that for all code i write.

But if someone does item by item initialization for some crazy 
performance reason (networking folks tend to have such constructs), it 
should be done i think how i've done it in the patch: by systematically 
listing _every_ field in the structure, in the same order, and 
indicating it clearly when it is not initialized and why.

and there it already shows that we do not initialize a few other members 
that could cause problems later on:

+       rq->data_len                    = 0;
+       /* rq->sense_len                        */
+       rq->data                        = NULL;
+       rq->sense                       = NULL;

why is sense_len not initialized - while data_len is? In any case, these 
days the memclear instructions are dirt cheap and we should just always 
initialize everything to zero by default, especially if it's almost all 
zero-initialized anyway.

	Ingo
--

From: Jens Axboe
Date: Thursday, February 7, 2008 - 1:06 pm

That assumes that people find the references in two places when adding

because sense isn't set, when someone sets ->sense they should set

Completely agree, some of these are just dormant bugs waiting to happen.
Clearing everything is the sanest approach.

-- 
Jens Axboe

--

From: David Miller
Date: Thursday, February 7, 2008 - 6:22 pm

From: Linus Torvalds <torvalds@linux-foundation.org>

The problem is store buffer compression.  At least a few years
ago this made a huge difference in sk_buff initialization in the
networking.

Maybe cpus these days have so much store bandwith that doing
things like the above is OK, but I doubt it :-)
--

From: Linus Torvalds
Date: Thursday, February 7, 2008 - 6:28 pm

I seriously doubt the same is true for the IO requests (which are 
different anyway, and tend to happen at a much lower frequency than 
high-speed networking).

I also suspect that your timings were very hardware-dependent in the first 
place (and yes, the long-term effect is likely working against that 
"optimization"), and depend a lot on just how sparse the initialization 
can to be.

		Linus
--

From: Arjan van de Ven
Date: Friday, February 8, 2008 - 8:09 am

on modern x86 cpus the memset may even be faster if the memory isn't in cache;
the "explicit" method ends up doing Write Allocate on the cache lines
(so read them from memory) even though they then end up being written entirely.
With memset the CPU is told that the entire range is set to a new value, and
the WA can be avoided for the whole-cachelines in the range.

--

From: Nick Piggin
Date: Friday, February 8, 2008 - 3:44 pm

Don't you have write combining store buffers? Or is it still speculatively
issuing the reads even before the whole cacheline is combined?

--

From: Arjan van de Ven
Date: Friday, February 8, 2008 - 3:56 pm

x86 memory order model doesn't allow that quite; and you need a "series" of at least 64 bytes
without any other memory accesses in between even if it would....
not happening in practice.
--

From: Nick Piggin
Date: Friday, February 8, 2008 - 4:58 pm

OK, fair enough... then it will be a very nice test to see if it
helps. I'm sure you could have an arch specific initialisation
function if it makes a significant difference.
--

From: Jens Axboe
Date: Friday, February 8, 2008 - 4:38 am

Looked into this approach and we can't currently do that here, since
some members of the request are being set from blk_alloc_request() and
then from the io scheduler attaching private data to it. So we have to
preserve ->cmd_flags and ->elevator_private and ->elevator_private2 at
least. Now rq_init() is also used for stored requests, so we cannot just
rely on clearing at allocation time.

So I'd prefer being a little conservative here. The below reorders
rq_init() a bit and clears some more members to be on the safer side,
adding comments to why we cannot memset and an associated comment in
blkdev.h.

diff --git a/block/blk-core.c b/block/blk-core.c
index 4afb39c..fba4ca7 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -102,27 +102,38 @@ struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev)
 }
 EXPORT_SYMBOL(blk_get_backing_dev_info);
 
+/*
+ * We can't just memset() the structure, since the allocation path
+ * already stored some information in the request.
+ */
 void rq_init(struct request_queue *q, struct request *rq)
 {
 	INIT_LIST_HEAD(&rq->queuelist);
 	INIT_LIST_HEAD(&rq->donelist);
-
-	rq->errors = 0;
+	rq->q = q;
+	rq->sector = rq->hard_sector = (sector_t) -1;
+	rq->nr_sectors = rq->hard_nr_sectors = 0;
+	rq->current_nr_sectors = rq->hard_cur_sectors = 0;
 	rq->bio = rq->biotail = NULL;
 	INIT_HLIST_NODE(&rq->hash);
 	RB_CLEAR_NODE(&rq->rb_node);
+	rq->rq_disk = NULL;
+	rq->nr_phys_segments = 0;
+	rq->nr_hw_segments = 0;
 	rq->ioprio = 0;
+	rq->special = NULL;
 	rq->buffer = NULL;
+	rq->tag = -1;
+	rq->errors = 0;
 	rq->ref_count = 1;
-	rq->q = q;
-	rq->special = NULL;
+	rq->cmd_len = 0;
+	memset(rq->cmd, 0, sizeof(rq->cmd));
 	rq->data_len = 0;
+	rq->sense_len = 0;
 	rq->data = NULL;
-	rq->nr_phys_segments = 0;
 	rq->sense = NULL;
 	rq->end_io = NULL;
 	rq->end_io_data = NULL;
-	rq->completion_data = NULL;
 	rq->next_rq = NULL;
 }
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ...
From: Jens Axboe
Date: Thursday, February 7, 2008 - 2:19 am

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 arch/x86/kernel/entry_64.S  |    3 +++
 arch/x86/kernel/i8259_64.c  |    3 +++
 arch/x86/kernel/smp_64.c    |   15 +++++++++++++++
 include/asm-x86/hw_irq_64.h |    2 ++
 4 files changed, 23 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index c7341e8..753834b 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -704,6 +704,9 @@ END(invalidate_interrupt\num)
 ENTRY(call_function_interrupt)
 	apicinterrupt CALL_FUNCTION_VECTOR,smp_call_function_interrupt
 END(call_function_interrupt)
+ENTRY(raise_block_softirq)
+	apicinterrupt BLOCK_SOFTIRQ_VECTOR,smp_raise_block_softirq
+END(raise_block_softirq);
 ENTRY(irq_move_cleanup_interrupt)
 	apicinterrupt IRQ_MOVE_CLEANUP_VECTOR,smp_irq_move_cleanup_interrupt
 END(irq_move_cleanup_interrupt)
diff --git a/arch/x86/kernel/i8259_64.c b/arch/x86/kernel/i8259_64.c
index fa57a15..cd7675b 100644
--- a/arch/x86/kernel/i8259_64.c
+++ b/arch/x86/kernel/i8259_64.c
@@ -494,6 +494,9 @@ void __init native_init_IRQ(void)
 	/* IPI for generic function call */
 	set_intr_gate(CALL_FUNCTION_VECTOR, call_function_interrupt);
 
+	/* IPI for raising block softirq on another CPU */
+	set_intr_gate(BLOCK_SOFTIRQ_VECTOR, raise_block_softirq);
+
 	/* Low priority IPI to cleanup after moving an irq */
 	set_intr_gate(IRQ_MOVE_CLEANUP_VECTOR, irq_move_cleanup_interrupt);
 #endif
diff --git a/arch/x86/kernel/smp_64.c b/arch/x86/kernel/smp_64.c
index 2fd74b0..fe4c2bf 100644
--- a/arch/x86/kernel/smp_64.c
+++ b/arch/x86/kernel/smp_64.c
@@ -460,6 +460,21 @@ int smp_call_function (void (*func) (void *info), void *info, int nonatomic,
 }
 EXPORT_SYMBOL(smp_call_function);
 
+asmlinkage void smp_raise_block_softirq(void)
+{
+	unsigned long flags;
+
+	ack_APIC_irq();
+	local_irq_save(flags);
+	raise_softirq_irqoff(BLOCK_SOFTIRQ);
+	local_irq_restore(flags);
+}
+
+void arch_raise_softirq_on_cpu(int cpu, unsigned int ...
From: Jens Axboe
Date: Thursday, February 7, 2008 - 2:19 am

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 arch/ia64/kernel/smp.c |   14 ++++++++++++++
 1 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/arch/ia64/kernel/smp.c b/arch/ia64/kernel/smp.c
index 4e446aa..96f8ffa 100644
--- a/arch/ia64/kernel/smp.c
+++ b/arch/ia64/kernel/smp.c
@@ -80,6 +80,7 @@ static volatile struct call_data_struct *call_data;
 #define IPI_CALL_FUNC		0
 #define IPI_CPU_STOP		1
 #define IPI_KDUMP_CPU_STOP	3
+#define IPI_BLOCK_SOFTIRQ	4
 
 /* This needs to be cacheline aligned because it is written to by *other* CPUs.  */
 static DEFINE_PER_CPU_SHARED_ALIGNED(u64, ipi_operation);
@@ -174,6 +175,14 @@ handle_IPI (int irq, void *dev_id)
 				unw_init_running(kdump_cpu_freeze, NULL);
 				break;
 #endif
+			      case IPI_BLOCK_SOFTIRQ: {
+				unsigned long flags;
+
+				local_irq_save(flags);
+				raise_softirq_irqoff(BLOCK_SOFTIRQ);
+				local_irq_restore(flags);
+				break;
+			      }
 			      default:
 				printk(KERN_CRIT "Unknown IPI on CPU %d: %lu\n", this_cpu, which);
 				break;
@@ -461,6 +470,11 @@ smp_call_function (void (*func) (void *info), void *info, int nonatomic, int wai
 }
 EXPORT_SYMBOL(smp_call_function);
 
+void arch_raise_softirq_on_cpu(int cpuid)
+{
+	send_IPI_single(cpuid, IPI_BLOCK_SOFTIRQ);
+}
+
 /*
  * this function calls the 'stop' function on all other CPUs in the system.
  */
-- 
1.5.4.22.g7a20

--

From: Jens Axboe
Date: Thursday, February 7, 2008 - 2:19 am

Preparatory patch for checking queuing affinity.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 block/as-iosched.c     |    6 +++---
 block/blk-core.c       |    8 ++++----
 block/cfq-iosched.c    |    2 +-
 include/linux/blkdev.h |    2 +-
 4 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/block/as-iosched.c b/block/as-iosched.c
index 8c39467..6ef766f 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -450,7 +450,7 @@ static void as_antic_stop(struct as_data *ad)
 			del_timer(&ad->antic_timer);
 		ad->antic_status = ANTIC_FINISHED;
 		/* see as_work_handler */
-		kblockd_schedule_work(&ad->antic_work);
+		kblockd_schedule_work(ad->q, &ad->antic_work);
 	}
 }
 
@@ -471,7 +471,7 @@ static void as_antic_timeout(unsigned long data)
 		aic = ad->io_context->aic;
 
 		ad->antic_status = ANTIC_FINISHED;
-		kblockd_schedule_work(&ad->antic_work);
+		kblockd_schedule_work(q, &ad->antic_work);
 
 		if (aic->ttime_samples == 0) {
 			/* process anticipated on has exited or timed out*/
@@ -831,7 +831,7 @@ static void as_completed_request(struct request_queue *q, struct request *rq)
 	}
 
 	if (ad->changed_batch && ad->nr_dispatched == 1) {
-		kblockd_schedule_work(&ad->antic_work);
+		kblockd_schedule_work(q, &ad->antic_work);
 		ad->changed_batch = 0;
 
 		if (ad->batch_data_dir == REQ_SYNC)
diff --git a/block/blk-core.c b/block/blk-core.c
index 11d79f6..34a475b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -282,7 +282,7 @@ void blk_unplug_timeout(unsigned long data)
 	blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_TIMER, NULL,
 				q->rq.count[READ] + q->rq.count[WRITE]);
 
-	kblockd_schedule_work(&q->unplug_work);
+	kblockd_schedule_work(q, &q->unplug_work);
 }
 
 void blk_unplug(struct request_queue *q)
@@ -323,7 +323,7 @@ void blk_start_queue(struct request_queue *q)
 		clear_bit(QUEUE_FLAG_REENTER, &q->queue_flags);
 	} else {
 		blk_plug_device(q);
-		kblockd_schedule_work(&q->unplug_work);
+		kblockd_schedule_work(q, ...
From: Jens Axboe
Date: Thursday, February 7, 2008 - 2:19 am

This only works for the block softirq, due to the hackish method of
the arch implementations.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 include/linux/smp.h |    5 +++++
 kernel/softirq.c    |    7 +++++++
 2 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/include/linux/smp.h b/include/linux/smp.h
index 55232cc..2b546c0 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -61,6 +61,7 @@ int smp_call_function_single(int cpuid, void (*func) (void *info), void *info,
  * Call a function on all processors
  */
 int on_each_cpu(void (*func) (void *info), void *info, int retry, int wait);
+extern void FASTCALL(raise_softirq_on_cpu(int cpu, unsigned int nr));
 
 #define MSG_ALL_BUT_SELF	0x8000	/* Assume <32768 CPU's */
 #define MSG_ALL			0x8001
@@ -112,6 +113,10 @@ static inline void smp_send_reschedule(int cpu) { }
 })
 #define smp_call_function_mask(mask, func, info, wait) \
 			(up_smp_call_function(func, info))
+#define raise_softirq_on_cpu(cpu, nr)	do {	\
+	WARN_ON(!irqs_disabled());		\
+	raise_softirq_irqoff((nr));		\
+} while (0)
 
 #endif /* !SMP */
 
diff --git a/kernel/softirq.c b/kernel/softirq.c
index d7837d4..0bad5c3 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -663,4 +663,11 @@ int on_each_cpu(void (*func) (void *info), void *info, int retry, int wait)
 	return ret;
 }
 EXPORT_SYMBOL(on_each_cpu);
+
+extern void arch_raise_softirq_on_cpu(int, unsigned int);
+void fastcall raise_softirq_on_cpu(int cpu, unsigned int nr)
+{
+	BUG_ON(nr != BLOCK_SOFTIRQ);
+	arch_raise_softirq_on_cpu(cpu, nr);
+}
 #endif
-- 
1.5.4.22.g7a20

--

From: Jens Axboe
Date: Thursday, February 7, 2008 - 2:19 am

Supports several modes:

- Force IO queue affinity to a specific mask of CPUs
- Force IO completion affinity to a specific mask of CPUs
- Force completion of a request on the same CPU that queued it

Test code so far.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 block/blk-core.c       |   80 ++++++++++++++++++++++++-------------
 block/blk-settings.c   |   47 +++++++++++++++++++++
 block/blk-softirq.c    |  105 ++++++++++++++++++++++++++++++++++++++----------
 block/blk-sysfs.c      |   85 ++++++++++++++++++++++++++++++++++++++
 include/linux/blkdev.h |    8 ++++
 5 files changed, 275 insertions(+), 50 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 34a475b..bda4623 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -103,6 +103,7 @@ void rq_init(struct request_queue *q, struct request *rq)
 	INIT_LIST_HEAD(&rq->queuelist);
 	INIT_LIST_HEAD(&rq->donelist);
 
+	rq->cpu = -1;
 	rq->errors = 0;
 	rq->bio = rq->biotail = NULL;
 	INIT_HLIST_NODE(&rq->hash);
@@ -180,6 +181,11 @@ void blk_dump_rq_flags(struct request *rq, char *msg)
 }
 EXPORT_SYMBOL(blk_dump_rq_flags);
 
+static inline int blk_is_io_cpu(struct request_queue *q)
+{
+	return cpu_isset(smp_processor_id(), q->queue_cpu);
+}
+
 /*
  * "plug" the device if there are no outstanding requests: this will
  * force the transfer to start only after we have put all the requests
@@ -200,6 +206,10 @@ void blk_plug_device(struct request_queue *q)
 		return;
 
 	if (!test_and_set_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) {
+		/*
+		 * no need to care about the io cpu here, since the
+		 * timeout handler needs to punt to kblockd anyway
+		 */
 		mod_timer(&q->unplug_timer, jiffies + q->unplug_delay);
 		blk_add_trace_generic(q, NULL, 0, BLK_TA_PLUG);
 	}
@@ -299,6 +309,22 @@ void blk_unplug(struct request_queue *q)
 }
 EXPORT_SYMBOL(blk_unplug);
 
+static void blk_invoke_request_fn(struct request_queue *q)
+{
+	/*
+	 * one level of recursion is ok and is much faster than ...
From: Alan D. Brunelle
Date: Thursday, February 7, 2008 - 8:16 am

FYI: on a kernel with this patch set, running on a 4-way ia64 (non-NUMA) w/ a FC disk, I crafted a test with 135 combinations:

o  Having the issuing application pegged on each CPU - or - left alone (run on any CPU), yields 5 possibilities
o  Having the queue affinity on each CPU, or any (-1), yields 5 possibilities
o  Having the completion affinity on each CPU, or any (-1), yields 5 possibilities

and

o  Having the issuing application pegged on each CPU - or - left alone (run on ay CPU), yields 5 possibilities
o  Having rq_affinity set to 0 or 1, yields 2 possibilities.

Each test was for 10 minutes, and ran overnight just fine. The difference amongst the 135 resulting values (based upon latency per-IO seen at the application layer) was <<1% (0.32% to be exact). This would seem to indicate that there isn't a penalty for running with this code, and it seems relatively stable given this.

The application used was doing 64KiB asynchronous direct reads, and had a minimum average per-IO latency of 42.426310 milliseconds, and average of 42.486557 milliseconds (std dev of 0.0041561), and a max of 42.561360 milliseconds

I'm going to do some runs on a 16-way NUMA box, w/ a lot of disks today, to see if we see gains in that environment.

Alan D. Brunelle
HP OSLO S&P
--

From: Jens Axboe
Date: Thursday, February 7, 2008 - 11:25 am

Hi,

Here's a variant using kernel threads only, the nasty arch bits are then
not needed. Works for me, no performance testing (that's a hint for Alan
to try and queue up some testing for this variant as well :-)

-- 
Jens Axboe

From: Alan D. Brunelle
Date: Thursday, February 7, 2008 - 1:40 pm

I'll get to that, working my way through the first batch of testing on a NUMA platform. [[If anybody has ideas on specific testing to do, that would be helpful.]] I do plan on running some AIM7 tests, as those have shown improvement in other types of affinity changes in the kernel, and some of them have "interesting" IO load characteristics.

Alan

--

From: Nick Piggin
Date: Friday, February 8, 2008 - 12:38 am

Well this stuff looks pretty nice (although I'm not sure whether the
softirq->thread changes are a good idea for performance, I guess we'll
see).

You still don't have the option that the Intel patch gave, that is,
to submit on the completer. I guess that you could do it somewhat
generically by having a cpuid in the request queue, and update that
with the completing cpu.

At least they reported it to be the most efficient scheme in their
testing, and Dave thought that migrating completions out to submitters
might be a bottleneck in some cases.

--

From: Jens Axboe
Date: Friday, February 8, 2008 - 12:47 am

Yeah, that is indeed an open question and why I have two seperate
patches for now (io-cpu-affinity branch and io-cpu-affinity-kthread
branch). As Ingo mentioned, this is how softirqs are handled in the -rt

Not sure what you mean, if setting queue_affinity doesn't accomplish it.
If you know the completing CPU to begin with, surely you can just set

More so than migrating submitters to completers? The advantage of only
movign submitters is that you get rid of the completion locking. Apart
from that, the cost should be the same, especially for the thread based
solution.

-- 
Jens Axboe

--

From: Nick Piggin
Date: Friday, February 8, 2008 - 12:53 am

True, although there are some IO workloads where -rt falls behind


Not specifically for the block layer, but higher layers like xfs.

--

From: Jens Axboe
Date: Friday, February 8, 2008 - 12:59 am

Well if you don't ask for anything, you wont get anything :-)
As I mentioned, the patch is a playing ground for trying various setups.
Everything defaults to 'do as usual', set options to setup certain test

True, but that's parallel to the initial statement - that migrating
completers is most costly than migrating submitters. So I'd like Dave to
expand on why he thinks that migrating completers it more costly than
submitters, APART from the locking associated with adding the request to
a remote CPU list.

-- 
Jens Axboe

--

From: Nick Piggin
Date: Friday, February 8, 2008 - 1:12 am

I mean if you don't know the completing CPU.

--


I still don't know quite what part of that patch you are referring to
here. If you don't have queue_affinity set, queueing a new request with
the hardware is generally done on the same CPU that just completed a
request. That is true even without any patches.

-- 
Jens Axboe

--

From: Nick Piggin
Date: Friday, February 8, 2008 - 1:33 am

Generally, but I guess not always. The database workloads in question
(which you might know very well about ;)) apparently has a lot of
queue empty and unplug conditions. Which I guess is the reason for
Intel's initial patch.

--

From: David Chinner
Date: Sunday, February 10, 2008 - 10:22 pm

What I think Nick is referring to is the comments I made that at a
higher layer (e.g. filesystems) migrating completions to the
submitter CPU may be exactly the wrong thing to do. I don't recall
making any comments on migrating submitters - I think others have
already commented on that so I'll ignore that for the moment and
try to explain why completion on submitter CPU /may/ be bad.

For example, in the case of XFS it is fine for data I/O but it is
wrong for transaction I/O completion. We want to direct all
transaction completions to as few CPUs as possible (one, ideally) so
that all the completion processing happens on the same CPU, rather
than bouncing global cachelines and locks between all the CPUs
taking completion interrupts.

In more detail, the XFS transaction subsystem is asynchronous.
We submit the transaction I/O on the CPU that creates the
transaction so the I/O can come from any CPU in the system.  If we
then farm the completion processing out to the submission CPU, that
will push it all over the machine and guarantee that we bounce all
of the XFS transaction log structures and locks all over the machine
on completion as well as submission (right now it's lots of
submission CPUs, few completion CPUs).

An example how bad this can get - this patch:

http://oss.sgi.com/archives/xfs/2007-11/msg00217.html

which prevents simultaneous access to the items tracked in the log
during transaction reservation. Having several hundred CPUs trying
to hit this list at once is really bad for performance - the test
app on the 2048p machine that saw this problem went from ~5500s
runtime down to 9s with the above patch.

I use this example because the transaction I/O completion touches
exactly the same list, locks and structures but is limited in
distribution (and therefore contention) by the number of
simultaneous I/O completion CPUs. Doing completion on the submitter
CPU will cause much wider distribution of completion processing and
introduce exactly the same issues as ...
From: Jeremy Higdon
Date: Tuesday, February 12, 2008 - 1:28 am

So what you want is all XFS processing (for a given filesystem,
presumably) on a limited set of cores (ideally 1) and all block
and SCSI processing (for a given device) on a similarly limited
set.

On Altix, that was far more important than having the interrupt
and issue CPU be close to the hardware -- at least with typical
LSI or Qlogic controllers where there are only one or two MMIO
reads per command issued, and completions can be stacked up.

There is still an advantage to being close to the hardware, but
a much bigger advantage to not bouncing cachelines.

Maybe what you want is a multistage completion mechanism where
each stage can run on a different CPU, if thread context switches
are cheaper than bouncing data structures around....

jeremy
--

Previous thread: Build failure on arch/x86/vdso/built-in.o: In function `init_vdso_vars' by Priit Laes on Thursday, February 7, 2008 - 2:01 am. (1 message)

Next thread: Re: Latest git oopses during boot by Andrew Morton on Thursday, February 7, 2008 - 3:02 am. (4 messages)