Hi, Here's a new round of patches to play with io cpu affinity. It can, as always, also be found in the block git repo. The branch name is 'io-cpu-affinity'. The major change since last post is the abandonment of the kthread approach. It was definitely slower then may 'add IPI to signal remote block softirq' hack. So I decided to base this on the scalable smp_call_function_single() that Nick posted. I tweaked it a bit to make it more suitable for my use and also faster. As for functionality, the only change is that I added a bio hint that the submitter can use to ask for completion on the same CPU that submitted the IO. Pass in BIO_CPU_AFFINE for that to occur. Otherwise the modes are the same as last time: - You can set a specific cpumask for queuing IO, and the block layer will move submitters to one of those CPUs. - You can set a specific cpumask for completion of IO, in which case the block layer will move the completion to one of those CPUs. - You can set rq_affinity mode, in which case IOs will always be completed on the CPU that submitted them. Look in /sys/block/<dev>/queue/ for the three sysfs variables that modify this behaviour. I'd be interested in getting some testing done on this, to see if it really helps the larger end of the scale. Dave, I know you have a lot of experience in this area and would appreciate your input and/or testing. I'm not sure if any of the above modes will allow you to do what you need for eg XFS - if you want all meta data IO completed on one (or a set of) CPU(s), then I can add a mode that will allow you to play with that. Or if something else, give me some input and we can take it from there! Patches are against latest -git. -- Jens Axboe --
From: Nick Piggin <npiggin@suse.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com> --- arch/x86/kernel/entry_64.S | 3 + arch/x86/kernel/i8259_64.c | 1 + arch/x86/kernel/smp_64.c | 303 +++++++++++++++++++++-------- include/asm-x86/hw_irq_64.h | 4 +- include/asm-x86/mach-default/entry_arch.h | 1 + include/linux/smp.h | 2 +- 6 files changed, 232 insertions(+), 82 deletions(-) diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S index c20c9e7..22caf56 100644 --- a/arch/x86/kernel/entry_64.S +++ b/arch/x86/kernel/entry_64.S @@ -713,6 +713,9 @@ END(invalidate_interrupt\num) ENTRY(call_function_interrupt) apicinterrupt CALL_FUNCTION_VECTOR,smp_call_function_interrupt END(call_function_interrupt) +ENTRY(call_function_single_interrupt) + apicinterrupt CALL_FUNCTION_SINGLE_VECTOR,smp_call_function_single_interrupt +END(call_function_single_interrupt) ENTRY(irq_move_cleanup_interrupt) apicinterrupt IRQ_MOVE_CLEANUP_VECTOR,smp_irq_move_cleanup_interrupt END(irq_move_cleanup_interrupt) diff --git a/arch/x86/kernel/i8259_64.c b/arch/x86/kernel/i8259_64.c index fa57a15..2b0b6d2 100644 --- a/arch/x86/kernel/i8259_64.c +++ b/arch/x86/kernel/i8259_64.c @@ -493,6 +493,7 @@ void __init native_init_IRQ(void) /* IPI for generic function call */ set_intr_gate(CALL_FUNCTION_VECTOR, call_function_interrupt); + set_intr_gate(CALL_FUNCTION_SINGLE_VECTOR, call_function_single_interrupt); /* Low priority IPI to cleanup after moving an irq */ set_intr_gate(IRQ_MOVE_CLEANUP_VECTOR, irq_move_cleanup_interrupt); diff --git a/arch/x86/kernel/smp_64.c b/arch/x86/kernel/smp_64.c index 2fd74b0..1196a12 100644 --- a/arch/x86/kernel/smp_64.c +++ b/arch/x86/kernel/smp_64.c @@ -18,6 +18,7 @@ #include <linux/kernel_stat.h> #include <linux/mc146818rtc.h> #include <linux/interrupt.h> +#include <linux/rcupdate.h> #include <asm/mtrr.h> #include ...
Why is this necessary? How is smp_call_function_single slow? --
Because it's completely serialized by the call_lock spinlock. -- Jens Axboe --
Hm, yes. Would it be possible to implement smp_call_function_mask in a
generic way to avoid that? Turn the static structure into a per-cpu
request list?
J
--
Not really. The common cases (that I can see) are either call all, or call one. In the call all case, you would have to touch every other CPU's request list, and that's not really any better than what I've done in my patchest for that. There would presumably be some cutoff where it makes more sense to queue events to the percpu IPI lists if you are only sending to a few CPUs. That would be trivial to implement, but... what are the use-cases for that? The big one that I really know of is user TLB shootdown, but that has its own vector anyway. --
Have you looked at the patches you are replying to? :-) -- Jens Axboe --
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
block/Makefile | 3 +-
block/blk-core.c | 88 -------------------------------------------
block/blk-softirq.c | 103 +++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 105 insertions(+), 89 deletions(-)
create mode 100644 block/blk-softirq.c
diff --git a/block/Makefile b/block/Makefile
index 5a43c7d..b064190 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -4,7 +4,8 @@
obj-$(CONFIG_BLOCK) := elevator.o blk-core.o blk-tag.o blk-sysfs.o \
blk-barrier.o blk-settings.o blk-ioc.o blk-map.o \
- blk-exec.o blk-merge.o ioctl.o genhd.o scsi_ioctl.o
+ blk-exec.o blk-merge.o blk-softirq.o ioctl.o genhd.o \
+ scsi_ioctl.o
obj-$(CONFIG_BLK_DEV_BSG) += bsg.o
obj-$(CONFIG_IOSCHED_NOOP) += noop-iosched.o
diff --git a/block/blk-core.c b/block/blk-core.c
index 2a438a9..46819c1 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -26,8 +26,6 @@
#include <linux/swap.h>
#include <linux/writeback.h>
#include <linux/task_io_accounting_ops.h>
-#include <linux/interrupt.h>
-#include <linux/cpu.h>
#include <linux/blktrace_api.h>
#include <linux/fault-inject.h>
@@ -50,8 +48,6 @@ struct kmem_cache *blk_requestq_cachep;
*/
static struct workqueue_struct *kblockd_workqueue;
-static DEFINE_PER_CPU(struct list_head, blk_cpu_done);
-
static void drive_stat_acct(struct request *rq, int new_io)
{
int rw = rq_data_dir(rq);
@@ -1632,82 +1628,6 @@ static int __end_that_request_first(struct request *req, int error,
}
/*
- * splice the completion data to a local structure and hand off to
- * process_completion_queue() to complete the requests
- */
-static void blk_done_softirq(struct softirq_action *h)
-{
- struct list_head *cpu_list, local_list;
-
- local_irq_disable();
- cpu_list = &__get_cpu_var(blk_cpu_done);
- list_replace_init(cpu_list, &local_list);
- local_irq_enable();
-
- while (!list_empty(&local_list)) {
- struct request *rq;
-
- rq = ...Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
include/linux/workqueue.h | 2 ++
kernel/workqueue.c | 33 ++++++++++++++++++++++++++-------
2 files changed, 28 insertions(+), 7 deletions(-)
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 542526c..fcc400b 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -181,6 +181,8 @@ extern void destroy_workqueue(struct workqueue_struct *wq);
extern int queue_work(struct workqueue_struct *wq, struct work_struct *work);
extern int queue_delayed_work(struct workqueue_struct *wq,
struct delayed_work *work, unsigned long delay);
+extern int queue_work_on_cpu(struct workqueue_struct *wq,
+ struct work_struct *work, int cpu);
extern int queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
struct delayed_work *work, unsigned long delay);
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index ff06611..6bbd7b0 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -152,25 +152,44 @@ static void __queue_work(struct cpu_workqueue_struct *cwq,
}
/**
- * queue_work - queue work on a workqueue
+ * queue_work_on_cpu - queue work on a workqueue on a specific CPU
* @wq: workqueue to use
* @work: work to queue
+ * @cpu: cpu to queue the work on
*
* Returns 0 if @work was already on a queue, non-zero otherwise.
- *
- * We queue the work to the CPU it was submitted, but there is no
- * guarantee that it will be processed by that CPU.
*/
-int queue_work(struct workqueue_struct *wq, struct work_struct *work)
+int queue_work_on_cpu(struct workqueue_struct *wq, struct work_struct *work,
+ int cpu)
{
int ret = 0;
if (!test_and_set_bit(WORK_STRUCT_PENDING, work_data_bits(work))) {
BUG_ON(!list_empty(&work->entry));
- __queue_work(wq_per_cpu(wq, get_cpu()), work);
- put_cpu();
+ __queue_work(wq_per_cpu(wq, cpu), work);
ret = 1;
}
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(queue_work_on_cpu);
+
+/**
+ * ...Preparatory patch for checking queuing affinity.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
block/as-iosched.c | 6 +++---
block/blk-core.c | 8 ++++----
block/cfq-iosched.c | 2 +-
include/linux/blkdev.h | 2 +-
4 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/block/as-iosched.c b/block/as-iosched.c
index 8c39467..6ef766f 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -450,7 +450,7 @@ static void as_antic_stop(struct as_data *ad)
del_timer(&ad->antic_timer);
ad->antic_status = ANTIC_FINISHED;
/* see as_work_handler */
- kblockd_schedule_work(&ad->antic_work);
+ kblockd_schedule_work(ad->q, &ad->antic_work);
}
}
@@ -471,7 +471,7 @@ static void as_antic_timeout(unsigned long data)
aic = ad->io_context->aic;
ad->antic_status = ANTIC_FINISHED;
- kblockd_schedule_work(&ad->antic_work);
+ kblockd_schedule_work(q, &ad->antic_work);
if (aic->ttime_samples == 0) {
/* process anticipated on has exited or timed out*/
@@ -831,7 +831,7 @@ static void as_completed_request(struct request_queue *q, struct request *rq)
}
if (ad->changed_batch && ad->nr_dispatched == 1) {
- kblockd_schedule_work(&ad->antic_work);
+ kblockd_schedule_work(q, &ad->antic_work);
ad->changed_batch = 0;
if (ad->batch_data_dir == REQ_SYNC)
diff --git a/block/blk-core.c b/block/blk-core.c
index 46819c1..ec529dc 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -299,7 +299,7 @@ void blk_unplug_timeout(unsigned long data)
blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_TIMER, NULL,
q->rq.count[READ] + q->rq.count[WRITE]);
- kblockd_schedule_work(&q->unplug_work);
+ kblockd_schedule_work(q, &q->unplug_work);
}
void blk_unplug(struct request_queue *q)
@@ -340,7 +340,7 @@ void blk_start_queue(struct request_queue *q)
clear_bit(QUEUE_FLAG_REENTER, &q->queue_flags);
} else {
blk_plug_device(q);
- kblockd_schedule_work(&q->unplug_work);
+ kblockd_schedule_work(q, ...Supports several modes:
- Force IO queue affinity to a specific mask of CPUs
- Force IO completion affinity to a specific mask of CPUs
- Force completion of a request on the same CPU that queued it
- Allow IO submitter to set BIO_CPU_AFFINE in the bio, in which case
completion will be done on the same CPU as the submitter
Test code so far, this variant being based on using the more scalable
__smp_call_function_single().
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
block/blk-core.c | 77 ++++++++++++++++++++++-------------
block/blk-settings.c | 49 ++++++++++++++++++++++-
block/blk-softirq.c | 98 +++++++++++++++++++++++++++++++--------------
block/blk-sysfs.c | 86 ++++++++++++++++++++++++++++++++++++++++
include/linux/bio.h | 1 +
include/linux/blkdev.h | 12 +++++-
include/linux/elevator.h | 8 ++--
7 files changed, 264 insertions(+), 67 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index ec529dc..8b04a15 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -110,7 +110,7 @@ EXPORT_SYMBOL(blk_get_backing_dev_info);
void rq_init(struct request_queue *q, struct request *rq)
{
INIT_LIST_HEAD(&rq->queuelist);
- INIT_LIST_HEAD(&rq->donelist);
+ rq->cpu = -1;
rq->q = q;
rq->sector = rq->hard_sector = (sector_t) -1;
rq->nr_sectors = rq->hard_nr_sectors = 0;
@@ -197,6 +197,11 @@ void blk_dump_rq_flags(struct request *rq, char *msg)
}
EXPORT_SYMBOL(blk_dump_rq_flags);
+static inline int blk_is_io_cpu(struct request_queue *q)
+{
+ return cpu_isset(smp_processor_id(), q->queue_cpu);
+}
+
/*
* "plug" the device if there are no outstanding requests: this will
* force the transfer to start only after we have put all the requests
@@ -217,6 +222,10 @@ void blk_plug_device(struct request_queue *q)
return;
if (!test_and_set_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) {
+ /*
+ * no need to care about the io cpu here, since the
+ * timeout handler needs to punt to ...Add a __smp_call_function_single() that allows passing in the
caller data to avoid an allocation.
It's OK to have interrupts disabled, as long as we don't wait for
the IPI call to finish.
Get rid of the fallback data and pass back an error instead. Callers
that don't want to handle errors must either use wait == 1, or pre-allocate
the data and use the __smp_call_function_single() variant.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
arch/x86/kernel/smp_64.c | 117 +++++++++++++++++++++++----------------------
include/linux/smp.h | 8 +++
2 files changed, 68 insertions(+), 57 deletions(-)
diff --git a/arch/x86/kernel/smp_64.c b/arch/x86/kernel/smp_64.c
index 1196a12..b1a3d3c 100644
--- a/arch/x86/kernel/smp_64.c
+++ b/arch/x86/kernel/smp_64.c
@@ -298,6 +298,8 @@ void smp_send_reschedule(int cpu)
#define CALL_WAIT 0x01
#define CALL_FALLBACK 0x02
+#define CALL_DATA_ALLOC 0x04
+
/*
* Structure and data for smp_call_function(). This is designed to minimise
* static memory requirements. It also looks cleaner.
@@ -330,23 +332,12 @@ void unlock_ipi_call_lock(void)
spin_unlock_irq(&call_lock);
}
-
-struct call_single_data {
- struct list_head list;
- void (*func) (void *info);
- void *info;
- unsigned int flags;
-};
-
struct call_single_queue {
spinlock_t lock;
struct list_head list;
};
static DEFINE_PER_CPU(struct call_single_queue, call_single_queue);
-static unsigned long call_single_fallback_used;
-static struct call_single_data call_single_data_fallback;
-
int __cpuinit init_smp_call(void)
{
int i;
@@ -416,7 +407,8 @@ int smp_call_function_mask(cpumask_t mask,
data = &call_data_fallback;
flags |= CALL_FALLBACK;
/* XXX: can IPI all to "synchronize" RCU? */
- }
+ } else
+ flags |= CALL_DATA_ALLOC;
spin_lock_init(&data->lock);
data->func = func;
@@ -446,7 +438,7 @@ int smp_call_function_mask(cpumask_t mask,
/* Wait for response */
while (data->flags)
cpu_relax();
- if ...Based on Nicks patch for x86-64, and with my tweaks thrown in.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
arch/x86/kernel/smp_32.c | 309 +++++++++++++++++++++-------
arch/x86/kernel/smpboot_32.c | 4 +
arch/x86/kernel/smpcommon_32.c | 34 ---
include/asm-x86/hw_irq_32.h | 1 +
include/asm-x86/mach-default/irq_vectors.h | 1 +
5 files changed, 242 insertions(+), 107 deletions(-)
diff --git a/arch/x86/kernel/smp_32.c b/arch/x86/kernel/smp_32.c
index dc0cde9..dec7cd3 100644
--- a/arch/x86/kernel/smp_32.c
+++ b/arch/x86/kernel/smp_32.c
@@ -476,20 +476,32 @@ static void native_smp_send_reschedule(int cpu)
send_IPI_mask(cpumask_of_cpu(cpu), RESCHEDULE_VECTOR);
}
+#define CALL_WAIT 0x01
+#define CALL_FALLBACK 0x02
+#define CALL_DATA_ALLOC 0x04
+
/*
* Structure and data for smp_call_function(). This is designed to minimise
* static memory requirements. It also looks cleaner.
*/
static DEFINE_SPINLOCK(call_lock);
-struct call_data_struct {
+struct call_data {
+ spinlock_t lock;
+ struct list_head list;
void (*func) (void *info);
void *info;
- atomic_t started;
- atomic_t finished;
- int wait;
+ unsigned int flags;
+ unsigned int refs;
+ cpumask_t cpumask;
+ struct rcu_head rcu_head;
};
+static LIST_HEAD(call_queue);
+
+static unsigned long call_fallback_used;
+static struct call_data call_data_fallback;
+
void lock_ipi_call_lock(void)
{
spin_lock_irq(&call_lock);
@@ -500,39 +512,35 @@ void unlock_ipi_call_lock(void)
spin_unlock_irq(&call_lock);
}
-static struct call_data_struct *call_data;
+struct call_single_queue {
+ spinlock_t lock;
+ struct list_head list;
+};
+static DEFINE_PER_CPU(struct call_single_queue, call_single_queue);
-static void __smp_call_function(void (*func) (void *info), void *info,
- int nonatomic, int wait)
+int __cpuinit init_smp_call(void)
{
- struct call_data_struct data;
- int cpus = num_online_cpus() - ...Subject: [PATCH] Fixed race: using potentially invalid pointer
When data->flags & CSD_FLAG_ALLOC is true, the data could be freed by the other processor before we check for CSD_FLAG_WAIT.
Also: removed old comment, doesn't quite fit anymore.
This is applied against Jens' git tree w/ the ia64 additional commit.
Signed-off-by: Alan D. Brunelle <alan.brunelle@hp.com>
---
arch/ia64/kernel/smp.c | 5 ++---
arch/x86/kernel/smp_32.c | 5 ++---
arch/x86/kernel/smp_64.c | 5 ++---
3 files changed, 6 insertions(+), 9 deletions(-)
diff --git a/arch/ia64/kernel/smp.c b/arch/ia64/kernel/smp.c
index 521bc52..ad153e2 100644
--- a/arch/ia64/kernel/smp.c
+++ b/arch/ia64/kernel/smp.c
@@ -407,8 +407,7 @@ void __smp_call_function_single(int cpu, struct call_single_data *data)
{
struct call_single_queue *dst;
unsigned long flags;
- /* prevent preemption and reschedule on another processor */
- int ipi;
+ int ipi, wait_done = data->flags & CSD_FLAG_WAIT;
/* Can deadlock when called with interrupts disabled */
WARN_ON((data->flags & CSD_FLAG_WAIT) && irqs_disabled());
@@ -424,7 +423,7 @@ void __smp_call_function_single(int cpu, struct call_single_data *data)
if (ipi)
send_IPI_single(cpu, IPI_CALL_FUNC_SINGLE);
- if (data->flags & CSD_FLAG_WAIT) {
+ if (wait_done) {
/* Wait for response */
while (data->flags)
cpu_relax();
diff --git a/arch/x86/kernel/smp_32.c b/arch/x86/kernel/smp_32.c
index dcbb89c..8239814 100644
--- a/arch/x86/kernel/smp_32.c
+++ b/arch/x86/kernel/smp_32.c
@@ -638,8 +638,7 @@ void __smp_call_function_single(int cpu, struct call_single_data *data)
cpumask_t mask = cpumask_of_cpu(cpu);
struct call_single_queue *dst;
unsigned long flags;
- /* prevent preemption and reschedule on another processor */
- int ipi;
+ int ipi, wait_done = data->flags & CSD_FLAG_WAIT;
/* Can deadlock when called with interrupts disabled */
WARN_ON((data->flags & CSD_FLAG_WAIT) && irqs_disabled());
@@ -655,7 +654,7 @@ void ...Thanks, I'll split this into 3 parts and apply them. -- Jens Axboe --
Very cool stuff. I think I can use it for cpu isolation purposes. ie Isolating a cpu from the io activity. You may have noticed that I started a bunch of discussion on CPU isolation. One thing that came out of that is the suggestion to use cpusets for managing this affinity masks. We're still discussing the details, the general idea is to provide extra flags in the cpusets that enable/disable various activities on the cpus that belong to the set. For example in this particular case we'd have something like "cpusets.io" flag that would indicate whether cpus in the set are allowed to to the IO or not. In other words: /dev/cpuset/io (cpus=0,1,2; io=1) /dev/cpuset/no-io (cpus=3,4,5; io=0) I'm not sure whether this makes sense or not. One advantage is that it's more dynamic and more flexible. If for example you add cpu to the io cpuset it will automatically start handling io requests. btw What did you mean by "to see if it really helps the larger end of the scale", what problem were you guys trying to solve ? I'm guessing cpu isolation would probably be an unexpected user of io cpu affinity :). Max --
The code posted here works on the queue level, where as you want this to be a global setting. So it'll require a bit of extra stuff to handle Nope, I didn't really consider isolation :-) It's meant to speed up IO on larger SMP systems by reducing cache line contention (or bouncing) by keeping data and/or locks local to a CPU (or a set of CPUs). -- Jens Axboe --
Your suggestion worked Jens, will do some benchmarking, and try to figure out why on the side...
Subject: [PATCH] IA64 boots with direct call of generic init single data
Need to figure out why it needs to be done earlier on ia64.
Signed-off-by: Alan D. Brunelle <alan.brunelle@hp.com>
---
arch/ia64/kernel/setup.c | 2 ++
arch/ia64/kernel/smp.c | 7 -------
2 files changed, 2 insertions(+), 7 deletions(-)
diff --git a/arch/ia64/kernel/setup.c b/arch/ia64/kernel/setup.c
index 4aa9eae..36a0fe5 100644
--- a/arch/ia64/kernel/setup.c
+++ b/arch/ia64/kernel/setup.c
@@ -518,6 +518,8 @@ setup_arch (char **cmdline_p)
acpi_boot_init();
#endif
+ generic_init_call_single_data();
+
#ifdef CONFIG_VT
if (!conswitchp) {
# if defined(CONFIG_DUMMY_CONSOLE)
diff --git a/arch/ia64/kernel/smp.c b/arch/ia64/kernel/smp.c
index d8ee005..04ba9f8 100644
--- a/arch/ia64/kernel/smp.c
+++ b/arch/ia64/kernel/smp.c
@@ -113,13 +113,6 @@ stop_this_cpu (void)
DEFINE_PER_CPU(struct call_single_queue, call_single_queue);
-int __cpuinit init_smp_call(void)
-{
- generic_init_call_single_data();
- return 0;
-}
-core_initcall(init_smp_call);
-
void
cpu_die(void)
{
--
1.5.2.5
--
Sometimes I find that the __init and friends need some include magic to actually work, even if it compiles and links just fine. If you could pick at it a bit once the benchmark is out of the way, that would be great. This is a fine work-around for now. -- Jens Axboe --
