We have been looking into Linux kernel direct IO scalability issues with
database workloads. Comments and suggestions on the experiments below are
welcome.

In the Linux kernel, direct IO requests are not batched at the block layer,
i.e., as a new request comes in, it gets submitted directly to the
IO controller on the same cpu where the request originates. The IO completion
likely happens on a different cpu, the one processing interrupts. This results
in cacheline bouncing of some of the hot kernel cachelines (like timers, scsi
cmds, slab, sched, etc.) and is becoming an important scalability issue
as the number of cpus and the distance between them increase with multi-core
and NUMA.

In the case of controllers which support RIO/ZIO modes (like some qla2xxx),
the IO submission path on each cpu also checks whether there are any completed
IO commands in the response queue and triggers a softirq on the same cpu
to process the completed commands. This results in each logical cpu in the
system spending some time in softirq processing, which causes contention on
spinlocks and other data structures.

Not sure when IO controllers with multiple request/response queues will be
available in the market. In that case we can dedicate each queue pair
to a group of cpus (or a node) and be done with this problem.

In the absence of such HW today, we were looking into possible solutions for
these problems and did a couple of experiments as part of this.

In the first experiment, we removed the completed IO command processing during
IO submission. IO commands are now processed only
on the cpu receiving interrupts. This results in more interrupts
(as we are not doing any proactive processing), but we wanted to see if this is a
win over each cpu doing the softirq processing. This gave a 1.36% performance
improvement on an x86_64 MP system (16 logical cpus in total) and a 1.5%
improvement on a two node ia64 platform (2 nodes, 8 cores, 16 threads)
[please look at observation #1 below...
Hi guys,
Just had another idea for how we might do this: migrate the completions out to
the submitting CPUs rather than migrate submission into the completing
CPU.

I've got a basic patch that passes some stress testing. It seems fairly
simple to do at the block layer, and the bulk of the patch involves
introducing a scalable smp_call_function for it.

Now it could be optimised more by looking at batching up IPIs or
optimising the call function path or even migrating the completion event
at a different level...

However, this is a first cut. It actually seems like it might be taking
slightly more CPU to process block IO (~0.2%)... however, this is on my
dual core system that shares an llc, which means that there are very few
cache benefits to the migration, but non-zero overhead. So on multisocket
systems hopefully it might get to positive territory.

---
Index: linux-2.6/arch/x86/kernel/smp_64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/smp_64.c
+++ linux-2.6/arch/x86/kernel/smp_64.c
@@ -321,6 +321,99 @@ void unlock_ipi_call_lock(void)
spin_unlock_irq(&call_lock);
}

+struct call_single_data {
+	struct list_head list;
+	void (*func) (void *info);
+	void *info;
+	int wait;
+};
+
+struct call_single_queue {
+	spinlock_t lock;
+	struct list_head list;
+};
+static DEFINE_PER_CPU(struct call_single_queue, call_single_queue);
+
+int __cpuinit init_smp_call(void)
+{
+	int i;
+
+	for_each_cpu_mask(i, cpu_possible_map) {
+		spin_lock_init(&per_cpu(call_single_queue, i).lock);
+		INIT_LIST_HEAD(&per_cpu(call_single_queue, i).list);
+	}
+	return 0;
+}
+core_initcall(init_smp_call);
+
+/*
+ * this function sends a 'generic call function' IPI to all other CPU
+ * of the system defined in the mask.
+ */
+int smp_call_function_fast(int cpu, void (*func)(void *), void *info,
+				int wait)
+{
+	struct call_single_data *data;
+	struct call_single_queue *dst = &per_cpu(call_single_queue, cp...
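
[Editorial sketch, not part of the patch.] The quoted patch is cut off
above. As a rough userspace illustration of the structure it introduces (a
per-CPU queue of call entries protected by a lock), something like the
following may help. Everything here is an assumption made for illustration:
NR_CPUS, queues_init(), queue_call(), run_queued_calls() and
count_completion() are hypothetical names, a singly linked list stands in
for list_head, the lock is a plain C11 atomic_flag spinlock, and the IPI is
only represented by a return value.

```c
#include <stdatomic.h>
#include <stddef.h>

#define NR_CPUS 4 /* hypothetical CPU count for this sketch */

/* One queued call; modelled on the patch's struct, without the wait flag. */
struct call_single_data {
	struct call_single_data *next; /* stands in for struct list_head */
	void (*func)(void *info);
	void *info;
};

struct call_single_queue {
	atomic_flag lock;
	struct call_single_data *head, *tail;
};

static struct call_single_queue queues[NR_CPUS];

static void queue_lock(struct call_single_queue *q)
{
	while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
		; /* spin until the holder clears the flag */
}

static void queue_unlock(struct call_single_queue *q)
{
	atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/* Rough counterpart of init_smp_call(): unlocked, empty queue per CPU. */
static void queues_init(void)
{
	for (int i = 0; i < NR_CPUS; i++) {
		atomic_flag_clear(&queues[i].lock);
		queues[i].head = queues[i].tail = NULL;
	}
}

/*
 * Submit side: append func(info) to the target CPU's queue.
 * Returns 1 if the queue was empty (a real implementation would send an
 * IPI here), 0 if a drain is already pending, -1 for an invalid CPU.
 */
static int queue_call(int cpu, struct call_single_data *data,
		      void (*func)(void *), void *info)
{
	struct call_single_queue *q;
	int was_empty;

	if (cpu < 0 || cpu >= NR_CPUS)
		return -1;
	q = &queues[cpu];
	data->next = NULL;
	data->func = func;
	data->info = info;

	queue_lock(q);
	was_empty = (q->head == NULL);
	if (was_empty)
		q->head = data;
	else
		q->tail->next = data;
	q->tail = data;
	queue_unlock(q);
	return was_empty;
}

/*
 * Target side: detach the whole list under the lock, then run the
 * callbacks without holding it. Returns the number of calls run.
 */
static int run_queued_calls(int cpu)
{
	struct call_single_queue *q = &queues[cpu];
	struct call_single_data *d;
	int ran = 0;

	queue_lock(q);
	d = q->head;
	q->head = q->tail = NULL;
	queue_unlock(q);

	while (d) {
		struct call_single_data *next = d->next; /* func may reuse d */
		d->func(d->info);
		d = next;
		ran++;
	}
	return ran;
}

/* Example callback: counts completions through an int pointer. */
static void count_completion(void *info)
{
	(*(int *)info)++;
}
```

The point of the shape is that the submitter only touches the target CPU's
queue (one lock, one list), and the target runs the callbacks outside the
lock so submitters are not held up by completion work.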
Hi Nick, This was the first experiment I tried on a quad core four
package SMP platform. And it didn't show much improvement in my
prototype (my prototype was migrating the softirq to the kblockd context
of the submitting CPU).

In the OLTP workload, quite a bit of activity happens below the block layer
and by the time we come to softirq, some damage is done in
slab, scsi cmds, timers etc. Last year's OLS paper
(http://ols.108.redhat.com/2007/Reprints/gough-Reprint.pdf)
shows different cache lines that are contended in the kernel for the
OLTP workload.

Softirq migration should at least reduce the cacheline contention that
happens in sched and AIO layers. I didn't spend much time on why my softirq
migration patch didn't help much (as I was after the bigger bird of migrating
IO submission to the completion CPU at that time). If this solution has
fewer side-effects and is easily acceptable, then we can analyze the softirq
migration patch further and find out the potential.

While there is some potential with the softirq migration, the full potential
can be exploited by making the IO submission and completion happen on the
same CPU.

thanks,
suresh
--
I think you could do that lockless if you use a similar data structure
as netchannels (essentially a fixed size single buffer queue with atomic
exchange of the first/last pointers) and not using a list. That would avoid
at least one bounce for the lock and likely another one for the list
manipulation.

Also the right way would be to not add a second mechanism for this,
but fix the standard smp_call_function_single() to support it.

-Andi
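
[Editorial sketch.] To make the lockless suggestion concrete, here is a
minimal userspace example in C11 atomics. Note that it is NOT the
netchannel-style fixed-size queue described above; it swaps in a simpler
and well-known alternative, a lock-free push list where producers push with
compare-and-swap and the consumer takes the whole list with one atomic
exchange. All names (call_node, lockless_queue, lockless_push,
lockless_drain, record_id) are hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>

struct call_node {
	struct call_node *next;
	void (*func)(void *info);
	void *info;
};

struct lockless_queue {
	_Atomic(struct call_node *) head; /* most recently pushed node */
};

/*
 * Producer side (any CPU): push with a CAS loop, no lock taken.
 * Returns 1 if the list was empty, i.e. the consumer needs a kick.
 */
static int lockless_push(struct lockless_queue *q, struct call_node *n)
{
	struct call_node *old =
		atomic_load_explicit(&q->head, memory_order_relaxed);

	do {
		n->next = old; /* old is refreshed by a failed CAS */
	} while (!atomic_compare_exchange_weak_explicit(&q->head, &old, n,
							memory_order_release,
							memory_order_relaxed));
	return old == NULL;
}

/*
 * Consumer side (the owning CPU): grab everything with one atomic
 * exchange, reverse the LIFO chain into submission order, run it.
 * Returns the number of callbacks run.
 */
static int lockless_drain(struct lockless_queue *q)
{
	struct call_node *n = atomic_exchange_explicit(
		&q->head, (struct call_node *)NULL, memory_order_acquire);
	struct call_node *fifo = NULL;
	int ran = 0;

	while (n) { /* reverse: newest-first becomes oldest-first */
		struct call_node *next = n->next;
		n->next = fifo;
		fifo = n;
		n = next;
	}
	while (fifo) {
		struct call_node *next = fifo->next; /* func may reuse node */
		fifo->func(fifo->info);
		fifo = next;
		ran++;
	}
	return ran;
}

/* Example callback: logs the id it was given, to show ordering. */
static int order_log[8];
static int order_len;

static void record_id(void *info)
{
	if (order_len < 8)
		order_log[order_len++] = *(int *)info;
}
```

Compared with a lock plus list, the producer here dirties a single
cacheline (the head pointer) and the consumer takes the whole batch in one
operation, which is the kind of bounce reduction the suggestion is after.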
--
That's pretty funny, I did pretty much the exact same thing last week!
The primary difference between yours and mine is that I used a more
private interface to signal a softirq raise on another CPU, instead of
allocating call data and exposing a generic interface. That put the
locking in blk-core instead, turning blk_cpu_done into a structure with
a lock and list_head instead of just being a list head, and intercepted
at blk_complete_request() time instead of waiting for an already raised
softirq on that CPU.

Didn't get around to any performance testing yet, though. Will try and
clean it up a bit and do that.

--
Jens Axboe
--
Yeah I was looking at that... didn't really want to add the spinlock
overhead to the non-migration case. Anyway, I guess those sorts of
fine implementation details are going to have to be sorted out with
results.
--
As Andi mentions, we can look into making that lockless. For the initial
implementation I didn't really care, just wanted something to play with
that would nicely allow me to control both the submit and complete side
of the affinity issue.

--
Jens Axboe
--
Sorry, late to the party ... it went to my steeleye address, not my
current one.

Could you try re-running the tests with a low queue depth (say around 8)
and the card interrupt bound to a single CPU?

The reason for asking you to do this is that it should emulate almost
precisely what you're looking for: The submit path will be picked up in
the SCSI softirq where the queue gets run, so you should find that all
submit and returns happen on a single CPU, so everything gets cache hot
there.

James

p.s. if everyone could also update my email address to the
hansenpartnership one, the people at steeleye who monitor my old email
account would be grateful.
--
Hi Nick,
When Matthew was describing this work at an LCA presentation (not
sure whether you were at that presentation or not), Zach came up
with the idea that allowing the submitting application to control the
CPU on which the io completion processing occurs would be a good
approach to try. That is, we submit a "completion cookie" with the
bio that indicates where we want completion to run, rather than
dictating that completion runs on the submission CPU.

The reasoning is that only the higher level context really knows
what is optimal, and that changes from application to application.
The "complete on the submission CPU" policy _may_ be more optimal
for database workloads, but it is definitely suboptimal for XFS and
transaction I/O completion handling because it simply drags a bunch
of global filesystem state around between all the CPUs running
completions. In that case, we really only want a single CPU to be
handling the completions.....

(Zach - please correct me if I've missed anything)

Looking at your patch - if you turn it around so that the
"submission CPU" field can be specified as the "completion cpu" then
I think the patch will expose the policy knobs needed to do the
above. Add the bio -> rq linkage to enable filesystems and DIO to
control the completion CPU field and we're almost done.... ;)

Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
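
[Editorial sketch.] One way to picture the "completion cookie" idea is as a
small policy value carried with the request and consulted when the
completion interrupt arrives. The sketch below is only an illustration
under assumed names: completion_policy, completion_cookie and
pick_completion_cpu() are hypothetical and do not come from any patch in
this thread.

```c
/* Hypothetical policies a submitter could attach to an I/O. */
enum completion_policy {
	COMPLETE_ON_IRQ_CPU,    /* leave completion where the interrupt lands */
	COMPLETE_ON_SUBMIT_CPU, /* migrate completion back to the submitter */
	COMPLETE_ON_FIXED_CPU   /* funnel completions to one chosen CPU */
};

/* Hypothetical cookie that would travel with the bio/request. */
struct completion_cookie {
	enum completion_policy policy;
	int submit_cpu; /* CPU that issued the I/O */
	int fixed_cpu;  /* only meaningful for COMPLETE_ON_FIXED_CPU */
};

/* Decide where completion processing should run for this request. */
static int pick_completion_cpu(const struct completion_cookie *c, int irq_cpu)
{
	switch (c->policy) {
	case COMPLETE_ON_SUBMIT_CPU:
		return c->submit_cpu;
	case COMPLETE_ON_FIXED_CPU:
		return c->fixed_cpu;
	case COMPLETE_ON_IRQ_CPU:
	default:
		return irq_cpu;
	}
}
```

Under this framing a database doing direct I/O might pick
COMPLETE_ON_SUBMIT_CPU, while a filesystem with global completion state
might pick COMPLETE_ON_FIXED_CPU so that a single CPU handles its
completions.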
--
Yeah, I think Nick's patch (and Jens' approach, presumably) is just the
sort of thing we were hoping for when discussing this during Matthew's talk.

I was imagining the patch a little bit differently (per-cpu tasks, do a
wake_up from the driver instead of cpu nr testing up in blk, work

Yeah, that seems pretty straightforward.

We might need some logic for noticing that the desired cpu has been
hot-plugged away while the IO was in flight, it occurs to me.

- z
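
[Editorial sketch.] The hot-plug concern could be handled by validating the
requested CPU at completion time and falling back to the CPU that took the
interrupt. The helper below is a hypothetical userspace illustration: a
64-bit mask stands in for the kernel's online CPU map, and
resolve_completion_cpu() is an invented name.

```c
#include <stdint.h>

/*
 * Return wanted_cpu if it is still online according to online_mask
 * (bit n set means CPU n is online), otherwise fall back to irq_cpu,
 * the CPU that is handling the completion interrupt right now.
 */
static int resolve_completion_cpu(int wanted_cpu, int irq_cpu,
				  uint64_t online_mask)
{
	if (wanted_cpu >= 0 && wanted_cpu < 64 &&
	    (online_mask & (UINT64_C(1) << wanted_cpu)))
		return wanted_cpu;
	return irq_cpu;
}
```

Falling back to the interrupt CPU is just one possible choice; it has the
property that the completion is never queued to a CPU that cannot run it.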
--
per-cpu tasks/wq's might be better, it's a little awkward to jump
the softirq completion stuff already handles cpus going away, at least
with my patch that stuff works fine (with a dead flag added).

--
Jens Axboe
--
one caveat btw; when the multiqueue storage hw becomes available for Linux,
we need to figure out how to deal with the preference thing; since there
honoring a "non-logical" preference would be quite expensive (it means
you can't make the local submit queues lockless etc etc), so before we go down
the road of having widespread APIs for this stuff.. we need to make sure we're
not going to do something that's going to be really stupid 6 to 18 months down the road.
--
As far as I'm concerned, so far this is just playing around with
affinity (and to some extent taking it too far, on purpose). For
instance, my current patch can move submissions and completions
independently, with a set mask or by 'binding' a request to a CPU. Most
of that doesn't make sense. 'complete on the same CPU, if possible'
makes sense and would fit fine with multi-queue hw.

Moving submissions at the block layer to a defined set of CPUs is a bit
silly imho, it's pretty costly and it's a lot more sane to simply bind the
submitters instead. So if you can set irq affinity, then just make the
submitters follow that.

--
Jens Axboe
--
well.. kinda. One of the really hard parts of the submit/completion stuff is that
the slab/slob/slub/slib allocator ends up basically "cycling" memory through the system;
there's a sink of free memory on all the submission cpus and a source of free memory
on the completion cpu. I don't think applications are capable of working out what is
best in this scenario..
--
Applications as in "anything that calls submit_bio()", i.e. direct I/O,
filesystems, etc.; not userspace but in-kernel applications.

In XFS, simultaneous io completion on multiple CPUs can contribute greatly to
contention of global structures in XFS. By controlling where completions are
delivered, we can greatly reduce this contention, especially on large,
multipathed devices that deliver interrupts to multiple CPUs that may be far
distant from each other. We have all the state and intelligence necessary
to control this sort of policy decision effectively.....

Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
Hi Dave,
Thanks for taking a look at the patch... yes it would be easy to turn
this bit of state into a more flexible cookie (eg. complete on submitter;
complete on interrupt; complete on CPUx/nodex etc.). Maybe we'll need
something that complex... I'm not sure, it would probably need more
fine tuning. That said, I just wanted to get this approach out there
early for rfc.

I guess both you and Arjan have points. For a _lot_ of things, completing
on the same CPU as submitter (whether that is migrating submission as in
the original patch in the thread, or migrating completion like I do).

You get better behaviour in the slab and page allocators and locality
and cache hotness of memory. For example, I guess in a filesystem /
pagecache heavy workload, you have to touch each struct page, buffer head,
fs private state, and also often have to wake the thread for completion.
Much of this data has just been touched at submit time, so doing this on
the same CPU is nice...

I'm surprised that the xfs global state bouncing would outweigh the
bouncing of all the per-page/block/bio/request/etc data that gets touched
during completion. We'll see.
--
per-page/block/bio/request/etc is local to a single I/O. The only
penalty is a cacheline bounce for each of the structures from one
CPU to another. That is, there is no global state modified by these
completions.

The real issue is metadata. The transaction log I/O completion
funnels through a state machine protected by a single lock, which
means completions on different CPUs pull that lock to all
completion CPUs. Given that the same lock is used during transaction
completion for other state transitions (in task context, not intr),
the more cpus are actively touching it at once, the worse the problem gets.

Then there's metadata I/O completion, which funnels through a larger
set of global locks in the transaction subsystem (e.g. the active
item list lock, the log reservation locks, the log state lock, etc)
which once again means the more CPUs we have delivering I/O
completions, the worse the problem gets.

Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
Yeah, but it is going from _all_ submitting CPUs to the one completing
CPU. So you could bottleneck the interconnect at the completing CPU
just as much as if you had cachelines being pulled the other way (ie.

OK, once you add locking (and not simply cacheline contention), then
the problem gets harder I agree. But I think that if the submitting
side takes the same locks as log completion (eg. maybe for starting a
new transaction), then it is not going to be a clear win either way,
and you'd have to measure it in the end.
--
Hi Nick,
Why do we need smp_mb() here (maybe add a comment to keep
Andrew/checkpatch happy)?

Pekka
--
Yeah, definitely... it's just a really basic RFC, but I should get
into the habit of just doing it anyway.

Thanks,
Nick
--
This was on an SMP system? These issues are much more pronounced on a NUMA
Yes. The issue is even worse if the submission comes from a remote node.
For example, say we have a system with a scsi controller on node 2, with I/O
submission on node 1 and completion on node 2. In that case the
cacheline has to be transferred across the NUMA interlink.

However, you cannot avoid running the completion on the node where the
device sits. The device has all sorts of control structures and if you
would handle the completion on node 1 then it would have to transfer lots
of cachelines that contain device state to node 1.

I think it is better to leave things as is. Or have the I/O submission be

I think that is the right approach. This will also help in cases where I/O
devices can only be accessed from a certain node (NUMA device address

Right.
-
If the device is capable of multi queues, then some of the control structures,
In the absence of specialized controllers, it is best to keep the control
So any suggestions for making this clean and acceptable to everyone?
thanks,
suresh
-
It is obviously a good idea to hand over the IO at the point which
requires the least number of cachelines to be moved, and I think doing
it in the block layer is right. Mostly you have to convince the block
and driver maintainers I guess.

The scheduler really should be made interrupt-load aware anyway, so I
don't have a problem with changing that; or scheduling kblockd at a
higher priority, but I don't know if SCHED_FIFO is a good idea. Couldn't
it be done in a softirq instead?

Latency for IO migration could be the most difficult problem to solve
really. You don't give many details of the workload, profiles, etc... I
hope this is for a real world test? Can the locking be improved in simpler
ways first?

Just some random questions...

It looks like the main source of cacheline bouncing you're eliminating

Why is the queue allowed to go empty in the first place in an IO critical
workload?

Are you loading up each CPU with as many disks as it can possibly handle
plus a few more? If so, is that realistic? (I honestly don't know).

You say that you'd like to do this for direct IO only, but if it is more
efficient, why not for buffered IO as well? (or is it not more efficient
for buffered IO? if not, why?)

AFAIKS, you'd still have significant queue_lock contention from other
CPUs inserting requests into the list? What IO scheduler are you using?
I assume noop... as a crazy experiment, what happens if you create per-cpu
request queues?
-
Yes, softirq context is one way. But just didn't want to penalize the running
task by taking away some of its cpu time. With CFS micro accounting, perhaps
we can track irq, softirq time and avoid penalizing the running task's cpu

Improvement numbers quoted are from the OLTP database workload. We can look
This workload is using direct IO and there is no batching at the block layer
There is 3-4% iowait time in the system. So the cpu's are not 100% busy,
It is applicable for both direct IO and buffered IO. But the implementations
will differ. For example in buffered IO, we can setup in such a way that the

Correct. We have more potential to explore. Current implementation
or in other words, each kblockd thread catering multiple request queues
(perhaps one for each cpu or one for group of cpu's).

softirq context and each kblockd thread handling multiple request queues will
lead to further improvements.

thanks,
suresh
-
But you "penalize" the running task in the completion handler as well
anyway. Doing this with a SCHED_FIFO task is sort of like doing interrupt

So you aren't putting concurrent requests into the queue? Sounds like
It would be nice to be doing that anyway. But unplug via request submission
-
Yes.
Ingo, in general with CFS micro accounting, we should be able to avoid
I am not recommending SCHED_FIFO. I will take a look at softirq
Nick remember that there are hundreds of disks in this setup and at
Ok. Currently the patch handles both direct and buffered IO. While making
improvements to this patch I will make sure that both the paths take
advantage of this.

thanks,
suresh
-
Well if there are 2 requests per disk, that's a good thing; you won't
need to unplug. If there is only 1, then as well as the plugging cost,
the hardware loses some ability to pipeline things effectively.

I'm not saying the kernel shouldn't be improved in the latter case, but
if you're looking for performance, it is nice to ensure you have at
least 2 requests. Presumably you're using some pretty well tuned db
software though, so I guess this is not always possible.

Do you have stats for these things (queue empty vs not empty events,
Sounds good!
-
