blk_recalc_rq_segments assumes that any segments can be merged in the
case of BIOVEC_VIRT_MERGEABLE && !BIOVEC_VIRT_OVERSIZE. However, an
IOMMU can't merge segments if the total length of the segments is
larger than max_segment_size (the LLD restriction).
Due to this bug, a LLD may get the larger number of segments than
nr_hw_segments because the block layer puts more segments in a request
than it should do.
This bug could happen on alpha, parisc, and sparc, which use VMERGE.
Like blk_hw_contig_segment() does, this patch uses hw_seg_size for
simplicity, which is a bit larger than an exact value (we don't need
BIOVEC_VIRT_START_SIZE here).
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: David Miller <davem@davemloft.net>
---
block/blk-merge.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 5efc9e7..39a22f8 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -83,7 +83,8 @@ void blk_recalc_rq_segments(struct request *rq)
continue;
}
new_segment:
- if (BIOVEC_VIRT_MERGEABLE(bvprv, bv) &&
+ if (hw_seg_size + bv->bv_len <= q->max_segment_size &&
+ BIOVEC_VIRT_MERGEABLE(bvprv, bv) &&
!BIOVEC_VIRT_OVERSIZE(hw_seg_size + bv->bv_len))
hw_seg_size += bv->bv_len;
else {
--
1.5.5.GIT
--
Parisc doesn't use virtual merge accounting (there is variable for it but it's always 0). On sparc64 it is broken anyway with or without your patch. And alpha alone doesn't justify substantial code bloat in generic block layer. So I propose this patch to drop it at all. A further patch could be made to drop bi_hw_segments, nr_hw_segments and similar --- they have no use for anything except alpha and because there are few alpha computers and alpha is discontinued platform, the logic for attempting to predict number of hardware segments can't be well tested and Drop virtual merge accounting in bio layer (not the virtual merging itself). It is used only on Sparc64 (where it is broken and needs to be disabled) and on Alpha. This logic is very hard to maintain (the generic code in block/blk-merge.c tries to attempt to predict how architecture-specific IOMMU will merge the segments) and Alpha architecture alone doesn't justify the maintenance and code-bloat cost. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> --- arch/parisc/kernel/setup.c | 5 --- arch/x86/kernel/pci-dma.c | 6 --- block/blk-merge.c | 72 +++------------------------------------------ fs/bio.c | 6 +-- include/asm-alpha/io.h | 3 - include/asm-ia64/io.h | 26 +--------------- include/asm-parisc/io.h | 6 --- include/asm-powerpc/io.h | 7 ---- include/asm-sparc64/io.h | 1 include/asm-x86/io_64.h | 3 - include/linux/bio.h | 15 --------- 11 files changed, 10 insertions(+), 140 deletions(-) Index: linux-2.6.26-fast/arch/parisc/kernel/setup.c =================================================================== --- linux-2.6.26-fast.orig/arch/parisc/kernel/setup.c 2008-07-15 14:19:01.000000000 +0200 +++ linux-2.6.26-fast/arch/parisc/kernel/setup.c 2008-07-15 14:19:18.000000000 +0200 @@ -57,11 +57,6 @@ int parisc_bus_is_phys __read_mostly = 1 EXPORT_SYMBOL(parisc_bus_is_phys); #endif -/* ...
On Tue, 15 Jul 2008 09:37:05 -0400 (EDT) Hmm, really? Looks like PARISC IOMMUs (ccio-dma.c and sba_iomm.c) set Yeah, we need to modify SPARC64 IOMMU code (I'm not sure that it's Jens, what do you think about removing VMERGE code? --
Even if we fix it now, the question is: how long it will stay fixed? Until someone makes another change to struct device that restricts boundaries on some wacky hardware. --
On Tue, 15 Jul 2008 10:37:58 -0400 (EDT) I'm not sure how the boundary restriction of a device can break the VMERGE accounting. --
Because block layer code doesn't know anything about the device, pci access restrictions and so on. Someone already broken DaveM's Sparc64 merging by adding boundaries (it was broken even before, but these boundary checks made it worse). Mikulas --
On Tue, 15 Jul 2008 11:46:46 -0400 (EDT) Not true, the block layer knows about the device restrictions like DMA boundary. But it's not the point here because the boundary restriction doesn't matter for the VMERGE accounting. An IOMMU just returns an error if it can't allocate an I/O space fit for the device restrictions. Please give me an example how the boundary restriction of a device can break the VMERGE accounting and an IOMMU if you aren't still sure. --
You have dma_get_seg_boundary and dma_get_max_seg_size. On sparc64, adding one of these broken VMERGE accounting (the VMERGE didn't happen past 64-kb boundary and bio layer thought that VMERGE would be possible). And if you fix this case, someone will break it again, sooner or later, by adding new restriction. Mikulas --
On Wed, 16 Jul 2008 14:02:27 -0400 (EDT) If the device has 64KB boundary restriction, the device also has max_seg_size restriction of 64KB or under. So the vmerge acounting What is your new restriction? All restrictions that IOMMUs need to know are dma_get_seg_boundary and dma_get_max_seg_size. --
We don't know what happens in the future. And that is the problem that we don't know --- but we have two pieces of code (blk-merge and iommu) that try to calculate the same number (number of hw segments) and if they get different result, it will crash. If the calculations were done at one place, there would be no problem with that. Mikulas --
On Thu, 17 Jul 2008 07:50:24 -0400 (EDT) I don't think that your argument, 'the problem that we don't know', is true. With the vmerge accounting, we calculate at two places. So if we add a new restriction, we need to handle it at two places. It's a logical result. Of course, it's easier to calculate at one place rather than two places. But 'we don't know what restriction we will need' isn't a problem. BTW, as I've already said, I'm not against removing the vmerge accounting from the block layer. --
I have a question. Does the block layer know of the IOMMU in use for the device? can it call into the IOMMU to calculate the restriction? Thanks Boaz --
Yes and no. The parameter PCI_DMA_BUS_IS_PHYS is set if the platform doesn't have one. Nowadays, that's not enough; with VT and bypass what the system really needs to know is if the device will be using the iommu. The idea of calling into the platform iommu code was considered when all this was done, but it was rejected. Function pointer calls are incredibly expensive on most platforms that at that time had iommus. The best way was to construct a theoretical parametrisation of an iommu and get the block layer to follow that model. James --
From: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> I am also, as stated, not against this. Fujita-san, please proposage a patch so that we can put this issue behind us :-) --
Few days ago I created this.
Another task would be to remove nr_hw_segments from request, bio and queue
parameters (if this patch is accepted).
Mikulas
---
arch/parisc/kernel/setup.c | 5 ---
arch/x86/kernel/pci-dma.c | 6 ---
block/blk-merge.c | 72 +++------------------------------------------
drivers/parisc/ccio-dma.c | 2 -
drivers/parisc/sba_iommu.c | 2 -
fs/bio.c | 6 +--
include/asm-alpha/io.h | 3 -
include/asm-ia64/io.h | 26 +---------------
include/asm-parisc/io.h | 6 ---
include/asm-powerpc/io.h | 7 ----
include/asm-sparc64/io.h | 1
include/asm-x86/io_64.h | 3 -
include/linux/bio.h | 15 ---------
13 files changed, 10 insertions(+), 144 deletions(-)
Index: linux-2.6.26-fast/arch/parisc/kernel/setup.c
===================================================================
--- linux-2.6.26-fast.orig/arch/parisc/kernel/setup.c 2008-07-15 16:19:53.000000000 +0200
+++ linux-2.6.26-fast/arch/parisc/kernel/setup.c 2008-07-15 16:20:15.000000000 +0200
@@ -57,11 +57,6 @@ int parisc_bus_is_phys __read_mostly = 1
EXPORT_SYMBOL(parisc_bus_is_phys);
#endif
-/* This sets the vmerge boundary and size, it's here because it has to
- * be available on all platforms (zero means no-virtual merging) */
-unsigned long parisc_vmerge_boundary = 0;
-unsigned long parisc_vmerge_max_size = 0;
-
void __init setup_cmdline(char **cmdline_p)
{
extern unsigned int boot_args[];
Index: linux-2.6.26-fast/arch/x86/kernel/pci-dma.c
===================================================================
--- linux-2.6.26-fast.orig/arch/x86/kernel/pci-dma.c 2008-07-15 16:19:53.000000000 +0200
+++ linux-2.6.26-fast/arch/x86/kernel/pci-dma.c 2008-07-15 16:20:15.000000000 +0200
@@ -30,11 +30,6 @@ int no_iommu __read_mostly;
/* Set this to 1 if there is a HW IOMMU in the system */
int iommu_detected __read_mostly = 0;
-/* This tells the BIO block layer to assume merging. Default to off
- ...I think we've already established that the code in question is correct
and functional,
As far as I can tell, virtual merging has always been broken on ppc, so
it shouldn't enable it. It looks like at some point in history sparc
went from a working vmerge to a non working one (by copying the broken
ppc code), so the correct fix for both of these arches is simply to turn
off virtual merging.
ppc claims to turn off virtual merging, but in fact the define is
broken. Sparc should now follow ppc.
Try this patch
James
---
diff --git a/include/asm-powerpc/io.h b/include/asm-powerpc/io.h
index 8b62782..0f3212c 100644
--- a/include/asm-powerpc/io.h
+++ b/include/asm-powerpc/io.h
@@ -715,7 +715,7 @@ static inline void * phys_to_virt(unsigned long address)
* to coalesce sglists that happen to have been mapped in a contiguous
* way by the iommu
*/
-#define BIO_VMERGE_BOUNDARY 0
+#undef BIO_VMERGE_BOUNDARY
/*
* 32 bits still uses virt_to_bus() for it's implementation of DMA
diff --git a/include/asm-sparc64/io.h b/include/asm-sparc64/io.h
index 3158960..1da5642 100644
--- a/include/asm-sparc64/io.h
+++ b/include/asm-sparc64/io.h
@@ -16,7 +16,9 @@
/* BIO layer definitions. */
extern unsigned long kern_base, kern_size;
#define page_to_phys(page) (page_to_pfn(page) << PAGE_SHIFT)
-#define BIO_VMERGE_BOUNDARY 8192
+
+/* virtual merging doesn't work on sparc now */
+#undef BIO_VMERGE_BOUNDARY
static inline u8 _inb(unsigned long addr)
{
--
From: James Bottomley <James.Bottomley@HansenPartnership.com> I'd rather remove the vmerge code, it doesn't buy us anything, and for something so complex and so hard to keep working correctly it's existence is far from justified these days. --
You can ... as soon as BIO_VMERGE_BOUNDARY is undefined or set to zero, it gets compiled out of the block code. Since we're using it successfully in parisc, I don't want the block code removed, but I don't see a reason to force other architectures to use it. However, it has two use cases. One is the legacy one of making rather dumb I/O cards perform better (which is the primary on on parisc), but there is a current one making huge transfers go through SCSI using using the sg_table code. That latter is pretty vital to me since I have to keep the code working, but I don't really have any SCSI cards that can take advantage of it without virtual merging. As a slight irony, IBM is trying to persuade me that a ppc would be better than a parisc for big endian I/O testing ... so I might just be seeing if I can make virtual merging work on power too. James --
From: James Bottomley <James.Bottomley@HansenPartnership.com> All of this is gibberish, we've been over this a few times already in this thread. For a dumb I/O card, you advertise SG_ALL capabilities, the IOMMU is going to merge things as it would have anyways, and you have code in the driver to advance SG entries after each "dumb I/O". There is zero value to the vmerge code, the real gains are being realized already. --
Really? I must have missed the proposed replacement for the Not that dumb ... they just have a limited number of SG slots. We wouldn't want to run them as spoon fed PIO because that really would There is value to me in my testbed, which I can't achieve any other way (except by buying different SCSI cards). As I said, you can compile it out on sparc just fine. I wish to keep it running for parisc, so I'll maintain it. If it ever bit rots out of parisc like it has done for the other architectures, then feel free to remove it. James --
So try to #define BIO_VMERGE_BOUNDARY 0 for Pa-Risc and tell us what performance degradation do you see (and what driver do you use and what is the I/O pattern). If you show something specific, we can consider that --- but you haven't yet told us anything, except generic talk. Mikulas --
You keep ignoring inconvenient facts. For about the third time: I run a test bed for sg_tables (large chaining of requests). This runs on parisc using virtual merging (has to because the final physical table size can't go over the sg list of the SCSI card). If I turn off virtual merging I can no longer test sg_tables in vanilla kernels. James --
What sg_tables test do you mean? What does the test do? Why couldn't you run the test if BIO_VMERGE_BOUNDARY is 0? Normal I/O obviously can work with BIO_VMERGE_BOUNDARY 0, the kernel will just send more smaller requests. Mikulas --
Look, if you don't really understand what I'm doing, it's not really my job to educate you. The sg_table discussions are on marc.info, mainly on the SCSI lists; just look for 'sg chaining' in the header (need to use google site ... marc's search is bad). You can complain if the code is impacting you ... but I believe I've optimised it so it isn't. Your basic problem amounts to you not liking me doing something that has no impact on you ... I'm afraid that's what freedom leads to (shocking, I know). James --
Chaining of sg_tables is used for drivers with big sg tables --- and vmerge counting is used for drivers with small sg tables. So what do they have in common? Summary, what I mean: * in blk-merge.c, you have 85 lines, that is 16% of the size of the file, devoted to counting of hw_segments * it is only used on two architectures, one already outdated (alpha), the other being discontinued (pa-risc). On all the other architectures, hw_segments == phys_segments * it is prone to bugs and hard to maintain, because the same value must be calculated in blk-merge.c and in architectural iommu functions --- if the value differs, you create too long request, corrupt kernel memory and crash (happened on sparc64). Anyone changing blk-merge in the future will risk breaking something on the architectures that use BIO_VMERGE_BOUNDARY --- and because these architectures are so rare, the bug will go unnoticed for long time --- like in the case of sparc64. * you are just talking how this code is important for performance without showing any single proof that it really is (temporarily disable hw_segments accounting by defining BIO_VMERGE_BOUNDARY 0 and get the numbers). Mikulas --
From: Mikulas Patocka <mpatocka@redhat.com> I completely agree with this point. This VMERGE stuff is now a non-trivial maintainence burdon because anyone who wants to hack on the block layer has to be mindful of VMERGE but is very unlikely to have access to a system that it can even be tested on. And the answer isn't "James Bottomly will test your changes for you", because that simply doesn't scale. I still say we should definitely remove the VMERGE code. It's not worth the maintainence hassle just for some SG chaining test rig on some obscure platform. I really only hear one person who really wants this code around any more. Is that the Linux way? :-) Can't he patch it into his tree when he needs it or write an alternative way to stress the SG chaining code? He has the source, right? :-))) --
So you think the parametrisation in the block layer is the wrong way to approach the problem? On this argument your next patch should be removing physical merging as well because it also relies on a I'm sorry sparc broke, really I am ... but you changed your iommu code from one with a working vmerge to one without and failed to turn off vmerging. Partly it wasn't noticed because at 2*PAGE_SIZE you have a strange vmerge boundary, so it's harder to notice. However, I can't extrapolate that just because this happened on sparc it will inevitably Actually, parisc will test your code we have a PAGE_SIZE vmerge OK ... well you've had your say and so have I. The code in question is You're advocating an out of tree patch as a solution? I didn't know you'd been appointed RHEL maintainer ;-) James --
From: James Bottomley <James.Bottomley@HansenPartnership.com> The vmerge boundary on sparc64 was 8K which is equal to sparc64's base PAGE_SIZE. I don't know where you get that 2*PAGE_SIZE from. --
On Thu, 24 Jul 2008 17:49:14 -0400 (EDT) VMERGE enables you to handle a large request even with drivers with BTW, alpha IOMMU can't handle VMERGE. But the IOMMU has the code to handle VMERGE so one-line patch can fix the IOMMU. As I said before, can we leave this to Jens, keeping or removing VMERGE? Seems that I see the same arguments again and again. --
This "business based" argument should be ignored. Followed to the limit, there would only be one architecture left... Dave -- J. David Anglin dave.anglin@nrc-cnrc.gc.ca National Research Council of Canada (613) 990-0752 (FAX: 952-6602) --
From: Mikulas Patocka <mpatocka@redhat.com> I'm fine with this: Acked-by: David S. Miller <davem@davemloft.net> --
That's correct. The size and boundary depend on the type of IOMMU (ccio or sba) so the vmerge boundary parameters are set up in the iommu driver Actually, it's code I did. There are plusses and minusses to all of this. The original vmerge code was done for sparc ... mainly because the benefits of virtual merging can offset the cost of having to use the iommu. However, most architectures didn't use it. When I fixed it up to work for parisc (and introduced the parameters) we were trying to demonstrate that using it was feasible. The idea behind vmerging is that assembling and programming sg lists is expensive, so you want to do it once. Either in the iommu or in the driver sg list, but not in both. There is evidence that it saves around 7% or so on drivers. However, for architectures that can do it, better savings are made simply by lifting the iommu out of the I/O path (so called bypass mode). I suspect with IOMMUs coming back (and being unable to be bypassed) with virtualisation, virtual merging might once more become a significant value. James --
The problem is with vmerge accounting in block layer (that is what I'm proposing to remove), not with vmerge itself. Vmerge accounting has advantages only if you have device with small amount of sg slots --- it allows the block layer to create request that has higher number of segments then the device. If you have device with for example 1024 slots, the virtual merge accounting has no effect, because the any request will fit into that size. Even without virtual merge accounting, the virtual merging will happen, so there will be no performance penalty for the controller --- the controller will be programmed with exactly the same number of segments as if virtual merge accounting was present. (there could be even slight positive performance effect if you remove accounting, because you burn less CPU cycles per request) If you have device will small number of sg slots (16 or so), vmerge accounting can improve performance by creating requests with more than 16 segments --- the question is: is there any such device? And is the device performance-sensitive? (i.e. isn't it such an old hardware where no one I suppose that no one would manufacture new SCSI card with 16 or 32 sg slots these days, so the accounting of hardware segments has no effect on modern hardware. --
I don't think that's true ... otherwise parisc would be falling over This isn't really true either. A lot of devices with a high sg slot count are still less efficient than an iommu for programming. Even if they're not, on parisc we have to program the iommu, we can't bypass, so it still makes sense to only have one large sg list (in the iommu) and one small one (in the device). Having two large ones reduces Yes there is. Both the iommu and the device have to traverse large SG lists. This is where the inefficiency lies. On PA, we use exactly the same number of iotlb slots whether virtual merging is in effect or not, but the device has an internal loop to go over the list. It's that loop that virtual merging reduces. Since the virtual merge computation is in line when the request is built (by design) it doesn't really detract from the throughput and the cost It's not about accounting, it's about performance. There's a cost in every device to traversing large count sg lists. If you have to bear it in the iommu (which is usually more efficient because the iotlb tends to follow mmtlb optimisations) you can reduce the cost by eliminating it from the device. James --
You are mixing two ideas here: (1) virtual merging --- IOMMU maps discontinuous segments into continuous area that it presents to the device. (2) virtual merge accounting --- block layer tries to guess how many segments will be created by (1) and merges small requests into big ones. The resulting requests are as big that they can't be processed by the --- for these devices virtual merging (1) improves performance, but Virtual merge accounting (2) is about fitting a request. It is block layer The purpose of (1) virtual merging is to save device's sg slots. The purpose of (2) virtual merge accounting is to allow block layer to build larger requests. If you remove virtual merge accounting, it will cause no That's why I'm proposing to remove virtual merge accounting (2), but leave virtual merging (1) itself. The accounting doesn't reduce number of sg slots. --
No ... I'm not ... the virtual merge implementation requires the block layer to get this accounting right, otherwise the iommu code can end up doing the wrong thing. You're proposing to eliminate the difference between max_phys_segments Yes, but it's gains very little ... architectures that don't want it can already turn it off, and it's useful for those, like parisc, who do. James --
The virtual merge (1) can work even without accounting (2). IOMMU can Yes. Only for alpha and pa-risc, there is difference between these values. It increases maintainability of the code, reduces bloat and bugs. Mikulas --
It can, but it's not optimal ... and depends on max_phys_segments == That's not really a good reason. You can eliminate code because it's unused and unikely to be used or you redo it to better or fix it to be less buggy. You don't simply eliminate useful functionality that currently has in-tree users, however marginal you might opine those users to be. James --
So show a specific device where the virtual merge accounting is useful. (1) The device that is often used in alpha or pa-risc environments --- because the accounting is not used on other archs. (2) The device that is performance-sensitive --- not something outdated or unusual. (3) And the device that has limited sg-list size, so that generic I/O requests made by the kernel hit this limit. (if the sg-list is so big that nr_phys_segments of most requests fits into it, you don't need to count nr_hw_segments --- because nr_hw_segments < nr_phys_segments and nr_phys_segments already fits). [ the device that traverses its sg-list slowly doesn't fall into category (3), beacuse virtual merging would happen with or without nr_hw_segments accounting ] Mikulas --
