Re: [PATCH] BUG: nr_phys_segments cannot be less than nr_hw_segments

Previous thread: [GIT PULL] ALSA fixes (#2) by Takashi Iwai on Thursday, October 2, 2008 - 7:23 am. (1 message)

Next thread: [PATCH 1/5] xen: clean up label by Steven Rostedt on Thursday, October 2, 2008 - 8:05 am. (1 message)
From: Nikanth Karthikesan
Date: Thursday, October 2, 2008 - 7:29 am

This is a follow-up to my earlier mail http://lkml.org/lkml/2008/9/23/294 
([PATCH] BUG: ll_merge_requests_fn() updates req->nr_phys_segments wrongly)

It is possible for the merging code to create lesser no of phys segments than 
hw segments, but every hw segment needs atleast one new phys segment. This 
triggers the BUG() on scsi_init_sgtable() as blk_rq_map_sg() returns more no 
of segments than rq->nr_phys_segments

The following blktrace shows a sequence of bio's to trigger such condition on 
my machine with max_sectors_kb=512 & max_hw_sectors_kb=32767.

  8,0    0       12     1.943680621  2261  Q   R 6184075 + 55 [bash]
  8,0    0       13     1.943684671  2261  G   R 6184075 + 55 [bash]
  8,0    0       14     1.943690189  2261  P   N [bash]
  8,0    0       15     1.943692075  2261  I   R 6184075 + 55 [bash]
  8,0    0       16     1.943712119  2261  D   R 6184075 + 55 [bash]
  8,0    0       17     1.943718684  2261  Q   R 6184130 + 55 [bash]
  8,0    0       18     1.943721897  2261  G   R 6184130 + 55 [bash]
  8,0    0       19     1.943726576  2261  P   N [bash]
  8,0    0       20     1.943727763  2261  I   R 6184130 + 55 [bash]
  8,0    0       21     1.943731675  2261  Q   R 6184241 + 56 [bash]
  8,0    0       22     1.943735167  2261  G   R 6184241 + 56 [bash]
  8,0    0       23     1.943739497  2261  I   R 6184241 + 56 [bash]
  8,0    0       24     1.943742081  2261  Q   R 6184185 + 56 [bash]
  8,0    0       25     1.943744875  2261  M   R 6184185 + 56 [bash]
  8,0    0       26     1.943753535  2261  Q   R 6184352 + 55 [bash]
  8,0    0       27     1.943756538  2261  G   R 6184352 + 55 [bash]
  8,0    0       28     1.943760868  2261  I   R 6184352 + 55 [bash]
  8,0    0       29     1.943763522  2261  Q   R 6184407 + 55 [bash]
  8,0    0       30     1.943766316  2261  M   R 6184407 + 55 [bash]
  8,0    0       31     1.943770506  2261  Q   R 6184297 + 55 [bash]
  8,0    0       32     1.943772951  2261  F   R 6184297 + 55 [bash]
  8,0    0       33     ...
From: James Bottomley
Date: Thursday, October 2, 2008 - 8:03 am

Um, don't you mean this the other way around?  I can see this problem
occurring if the block layer gets tricked into doing a physical merge
where sector limits forbid a virtual merge.

The bug would appear to be that we sometimes only look at q->max_sectors
when deciding on mergability.  Either we have to insist on max_sectors
<= hw_max_sectors, or we have to start using min(q->max_sectors,
q->max_hw_sectors) for this.

James


--

From: Jens Axboe
Date: Thursday, October 2, 2008 - 9:58 am

q->max_sectors MUST always be <= q->max_hw_sectors, otherwise we could
be sending down requests that are too large for the device to handle. So
that condition would be a big bug. The sysfs interface checks for this,
and blk_queue_max_sectors() makes sure that is true as well.

The fixes proposed still look weird. There is no phys vs hw segment
constraints, the request must adhere to the limits set by both. It's
mostly a moot point anyway, as 2.6.28 will get rid of the hw accounting
anyway.

-- 
Jens Axboe

--

From: James Bottomley
Date: Thursday, October 2, 2008 - 10:12 am

Yes, that seems always to be enforced.  Perhaps there are other ways of
tripping this problem ... I'm still sure if it occurs it's because we do

Agree all three proposed fixes look wrong.  However, if it's what I
think, just getting rid of hw accounting won't fix the problem because
we'll still have the case where a physical merge is forbidden by iommu
constraints ... this still needs to be accounted for.

What we really need is a demonstration of what actually is going
wrong ...

James


--

From: Jens Axboe
Date: Thursday, October 2, 2008 - 10:13 am

Yep, IIRC we both asked for that the last time as well... Nikanth?

-- 
Jens Axboe

--

From: Nikanth Karthikesan
Date: Thursday, October 2, 2008 - 10:28 pm

Yes. This patch fixes it.

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
---
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 5efc9e7..9c2fe49 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -392,7 +392,11 @@ static int ll_merge_requests_fn(struct request_queue *q, 
struct request *req,
 		return 0;
 
 	total_phys_segments = req->nr_phys_segments + next->nr_phys_segments;
-	if (blk_phys_contig_segment(q, req->biotail, next->bio))
+	if (blk_phys_contig_segment(q, req->biotail, next->bio) &&
+	    BIOVEC_VIRT_MERGEABLE(__BVEC_END(req->biotail),
+					__BVEC_START(next->bio)) &&
+	    !BIOVEC_VIRT_OVERSIZE(req->biotail->bi_hw_back_size
+					+ next->bio->bi_hw_front_size))
 		total_phys_segments--;
 

Sorry, I had already sent the blktrace and some code snippet that would 
reproduce the condition. Here is the module and user-space program that can 
trigger this condition.

diskbiomod.c
---

#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/blkdev.h>
#include <linux/miscdevice.h>
/*
** Shared between the driver module and application.
*/
#define MOD_NODE        "/dev/DiskBio"
#define MOD_NAME        "DiskBio"

/*
** IOCTLs
*/
#define DB_IOCTL_DISK           0x00000001
#define DB_IOCTL_GENDISK        0x00000002

/*
** MODULE Defines.
*/
MODULE_DESCRIPTION("DiskBio - Initiate Disk I/O Traffic");
MODULE_LICENSE("GPL");


/*
** Module Defines.
*/
#define MOD_VERSION	"09-11-2008-1"
#define NUM_BIO		8
#define NUMBER_OF_VECS	16
#define START_LBA	0x5e5c8b


/*
** Module Globals.
*/
static struct gendisk	*(*get_GenDisk)(dev_t, int *)	= NULL;
static int		ErrorsOccurred			= 0;


typedef struct {
	int	IoSize;
	char	*Mem;
} DB_MEM_LIST, *pDB_MEM_LIST;

DB_MEM_LIST	dbMemList[NUM_BIO] = {
	{ 0x6e00,	NULL },
	{ 0x6e00,	NULL },
	{ 0x7000,	NULL },
	{ 0x7000,	NULL },
	{ 0x6e00,	NULL },
	{ 0x6e00,	NULL },
	{ 0x6e00,	NULL },
	{ 0x6e00,	NULL },
};


/*
** Wait ...
From: FUJITA Tomonori
Date: Monday, October 6, 2008 - 10:24 am

On Thu, 2 Oct 2008 19:13:57 +0200

Possibly, blk_phys_contig_segment might miscalculate
q->max_segment_size?

blk_phys_contig_segment does:

req->biotail->bi_size + next_req->bio->bi_size > q->max_segment_size;

But it's possible that req->biotail and the previous bio are supposed
be merged into one segment? Then we could create too large segment
here.
--

From: Jens Axboe
Date: Friday, October 10, 2008 - 5:03 am

Hmm yes, that looks like it could indeed be a problem! We could fix this
with similar logic to what we used to do for the hw merging, keep track
of the current segment size that this bio belongs to, so it would end up
ala

        if (blk_phys_contig_segment(q, req->biotail, next->bio) &&
            rq->biotail->bi_seg_size + next->bio->bi_size <= q->max_segment_size) {
                total_phys_segments--;
                next->bio->bi_seg_size = rq->biotail->bi_seg_size + next->bio->bi_size;
        }

for the merge part.

-- 
Jens Axboe

--

From: FUJITA Tomonori
Date: Friday, October 10, 2008 - 5:32 am

On Fri, 10 Oct 2008 14:03:44 +0200


Yeah, exactly.

--

From: Jens Axboe
Date: Friday, October 10, 2008 - 5:37 am

Lets fix it. It wont be part of the initial merge, since it'll need some
dedicated testing, but we can get it there for 2.6.28. Shall I interpret
your message as willingness to write up the fix? :)

-- 
Jens Axboe

--

From: FUJITA Tomonori
Date: Friday, October 10, 2008 - 5:49 am

On Fri, 10 Oct 2008 14:37:19 +0200

Yeah, it's on this weekend todo list. :) I want to look at the code
again and make sure I correctly understand the problem.
--

From: Nikanth Karthikesan
Date: Thursday, October 2, 2008 - 8:05 am

Oops, that was incomplete information to reproduce the issue.

typedef struct {
	int	IoSize;
	char	*Mem;
} DB_MEM_LIST, *pDB_MEM_LIST;

DB_MEM_LIST	dbMemList[NUM_BIO] = {
	{ 0x6e00,	NULL },
	{ 0x6e00,	NULL },
	{ 0x7000,	NULL },
	{ 0x7000,	NULL },
	{ 0x6e00,	NULL },
	{ 0x6e00,	NULL },
	{ 0x6e00,	NULL },
	{ 0x6e00,	NULL },
};

		
// Allocate Memory.
char *Mem0 = kmalloc(dbMemList[0].IoSize, GFP_KERNEL);
char *Mem1 = kmalloc(dbMemList[1].IoSize, GFP_KERNEL);
char *Mem2 = kmalloc((dbMemList[2].IoSize + dbMemList[3].IoSize + 
dbMemList[4].IoSize), GFP_KERNEL);
char *Mem5 = kmalloc(dbMemList[5].IoSize, GFP_KERNEL);
char *Mem6 = kmalloc(dbMemList[6].IoSize, GFP_KERNEL);
char *Mem7 = kmalloc(dbMemList[7].IoSize, GFP_KERNEL);

dbMemList[0].Mem = Mem0;
dbMemList[1].Mem = Mem1;
dbMemList[2].Mem = Mem2;
dbMemList[3].Mem = dbMemList[2].Mem + dbMemList[2].IoSize;
dbMemList[4].Mem = dbMemList[3].Mem + dbMemList[3].IoSize;
dbMemList[5].Mem = Mem5;
dbMemList[6].Mem = Mem6;
dbMemList[7].Mem = Mem7;


int			Loop;
struct bio		*BioList[NUM_BIO]	= { NULL };
struct bio		*Bio			= NULL;
long			Sector;


/*
** Allocate BIOs.
*/
for(Loop = 0; Loop < NUM_BIO; Loop++)
	BioList[Loop] = bio_alloc(GFP_KERNEL, NUMBER_OF_VECS);

for(Loop = 0; Loop < NUM_BIO; Loop++) {
	Bio = BioList[Loop];
	Bio->bi_sector			= Sector;
	Bio->bi_bdev			= Bdev;
	Bio->bi_end_io			= db_end_io;
	if (Loop == 0) {
		if (!bio_add_page(BioList[Loop],
				virt_to_page(dbMemList[Loop].Mem),
				0x3000,
				offset_in_page(dbMemList[Loop].Mem))) {
			printk("NIK: bio add page failed- %dA\n", Loop);
		}
		if (!bio_add_page(BioList[Loop],
				virt_to_page(dbMemList[7].Mem),
				0x3e00,
				offset_in_page(dbMemList[7].Mem))) {
			printk("NIK: bio add page failed- %dB\n", Loop);
		}
	} else if (Loop == 7) {
		if (!bio_add_page(BioList[Loop],
				virt_to_page(dbMemList[Loop].Mem),
				0x2000,
				offset_in_page(dbMemList[Loop].Mem))) {
			printk("NIK: bio add page failed- %dA\n", Loop);
		}
		if ...
Previous thread: [GIT PULL] ALSA fixes (#2) by Takashi Iwai on Thursday, October 2, 2008 - 7:23 am. (1 message)

Next thread: [PATCH 1/5] xen: clean up label by Steven Rostedt on Thursday, October 2, 2008 - 8:05 am. (1 message)