Re: [patch] mm: fix XIP file writes

Previous thread: Re: Is BIO_RW_FAILFAST really usable? by Andrey Borzenkov on Monday, December 3, 2007 - 9:19 pm. (1 message)

Next thread: none
From: Nick Piggin
Date: Monday, December 3, 2007 - 9:26 pm

Hi,

This is my proposal for a (hopefully) backwards compatible rd driver.
The motivation for me is not pressing, except that I have this code
sitting here that is either going to rot or get merged. I'm happy to
make myself maintainer of this code, but if any real block device
driver writer would like to, that would be fine by me ;)

Comments?

--

This is a rewrite of the ramdisk block device driver.

The old one is really difficult because it effectively implements a block
device which serves data out of its own buffer cache. It relies on the dirty
bit being set, to pin its backing store in cache, however there are non
trivial paths which can clear the dirty bit (eg. try_to_free_buffers()),
which had recently lead to data corruption. And in general it is completely
wrong for a block device driver to do this.

The new one is more like a regular block device driver. It has no idea
about vm/vfs stuff. It's backing store is similar to the buffer cache
(a simple radix-tree of pages), but it doesn't know anything about page
cache (the pages in the radix tree are not pagecache pages).

There is one slight downside -- direct block device access and filesystem
metadata access goes through an extra copy and gets stored in RAM twice.
However, this downside is only slight, because the real buffercache of the
device is now reclaimable (because we're not playing crazy games with it),
so under memory intensive situations, footprint should effectively be the
same -- maybe even a slight advantage to the new driver because it can also
reclaim buffer heads.

The fact that it now goes through all the regular vm/fs paths makes it
much more useful for testing, too.

   text    data     bss     dec     hex filename
   2837     849     384    4070     fe6 drivers/block/rd.o
   3528     371      12    3911     f47 drivers/block/brd.o

Text is larger, but data and bss are smaller, making total size smaller.

A few other nice things about it:
- Similar structure and layout to the new loop ...
From: Andrew Morton
Date: Monday, December 3, 2007 - 11:29 pm

This ramdisk driver can use highmem whereas the existing one can't (yes?). 
That's a pretty major difference.  Plus look at the revoltingness in rd.c's



Why GFP_NOIO?

Have you thought about __GFP_MOVABLE treatment here?  Seems pretty

I have vague memories that radix_tree_gang_lookup()'s "naive"
implementation may return fewer items than you asked for even when there






hm, some other thread can instantiate pagecache here.  I guess it's always

Maybe create a new helper function which does
invalidate_bh_lrus()+truncate_inode_pages(), call that from kill_bdev() and
here, make invalidate_bh_lrus() static.

That's a separate patch, I guess.
--

From: Nick Piggin
Date: Tuesday, December 4, 2007 - 12:01 am

Same thing I guess -- it will use more memory in the common case, but should
actually be able to reclaim slightly more memory under pressure, so the
tiny systems guys shouldn't have too much trouble.

And we're avoiding the whole class of aliasing problems where mmap on
the old rd driver means that you're creating new mappings to your backing


Or 7th, depend on how you count ;) I always thought redefining it is a
prerequsite to getting anything merged into the block layer, so I'm too

I guess it is. Although on one hand the radix-tree uses unsigned long, but

I guess it has similar issues to rd.c -- we can't risk recursing
into the block layer here. However unlike rd.c, we _could_ easily
add some mode or ioctl to allocate the backing store upfront, with

AFAIK that only applies to pagecache but I haven't been paying much
attention to that stuff lately. It wouldn't be hard to move these

Good memory, but it's the low level leaf traversal that bales out at
boundaries. The higher level code then retries, so we should be OK

I guess the API is using size_t, so that would be the more approprite type

The brd backing store never gets any userspace aliases, so I think it should

Copying into the pagecache yes needs to flush I think. Although strictly it
also needs a write barrier to prevent these stores being passed by the store
to set PG_uptodate (but that's another story, which I think should be fixed
 

I was thinking also that perhaps the buffer cache layer could intercept
the ioctl on the way through and flush the buffer cache before going to
the device -- so device drivers don't have to have _any_ knowledge
about the buffer cache...?

Thanks for the review, I'll post an incremental patch in a sec.

--

From: Nick Piggin
Date: Tuesday, December 4, 2007 - 12:08 am

Index: linux-2.6/drivers/block/brd.c
===================================================================
--- linux-2.6.orig/drivers/block/brd.c
+++ linux-2.6/drivers/block/brd.c
@@ -56,7 +56,7 @@ struct brd_device {
  */
 static struct page *brd_lookup_page(struct brd_device *brd, sector_t sector)
 {
-	unsigned long idx;
+	pgoff_t idx;
 	struct page *page;
 
 	/*
@@ -87,13 +87,17 @@ static struct page *brd_lookup_page(stru
  */
 static struct page *brd_insert_page(struct brd_device *brd, sector_t sector)
 {
-	unsigned long idx;
+	pgoff_t idx;
 	struct page *page;
 
 	page = brd_lookup_page(brd, sector);
 	if (page)
 		return page;
 
+	/*
+	 * Must use NOIO because we don't want to recurse back into the
+	 * block or filesystem layers from page reclaim.
+	 */
 	page = alloc_page(GFP_NOIO | __GFP_HIGHMEM | __GFP_ZERO);
 	if (!page)
 		return NULL;
@@ -148,6 +152,11 @@ static void brd_free_pages(struct brd_de
 
 		pos++;
 
+		/*
+		 * This assumes radix_tree_gang_lookup always returns as
+		 * many pages as possible. If the radix-tree code changes,
+		 * so will this have to.
+		 */
 	} while (nr_pages == FREE_BATCH);
 }
 
@@ -159,7 +168,7 @@ static int copy_to_brd_setup(struct brd_
 	unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
 	size_t copy;
 
-	copy = min((unsigned long)n, PAGE_SIZE - offset);
+	copy = min_t(size_t, n, PAGE_SIZE - offset);
 	if (!brd_insert_page(brd, sector))
 		return -ENOMEM;
 	if (copy < n) {
@@ -181,7 +190,7 @@ static void copy_to_brd(struct brd_devic
 	unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
 	size_t copy;
 
-	copy = min((unsigned long)n, PAGE_SIZE - offset);
+	copy = min_t(size_t, n, PAGE_SIZE - offset);
 	page = brd_lookup_page(brd, sector);
 	BUG_ON(!page);
 
@@ -213,7 +222,7 @@ static void copy_from_brd(void *dst, str
 	unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
 	size_t copy;
 
-	copy = min((unsigned long)n, PAGE_SIZE - offset);
+	copy = ...
From: Rob Landley
Date: Tuesday, December 4, 2007 - 12:55 am

For the embedded world, initramfs has pretty much rendered initrd obsolete, 
and that was the biggest user of the ramdisk code I know of.  Beyond that, 
loopback mounts give you more flexible transient block devices than ramdisks 
do.  (In fact, ramdisks are such an amazing pain to use/size/free that if I 
really needed something like that I'd just make a loopback mount in a ramfs 
instance.)

Embedded users who still want a block interface for memory are generally 
trying to use a cramfs or squashfs image out of ROM or flash, although there 
are flash-specific filesystems for this and I dunno if they're actually 
mounting /dev/mem at an offset or something (md?  losetup -o?  Beats me, I 
haven't tried that myself yet...)

Rob
-- 
"One of my most productive days was throwing away 1000 lines of code."
  - Ken Thompson.
--

From: Nick Piggin
Date: Tuesday, December 4, 2007 - 2:29 am

They are, although we could easily hook up a couple more ioctls for them
now (if anybody is interested).

The rd driver can potentially be a _lot_ more scalable than the loop driver.
It's completely lockless in the fastpath and doesn't even dirty any
cachelines except for the actual destination memory. OK, this is only
really important for testing purposes (eg. testing scalability of a

OK, it would be interesting to hear from anyone using rd for things like
cramfs. I don't think we can get rid of the code entirely, but it sounds
like the few downsides of the new code won't be big problems.

Thanks for the feedback.
--

From: Rob Landley
Date: Tuesday, December 4, 2007 - 12:53 pm

From a usability perspective, an ioctl is no substitute for being able to 

I wouldn't say not important: The "database in RAM" people will probably love 
you, at least if they insist on using Oracle.  (Filesystem schmilesystem, 
gimme a raw block device and let me implement my own filesystem from 
userspace...)


I'd be interested in that too.  I've heard it proposed a lot but not actually 
seen anyone bother to implement it yet...

Rob
-- 
"One of my most productive days was throwing away 1000 lines of code."
  - Ken Thompson.
--

From: Christian Borntraeger
Date: Tuesday, December 4, 2007 - 2:54 am

Am Dienstag, 4. Dezember 2007 schrieb Nick Piggin:

This is just an idea, I dont know if it is worth the trouble, but have you
though about implementing direct_access for brd? That would allow 
execute-in-place (xip) on brd eliminating the extra copy.

Christian
--

From: Nick Piggin
Date: Tuesday, December 4, 2007 - 3:10 am

Actually that's a pretty good idea. It would allow xip to be tested
without special hardware as well...

I'll see what the patch looks like. Thanks

--

From: Nick Piggin
Date: Tuesday, December 4, 2007 - 4:21 am

This seems to work, although I needed another patch for ext2 too.
Never run XIP before, but I'm able to execute files out of the -oxip
mount, so I'm guessing it's working ;)

---

Support direct_access XIP method with brd.

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/drivers/block/Kconfig
===================================================================
--- linux-2.6.orig/drivers/block/Kconfig
+++ linux-2.6/drivers/block/Kconfig
@@ -346,6 +346,16 @@ config BLK_DEV_RAM_SIZE
 	  The default value is 4096 kilobytes. Only change this if you know
 	  what you are doing.
 
+config BLK_DEV_XIP
+	bool "Support XIP filesystems on RAM block device"
+	depends on BLK_DEV_RAM
+	default n
+	help
+	  Support XIP filesystems (such as ext2 with XIP support on) on
+	  top of block ram device. This will slightly enlarge the kernel, and
+	  will prevent RAM block device backing store memory from being
+	  allocated from highmem (only a problem for highmem systems).
+
 config CDROM_PKTCDVD
 	tristate "Packet writing on CD/DVD media"
 	depends on !UML
Index: linux-2.6/drivers/block/brd.c
===================================================================
--- linux-2.6.orig/drivers/block/brd.c
+++ linux-2.6/drivers/block/brd.c
@@ -89,6 +89,7 @@ static struct page *brd_insert_page(stru
 {
 	pgoff_t idx;
 	struct page *page;
+	gfp_t gfp_flags;
 
 	page = brd_lookup_page(brd, sector);
 	if (page)
@@ -97,7 +98,16 @@ static struct page *brd_insert_page(stru
 	/*
 	 * Must use NOIO because we don't want to recurse back into the
 	 * block or filesystem layers from page reclaim.
+	 *
+	 * Cannot support XIP and highmem, because our ->direct_access
+	 * routine for XIP must return memory that is always addressable.
+	 * If XIP was reworked to use pfns and kmap throughout, this
+	 * restriction might be able to be lifted.
 	 */
+	gfp_flags = GFP_NOIO | __GFP_ZERO;
+#ifndef CONFIG_BLK_DEV_XIP
+	gfp_flags |= __GFP_HIGHMEM;
+#endif
 	page = alloc_page(GFP_NOIO | ...
From: Nick Piggin
Date: Tuesday, December 4, 2007 - 4:23 am

Am I missing something here? I wonder how s390 works without this change?

--
ext2 should not worry about checking sb->s_blocksize for XIP before the
sb's blocksize actually gets set.

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/fs/ext2/super.c
===================================================================
--- linux-2.6.orig/fs/ext2/super.c
+++ linux-2.6/fs/ext2/super.c
@@ -844,8 +844,7 @@ static int ext2_fill_super(struct super_
 
 	blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size);
 
-	if ((ext2_use_xip(sb)) && ((blocksize != PAGE_SIZE) ||
-				  (sb->s_blocksize != blocksize))) {
+	if (ext2_use_xip(sb) && blocksize != PAGE_SIZE) {
 		if (!silent)
 			printk("XIP: Unsupported blocksize\n");
 		goto failed_mount;
--

From: Carsten Otte
Date: Wednesday, December 5, 2007 - 8:43 am

"blocksize" contains the blocksize of the device here, and 
sb->s_blocksize does contain the filesystem block size as saved in the 
super block. Xip does only work, if both do match PAGE_SIZE because it 
does'nt support multiple calls to direct_access in the get_xip_page 
address space operation. Thus we check both here, actually this was 
changed from how it looks after your patch as a bugfix where our 
tester tried a 4k filesystem on a 2k blockdev.
Did I miss something?
--

From: Nick Piggin
Date: Wednesday, December 5, 2007 - 4:33 pm

You mean blocksize is the size of the ext2 block, and s_blocksize is the

However, the bdev block size may be changed with sb_set_blocksize. It
doesn't actually have to match the hardware sector size -- if this
does matter for XIP, then I think you need some other check here.
--

From: Carsten Otte
Date: Thursday, December 6, 2007 - 1:43 am

Hmmmmhh. For a bdev with PAGE_SIZE hardsect size, there is no other 
valid value then PAGE_SIZE that one could set it to. Or can it indeed 
be changed to a value greater then PAGE_SIZE or smaller then hardsect 
size?

--

From: Nick Piggin
Date: Thursday, December 6, 2007 - 1:52 am

It can't be made smaller (or larger, in current kernels). But you
already get all that checking done for you -- both by checking that
the filesystem blocksize == PAGE_SIZE, and by the error checking in
sb_set_blocksize.

After my patch, we can do XIP in a hardsect size < PAGE_SIZE block
device -- this seems to be a fine thing to do at least for the
ramdisk code. Would this situation be problematic for existing drivers,
and if so, in what way?

Thanks,
Nick

--

From: Carsten Otte
Date: Thursday, December 6, 2007 - 2:59 am

I have done some archeology, and our ancient CVS logs show this check 
was introduced in early 2005 into our 2.6.x. codebase. However, it 
existed way before, and was copied from our prehistorical ext2 split 
named xip2 back in the old days of 2.4.x where we did not really have 
a block device behind because that one was scamped into the file 
system in a very queer way.
After all, I don't see any risk in removing the check. The only driver 
we have that does direct_access is drivers/s390/block/dcssblk.c, and 
that one only supports block_size == PAGE_SIZE. I think the patch 
should go into mainline.

Acked-by: Carsten Otte <cotte@de.ibm.com>

--

From: Nick Piggin
Date: Thursday, December 6, 2007 - 3:18 am

OK, thanks for taking a look at that. It will be helpful for testing

Actually another one's recently sprung up too (arch/powerpc/sysdev/axonram.c)
but it looks like that one should be fine as it looks to be all simply memory

Thanks! Andrew, would you queue this up for 2.6.25 please?
--

From: Carsten Otte
Date: Thursday, December 6, 2007 - 3:24 am

I have'nt looked at it yet. I do appreciate it, I think it might 
broaden the user-base of this feature which is up to now s390 only due 
to the fact that the flash memory extensions have not been implemented 
(yet?). And it enables testing xip on other platforms. The patch is on 
Oh I know, the author's office is aproximately 500 meters away from 
mine, next building. I have'nt heared of anyone actually using that 
one lately.

cheers,
Carsten
--

From: Rob Landley
Date: Thursday, December 6, 2007 - 11:11 am

query: which feature is currently s390 only?  (Execute In Place?)

Rob
-- 
"One of my most productive days was throwing away 1000 lines of code."
  - Ken Thompson.
--

From: Jared Hulbert
Date: Thursday, December 6, 2007 - 8:22 pm

I think so.  The filemap_xip.c functionality doesn't work for Flash
memory yet.  Flash memory doesn't have struct pages to back it up with
which this stuff depends on.
--

From: Rob Landley
Date: Thursday, December 6, 2007 - 9:17 pm

Um, trying to clarify:  S390.  Also known as zSeries, big iron machine, uses 
its own weird processor design rather than x86, x86-64, arm, or mips 
processors.

How does "struct page" enter into this?

What I want to know is, are you saying execute in place doesn't work on things 
like arm and mips?  (I so, I was unaware of this.  I heard about somebody 
getting it to work on a Nintendo DS: 
http://forums.maxconsole.net/showthread.php?t=18668 )

Rob
-- 
"One of my most productive days was throwing away 1000 lines of code."
  - Ken Thompson.
--

From: Nick Piggin
Date: Thursday, December 6, 2007 - 9:23 pm

It's just that the only device driver with XIP support for the moment is
s390 (and an obscure powerpc one).

Now with my ramdisk patch, anybody can use it. That's just using regular
RAM, but there is nothing preventing XIP backed by more exotic storage,
provided the CPU can natively address and execute from it.
 
--

From: Jared Hulbert
Date: Thursday, December 6, 2007 - 9:40 pm

Right.  filemap_xip.c allows for an XIP filesystem.  The only
filesystem that is supported is ext2.  Even that requires a block
device driver thingy, which I don't understand, that's specific to the


XIP works fine on things like arm and mips.  However there is mixed
support in the mainline kernel for it.  For example, you can build an
XiP kernel image for arm since like 2.6.10 or 12.  Also MTD has an XiP
aware mode that protects XiP objects in flash from get screwed up
during programs and erases.  But there is no mainlined solution for
XiP of applications from the filesystem.  However there have been
patches for cramfs to do this for years.  They are kind of messy and
keep getting rejected.  I do have a solution in the works for this
part of it - http://axfs.sf.net.
--

From: Carsten Otte
Date: Friday, December 7, 2007 - 1:59 am

Struct page is not the major issue. The primary problem is writing to 
the media (and I am not a flash expert at all, just relaying here): 
For some period of time, the flash memory is not usable and thus we 
need to make sure we can nuke the page table entries that we have in 
userland page tables. For that, we need a callback from the device so 
that it can ask to get its references back. Oh, and a put_xip_page 
counterpart to get_xip_page, so that the driver knows when it's safe 
to erase.

cheers,
Carsten
--

From: Jared Hulbert
Date: Friday, December 7, 2007 - 2:52 am

Well... That's the biggest/hardest problem, yes.  But not the first.
First we got to tackle the easy read only case, which doesn't require
any of that unpleasantness, yet which is used in a bunch of out of
tree hacks.
--

From: Andrew Morton
Date: Tuesday, December 4, 2007 - 4:26 am

A dubious tradeoff?
--

From: Nick Piggin
Date: Tuesday, December 4, 2007 - 4:35 am

On big highmem machines certainly. It may be somewhat useful on small
memory systems... but having the config option there is nice for a VM
developer without an s390 easily available ;)

But don't apply these XIP patches yet -- after a bit more testing I'm
seeing some data corruption, so I'll have to work out what's going
wrong with that first.
--

From: Nick Piggin
Date: Tuesday, December 4, 2007 - 6:00 am

Here we go. See, brd already found a bug ;)

Can you apply the ext2 XIP patch too? And I'll resend the brd XIP patch.

---

Writing to XIP files at a non-page-aligned offset results in data corruption
because the writes were always sent to the start of the page.

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/mm/filemap_xip.c
===================================================================
--- linux-2.6.orig/mm/filemap_xip.c
+++ linux-2.6/mm/filemap_xip.c
@@ -314,7 +314,7 @@ __xip_file_write(struct file *filp, cons
 		fault_in_pages_readable(buf, bytes);
 		kaddr = kmap_atomic(page, KM_USER0);
 		copied = bytes -
-			__copy_from_user_inatomic_nocache(kaddr, buf, bytes);
+			__copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
 		kunmap_atomic(kaddr, KM_USER0);
 		flush_dcache_page(page);
 
--

From: Christian Borntraeger
Date: Monday, December 10, 2007 - 7:38 am

I asked myself why this problem never happened before. So I asked our testers
to reproduce this problem on 2.6.23 and service levels. As the testcase did
not trigger, I looked into the 2.6.23 code. This problem was introduced by
commit 4a9e5ef1f4f15205e477817a5cefc34bd3f65f55 (mm: write iovec cleanup from
Nick Piggin) during 2.6.24-rc:
--------snip-------
-		copied = filemap_copy_from_user(page, offset, buf, bytes);
[...]
+		copied = bytes -
+			__copy_from_user_inatomic_nocache(kaddr, buf, bytes);
-------------------

So yes, its good to have xip on brd. It even tests your changes ;-)
Good news is, that we dont need anything for stable.

Christian
--

From: Nick Piggin
Date: Tuesday, December 11, 2007 - 9:03 pm

Heh ;) that explains a lot. Thanks!
--

From: Duane Griffin
Date: Tuesday, December 4, 2007 - 5:06 am

I think that should be alloc_page(gfp_flags), no?

Cheers,
Duane.

-- 
"I never could learn to drink that blood and call it wine" - Bob Dylan
--

From: Nick Piggin
Date: Tuesday, December 4, 2007 - 6:03 am

Yes. Here is a resend. Andrew, please apply this one (has passed some
testing with ext2 XIP).

--

Support direct_access XIP method with brd.

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/drivers/block/Kconfig
===================================================================
--- linux-2.6.orig/drivers/block/Kconfig
+++ linux-2.6/drivers/block/Kconfig
@@ -346,6 +346,16 @@ config BLK_DEV_RAM_SIZE
 	  The default value is 4096 kilobytes. Only change this if you know
 	  what you are doing.
 
+config BLK_DEV_XIP
+	bool "Support XIP filesystems on RAM block device"
+	depends on BLK_DEV_RAM
+	default n
+	help
+	  Support XIP filesystems (such as ext2 with XIP support on) on
+	  top of block ram device. This will slightly enlarge the kernel, and
+	  will prevent RAM block device backing store memory from being
+	  allocated from highmem (only a problem for highmem systems).
+
 config CDROM_PKTCDVD
 	tristate "Packet writing on CD/DVD media"
 	depends on !UML
Index: linux-2.6/drivers/block/brd.c
===================================================================
--- linux-2.6.orig/drivers/block/brd.c
+++ linux-2.6/drivers/block/brd.c
@@ -89,6 +89,7 @@ static struct page *brd_insert_page(stru
 {
 	pgoff_t idx;
 	struct page *page;
+	gfp_t gfp_flags;
 
 	page = brd_lookup_page(brd, sector);
 	if (page)
@@ -97,8 +98,17 @@ static struct page *brd_insert_page(stru
 	/*
 	 * Must use NOIO because we don't want to recurse back into the
 	 * block or filesystem layers from page reclaim.
+	 *
+	 * Cannot support XIP and highmem, because our ->direct_access
+	 * routine for XIP must return memory that is always addressable.
+	 * If XIP was reworked to use pfns and kmap throughout, this
+	 * restriction might be able to be lifted.
 	 */
-	page = alloc_page(GFP_NOIO | __GFP_HIGHMEM | __GFP_ZERO);
+	gfp_flags = GFP_NOIO | __GFP_ZERO;
+#ifndef CONFIG_BLK_DEV_XIP
+	gfp_flags |= __GFP_HIGHMEM;
+#endif
+	page = alloc_page(gfp_flags);
 	if (!page)
 ...
From: Matthew Wilcox
Date: Monday, January 14, 2008 - 9:47 am

You're using kmap_atomic, but I see no reason you can't be preempted.
Don't you need to at least disable preemption while you have stuff
atomically kmapped?

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--

From: Jens Axboe
Date: Monday, January 14, 2008 - 10:21 am

kmap_atomic() disables preemption through pagefault_disable().

-- 
Jens Axboe

--

Previous thread: Re: Is BIO_RW_FAILFAST really usable? by Andrey Borzenkov on Monday, December 3, 2007 - 9:19 pm. (1 message)

Next thread: none