Hi,
loop.c currently uses the page cache interface to do IO to file backed
devices. This works reasonably well for simple things, like mapping an
iso9660 file for direct mount and other read-only workloads. Writing is
somewhat problematic, as anyone who has really used this feature can
attest to - it tends to confuse the vm (hello kswapd) since it break
dirty accounting and behaves very erratically on writeout. Did I mention
that it's pretty slow as well, for both reads and writes?It also behaves differently than a real drive. For writes, completions
are done once they hit page cache. Since loop queues bio's async and
hands them off to a thread, you can have a huge backlog of stuff to do.
It's hard to attempt to guarentee data safety for file systems on top of
loop without making it even slower than it currently is.Back when loop was only used for iso9660 mounting and other simple
things, this didn't matter. Now it's often used in xen (and others)
setups where we do care about performance AND writing. So the below is a
attempt at speeding up loop and making it behave like a real device.
It's a somewhat quick hack and is still missing one piece to be
complete, but I'll throw it out there for people to play with and
comment on.So how does it work? Instead of punting IO to a thread and passing it
through the page cache, we instead attempt to send the IO directly to the
filesystem block that it maps to. loop maintains a prio tree of known
extents in the file (populated lazily on demand, as needed). Advantages
of this approach:- It's fast, loop will basically work at device speed.
- It's fast, loop it doesn't put a huge amount of system load on the
system when busy. When I did comparison tests on my notebook with an
external drive, running a simple tiobench on the current in-kernel
loop with a sparse file backing rendered the notebook basically
unusable while the test was ongoing. The remapper version had no more
impact than it did when used directly on the extern...
Hello everyone,
Here is a modified version of Jens' patch. The basic idea is to push
the mapping maintenance out of loop and down into the filesystem (ext2
in this case).Two new address_space operations are added, one to map
extents and the other to provide call backs into the FS as io is
completed.Still TODO for this patch:
* Add exclusion between filling holes and readers. This is partly
implemented, when a hole is filled by the FS, the extent is flagged as
having a hole. The idea is to check this flag before trying to read
the blocks and just send back all zeros.The flag would be cleared when the blocks filling the hole have been
written.* Exclude page cache readers and writers
* Add a way for the FS to request a commit before telling the higher
layers the IO is complete. This way we can make sure metadata related
to holes is on disk before claiming the IO is really done. COW based
filesystems will also needed it.* Change loop to use fast mapping only when the new address_space op is
provided (whoops, forgot about this until just now)* A few other bits for COW, not really relevant until there
is...something COW using it.-chris
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -76,6 +76,7 @@
#include <linux/gfp.h>
#include <linux/kthread.h>
#include <linux/splice.h>
+#include <linux/extent_map.h>#include <asm/uaccess.h>
@@ -481,16 +482,51 @@ static int do_bio_filebacked(struct loop
return ret;
}+#define __lo_throttle(wq, lock, condition) \
+do { \
+ DEFINE_WAIT(__wait); \
+ for (;;) { \
+ prepare_to_wait((wq), &__wait, TASK_UNINTERRUPTIBLE); \
+ if (condition) \
+ break; \
+ spin_unlock_irq((lock)); \
+ io_schedule(); \
+ spin_lock_irq((lock)); \
+ } \
+ finish_wait((wq), &__wait); \
+} while (0) \
+
+#define lo_act_bio(bio) ((bio)->bi_bdev...
Looks pretty good. Essentially the loop side is 100% the same, it just
offloads the extent ownership to the fs (where it belongs). So I like
it. Attaching a small cleanup/fixup patch for loop, don't think it needs
further explanations.One suggestion - free_extent_map(), I would call that put_extent_map()
instead.diff -u b/drivers/block/loop.c b/drivers/block/loop.c
--- b/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -677,13 +677,14 @@
if (IS_ERR(lfe))
return -EIO;- while(!lfe) {
+ while (!lfe) {
loop_schedule_extent_mapping(lo, bio->bi_sector,
bio->bi_size, 1);
lfe = loop_lookup_extent(lo, start, GFP_ATOMIC);
if (IS_ERR(lfe))
return -EIO;
}
+
/*
* handle sparse io
*/
@@ -802,13 +803,13 @@
{
struct bio *orig_bio = bio->bi_private;
struct inode *inode = bio->bi_bdev->bd_inode;
+ struct address_space *mapping = inode->i_mapping;
u64 start = orig_bio->bi_sector << 9;
u64 len = bio->bi_size;- if (inode->i_mapping->a_ops->extent_io_complete) {
- inode->i_mapping->a_ops->extent_io_complete(inode->i_mapping,
- start, len);
- }
+ if (mapping->a_ops->extent_io_complete)
+ mapping->a_ops->extent_io_complete(mapping, start, len);
+
bio_put(bio);
bio_endio(orig_bio, err);
}
@@ -829,6 +830,7 @@
err = -ENOMEM;
goto out;
}
+
/*
* change the sector so we can find the correct file offset in our
* endio
@@ -847,7 +849,6 @@
goto out;;
}-
disk_block = em->block_start;
extent_off = start - em->start;
new_bio->bi_sector = (disk_block + extent_off) >> 9;
@@ -924,11 +925,8 @@
spin_unlock_irq(&lo->lo_lock);BUG_ON(!bio);
- if (lo_act_bio(bio))
- bio_act = 1;
- else
- bio_act = 0;+ bio_act = lo_act_bio(bio);
loop_handle_bio(lo, bio);spin_lock_irq(&lo->lo_lock);
@@ -1103,11 +1101,9 @@
return -EINVAL;/*
- * Need a working bmap. T...
I split and merged the patch into five bits (added ext3 support), so
perhaps that would be easier for people to read/review. Attached and
also exist in the loop-extent_map branch here:http://git.kernel.dk/?p=linux-2.6-block.git;a=shortlog;h=loop-extent_map
--
Jens Axboe
Seems my ext3 version doesn't work, it craps out in
ext3_get_blocks_handle() triggering this bug:J_ASSERT(handle != NULL || create == 0);
I'll see if I can fix that, being fairly fs ignorant...
--
Jens Axboe-
This works, but probably pretty suboptimal (should end the new journal
in map_io_complete()?). And yes I know the >> 9 isn't correct, since the
fs block size is larger. Just making sure that we always have enough
blocks.Punting to Chris!
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 55e677d..e97181a 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1002,11 +1002,25 @@ static struct extent_map *ext3_map_extent(struct address_space *mapping,
gfp_t gfp_mask)
{
struct extent_map_tree *tree = &EXT3_I(mapping->host)->extent_tree;
+ handle_t *handle = NULL;
+ struct extent_map *ret;- return map_extent_get_block(tree, mapping, start, len, create, gfp_mask,
+ if (create) {
+ handle = ext3_journal_start(mapping->host, len >> 9);
+ if (IS_ERR(handle))
+ return (struct extent_map *) handle;
+ }
+
+ ret = map_extent_get_block(tree, mapping, start, len, create, gfp_mask,
ext3_get_block);
+ if (handle)
+ ext3_journal_stop(handle);
+
+ return ret;
}+
+
/*
* `handle' can be NULL if create is zero
*/--
Jens Axboe-
On Tue, 15 Jan 2008 11:07:40 +0100
You can use DIO_CREDITS instead of len >> 9, just like the ext3
O_DIRECT code does. Your current patch is fine, except it breaks
data=ordered rules. My plan to work within data=ordered:1) Inside ext3_map_extent (while the transaction was running), increment
a counter in the ext3 journal for number of pending IOs. Then end the
transaction handle.2) Drop this counter inside the IO completion call
3) Change the ext3 commit code to wait for the IO count to be zero.
I'll give it a shot later this week, until then your current patch is
just data=writeback, which is good enough for testing.-chris
-
Hi Jens,
This looks really useful.
Get_block methods are pretty fast and you have caching in the level
above you, so you might be able to get away with no cache of physical
addresses at all, in which case you just need i_mutex and i_alloc_sem
at get_block time. This would save a pile of code and still have the
main benefit of avoiding double caching.If you use ->get_block instead of bmap, it will fill in file holes for
you, but of course get_block is not exposed, and Al is likely to bark
at anyone who exposes it.Instead of exposing get_block you could expose an aops method
like ->bio_transfer that would hide the use of *_get_block in a library
routine, just as __blockdev_direct_IO does. Chances are, there are
other users besides loop that would be interested in a generic way of
performing bio transfers to files.I presume you would fall back to the existing approach for any
filesystem without get_block. You could handle this transparently with
a default library method that does read/write.Regards,
Daniel
-
I'm not too fond of the tree either, but it serves an important purpose
as well - we need to be careful in calling bmap() as not to deadlock the
fs under vm pressure. So the current code punts to a thread for bmap()
on extents we don't already have stored in loop. And that slows things
down of course, we would have to still punt every IO to loopd instead of... things are already moving forward, Chris has a new interface for
this and tied it in with the loop fastfs mode. I think he'll post it
later today.--
Jens Axboe-
get_write_access()/put_write_access() will block other writers.
But as pointed out by others that is not enough for this.
I suppose you could use a white list like a special flag for file systems
(like ext2/ext3) that do not reallocate blocks.-Andi
-
Yeah, basically allowing O_RDONLY | O_DIRECT opens should be ok, but we
can't allow writes and we can't allow page cache to exist for this fileIrk, but yeah we probably need something like that for now until Chris
proposes his API addition.--
Jens Axboe-
Since you are looking for comments, I'll mention a loop-related behavior
I've been seeing and see if it gets comments or is useful, since it can
be used to tickle bad behavior on demand.I have an 6GB sparse file, which I mount with cryptoloop and populate as
an ext3 filesystem (more later on why). I then copy ~5.8GB of data to
the filesystem, which is unmounted to be burnt to a DVD. Before it's
burned the "dvdisaster" application is used to add some ECC information
to the end, and make an image which fits on a DVD-DL. Media will be
burned and distributed to multiple locations.The problem:
When copying with rsync, the copy runs at ~25MB/s for a while, then
falls into a pattern of bursts of 25MB/s followed by 10-15 sec of iowait
with no disk activity. So I tried doing the copy by cpio
find . -depth | cpio -pdm /mnt/loop
which shows exactly the same behavior. Then, for no good reason I tried
find . -depth | cpio -pBdm /mnt/loop
and the copy ran at 25MB/s for the whole data set.I was able to see similar results with a pure loop mount, I only mention
the crypto for accuracy. Because many of these have been shipped over
the last two years and new loop code would only be useful in this case--
Bill Davidsen <davidsen@tmr.com>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot
-
You told Christoph that just using direct-IO from kernel still doesn't
give you the required behaviour... What about queueing the IO directly
*and* using direct-IO? I guess it still has to go through the underlyingJust a quick question (I haven't looked closely at the code): how come
you are using a prio tree for extents? I don't think they could be
overlapping?
-
We defintively need to go through the filesystem for I/O submission,
and also for I/O completion. Thinking of the async submission might be
what Peter actually implemented for his network swapping patches as
you really wouldn't want to write it out synchronously.IMHO this shouldn't be done in the loop driver anyway. Filesystems have
their own effricient extent lookup trees (well, at least xfs and btrfs
do), and we should leverage that instead of reinventing it.
-
Completely agree, it's just needed right now for this solution since all
we have is a crappy bmap() interface to get at those mappings.--
Jens Axboe-
So let's fix the interface instead of piling crap ontop of it. As I
said I think Peter has something to start with so let's beat on it
until we have something suitable. If we aren't done by end of Feb
I'm happy to host a hackfest to get it sorted around the fs/storage
summit..-
Sure, I'm all for doing it the Right Way. I wasn't aware of anything
Peter was doing in this area, so lets please see it.It's not like opportunities to improve this haven't been around. My plan
was/is to convert to using the get_block() tricks of O_DIRECT, one could
easily argue that the work should have been done then. And perhaps
direct-io.c wouldn't be such a steaming pile of crap if it had been
done, and loop would already be fine since we could have tapped intoCount me in :)
--
Jens Axboe-
On Thu, 10 Jan 2008 08:54:59 +0000
Ok, I've been meaning to break my extent_map code up, and this is a
very good reason. I'll work up a sample today based on Jens' code.The basic goals:
* Loop (swap) calls into the FS for each mapping. Any caching happens
on the FS side.
* The FS returns an extent, filling any holesSwap would need to use an extra call early on for preallocation.
Step two is having a call back into the FS allow the FS to delay the
bios until commit completion so that COW and delalloc blocks can be
fully on disk when the bios are reported as done. Jens, can you add
some way to queue the bio completions up?-chris
-
Sure, a function to save a completed bio and a function to execute
completions on those already stored?--
Jens Axboe-
On Thu, 10 Jan 2008 14:03:24 +0100
Sounds right, I'm mostly looking for a way to aggregate a few writes to
make the commits a little larger.-chris
-
I have this patch to add swap_out/_in methods. I expect we can loosen
the requirement for swapcache pages and change the name a little.previously posted here:
http://lkml.org/lkml/2007/5/4/143---
Subject: mm: add support for non block device backed swap filesNew addres_space_operations methods are added:
int swapfile(struct address_space *, int);
int swap_out(struct file *, struct page *, struct writeback_control *);
int swap_in(struct file *, struct page *);When during sys_swapon() the swapfile() method is found and returns no error
the swapper_space.a_ops will proxy to sis->swap_file->f_mapping->a_ops, and
make use of swap_{out,in}() to write/read swapcache pages.The swapfile method will be used to communicate to the address_space that the
VM relies on it, and the address_space should take adequate measures (like
reserving memory for mempools or the like).This new interface can be used to obviate the need for ->bmap in the swapfile
code. A filesystem would need to load (and maybe even allocate) the full block
map for a file into memory and pin it there on ->swapfile(,1) so that
->swap_{out,in}() have instant access to it. It can be released on
->swapfile(,0).The reason to provide ->swap_{out,in}() over using {write,read}page() is to
1) make a distinction between swapcache and pagecache pages, and
2) to provide a struct file * for credential context (normally not needed
in the context of writepage, as the page content is normally dirtied
using either of the following interfaces:
write_{begin,end}()
{prepare,commit}_write()
page_mkwrite()
which do have the file context.Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
Documentation/filesystems/Locking | 19 ++++++++++++
Documentation/filesystems/vfs.txt | 17 +++++++++++
include/linux/buffer_head.h | 2 -
include/linux/fs.h | 8 +++++
include/linux/swap.h | 3 +
mm/Kconfig...
So this is where I don't think that's good enough, you cannot require a
full block/extent mapping of a file on setup. It can take quite some
time, a little testing I did here easily took 5 seconds for only a
couple of gigabytes. And that wasn't even worst case for that size. It
also wastes memory by populating extents that we may never read or
write.If you look at the loop addition I did, it populates lazily as needed
with some very simple logic to populate-ahead. In practice that performs
as well as a pre-populated map, the first IO to a given range will just
be a little slower since we have to bmap() it.Do you have plans to improve this area?
--
Jens Axboe-
Nope, for swap it _must_ be there, there is just no way we can do block
allocation on swapout.That said, the swapfile() interface can be used to pre-populate the
extend/block mapping, and when using swap_(in/out) without it, it can be
done lazily.-
I appreciate that fact, just saying that we could have more flexibility
I think the interface is too simple, I would much rather have a way to
dig into the file mappings and be allowed to request population and so
on. Without that, it can't be used for eg loop.--
Jens Axboe-
I'm open to suggestions :-), if you can propose an interface we both can
use I'm all ears.Although poking at the extends sounds like bmap and we were just trying
to get rid of that.Perhaps something like badvise(), where you can suggest loading/dropping
extend information on a voluntary basis.-
Because I'm really lazy - the core of this was basically first written
as a quick hack and then I always go shopping for reusable data
structures. prio trees fit the bill nicely, they described extents and
allowed loopup with a key anywhere in that extent. You are right in that
I don't need the overlap handling at all, and Chris already tried to
talk me into reusing his btrfs extent code :-)So I may just do the latter, turning it into a lib/extent-map.c in the
longer run. My first priority was just having something that worked so I
could test it. At the end of the day, not a single soul would ever
notice if the prio tree ended up being slightly slower than a custom
solution.--
Jens Axboe-
Here's the latest version of dm-loop, for comparison.
To try it out,
ln -s dmsetup dmlosetup
and supply similar basic parameters to losetup.
(using dmsetup version 1.02.11 or higher)Alasdair
From: Bryn Reeves <breeves@redhat.com>
This implements a loopback target for device mapper allowing a regular
file to be treated as a block device.Signed-off-by: Bryn Reeves <breeves@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>---
drivers/md/Kconfig | 9
drivers/md/Makefile | 1
drivers/md/dm-loop.c | 1036 +++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 1046 insertions(+)Index: linux-2.6.24-rc6/drivers/md/Kconfig
===================================================================
--- linux-2.6.24-rc6.orig/drivers/md/Kconfig 2008-01-07 18:32:05.000000000 +0000
+++ linux-2.6.24-rc6/drivers/md/Kconfig 2008-01-07 18:33:12.000000000 +0000
@@ -229,6 +229,15 @@ config DM_CRYPTIf unsure, say N.
+config DM_LOOP
+ tristate "Loop target (EXPERIMENTAL)"
+ depends on BLK_DEV_DM && EXPERIMENTAL
+ ---help---
+ This device-mapper target allows you to treat a regular file as
+ a block device.
+
+ If unsure, say N.
+
config DM_SNAPSHOT
tristate "Snapshot target (EXPERIMENTAL)"
depends on BLK_DEV_DM && EXPERIMENTAL
Index: linux-2.6.24-rc6/drivers/md/Makefile
===================================================================
--- linux-2.6.24-rc6.orig/drivers/md/Makefile 2008-01-07 18:32:05.000000000 +0000
+++ linux-2.6.24-rc6/drivers/md/Makefile 2008-01-07 18:33:12.000000000 +0000
@@ -34,6 +34,7 @@ obj-$(CONFIG_BLK_DEV_MD) += md-mod.o
obj-$(CONFIG_BLK_DEV_DM) += dm-mod.o
obj-$(CONFIG_DM_CRYPT) += dm-crypt.o
obj-$(CONFIG_DM_DELAY) += dm-delay.o
+obj-$(CONFIG_DM_LOOP) += dm-loop.o
obj-$(CONFIG_DM_MULTIPATH) += dm-multipath.o dm-round-robin.o
obj-$(CONFIG_DM_MULTIPATH_EMC) += dm-emc.o
obj-$(CONFIG_DM_MULTIPATH_HP) += dm-hp-sw.o
Index: linux-2.6.24-rc6/driv...
Why oh why does dm always insist to reinvent everything? That's bad
enough in itself, but on top of that most of the extra stuff ends up
being essentially unmaintained.If we instead improve loop, everyone wins.
Sorry to sound a bit harsh, but sometimes it doesn't hurt to think a bit
outside your own sandbox.--
Jens Axboe-
So I looked at the code - it seems you build a full extent of the blocks
in the file, filling holes as you go along. I initally did that as well,
but that is to slow to be usable in real life.You also don't support sparse files, falling back to normal fs
read/write paths. Supporting sparse files properly is a must, people
generally don't want to prealloc a huge disk backing.--
Jens Axboe-
How would you do sparse file support with passthrough loopback that
doesn't use pagecache?Holes are allocated at get_block function provided by each filesystem and
the function gets a buffer that is supposed to be in the pagecache. Now if
you want to allocate holes without pagecache, there's a problem --- new
interface to all filesystems is needed.It could be possible to use pagecache interface for filling holes and
passthrough interface for other requests --- but get_block is allowed to
move other blocks on the filesystem (and on UFS it really does), so
calling get_block to fill a hole could move other unrelated blocks which
would result in desychronized block map and corruption of both
filesystems.Mikulas
-
Please read the posted patch and the posts from Chris as well, it
basically covers everything in your email.The patch, as posted, doesn't work if the fs moves blocks around.
--
Jens Axboe-
On Thu, 10 Jan 2008 09:31:31 +0100
I don't quite get how the dm version is reinventing things. They use
the dmsetup command that they use for everything else and provide a
small and fairly clean module for bio specific loop instead of piling
it onto loop.c....Their code doesn't have the fancy hole handling that yours does, but
It is a natural fit in either place, as both loop and dm have a good
infrastructure for it. I'm not picky about where it ends up, but dm
wouldn't be a bad place.-chris
-
Things like raid, now file mapping functionality. I'm sure there are
more examples, it's how dm was always developed probably originating
back to when they developed mostly out of tree. And I think it's a bad
idea, we don't want duplicate functionality. If something is wrong withIf loop.c is a problem, I'd rather see a newloop.c (with a better name,
Well mine didn't exist 4 days ago, I was just listing missing
I know that's your opinion, I reserve the right to have my own on where
this functionality belongs :)--
Jens Axboe-
And the way this is done is simply broken. It means you have to get
rid of things like delayed or unwritten hands beforehand, it'll be
a complete pain for COW or non-block backed filesystems.The right way to do this is to allow direct I/O from kernel sources
where the filesystem is in-charge of submitting the actual I/O after
the pages are handed to it. I think Peter Zijlstra has been looking
into something like that for swap over nfs.-
COW is not that hard to handle, you just need to be notified of moving
blocks. If you view the patch as just a tighter integration between loop
and fs, I don't think it's necessarily that broken.I did consider these cases, and it can be done with the existing
That does sound like a nice approach, but a lot more work. It'll behave
differently too, the advantage of what I proposed is that it behaves
like a real device.I'm not asking you to love it (in fact I knew some people would complain
about this approach and I understand why), just tossing it out there to
get things rolling. If we end up doing it differently I don't really
care, I'm not married to any solution but merely wish to solve a
problem. If that ends up being solved differently, the outcome is the
same to me.--
Jens Axboe-
On Wed, 9 Jan 2008 10:43:21 +0100
Filling holes (delayed allocation) and COW are definitely a problem.
But at least for the loop use case, most non-cow filesystems will want
to preallocate the space for loop file and be done with it. Sparse
loop definitely has uses, but generally those users are willing to pay
a little performance.Jens' patch falls back to buffered writes for the hole case and
pretends cow doesn't exist. It's a good starting point that I hope toThe problem with O_DIRECT (or even O_SYNC) loop is that every write
into loop becomes synchronous, and it really changes the performance of
things like filemap_fdatawrite.If we just hand ownership of the file over to loop entirely and prevent
other openers (perhaps even forcing backups through the loop device),
we get fewer corner cases and much better performance.-chris
-
| Jesse Barnes | Re: PCI probing changes |
| Borislav Petkov | [PATCH] [KERNEL-DOC] kill warnings when building mandocs |
| Greg Kroah-Hartman | [PATCH 012/196] nozomi driver |
| Roland Dreier | Re: Integration of SCST in the mainstream Linux kernel |
git: | |
| Herbert Xu | Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
| Gerrit Renker | [PATCH 27/37] dccp: Integration of dynamic feature activation - part 2 (server side) |
| Linus Torvalds | Re: [GIT]: Networking |
| Frans Pop | svc: failed to register lockdv1 RPC service (errno 97). |
