login
Header Space

 
 

[patch 0/6][RFC] Cleanup FIBMAP

Previous thread: [PATCH 2/2 ] Add support LZO in mkcramfs tool by vince kim on Friday, October 26, 2007 - 7:26 pm. (4 messages)

Next thread: [PATCH 1/2 ] Add support LZO in cramfs by vince kim on Friday, October 26, 2007 - 7:25 pm. (2 messages)
To: <linux-fsdevel@...>
Cc: <linux-kernel@...>
Date: Friday, October 26, 2007 - 7:37 pm

The following series is meant to clean up FIBMAP paths with the eventual goal of allowing users to be able to FIBMAP their data.

I'm sending this as an RFC as I've only tested this on a x86_64 kernel with a 32bit binary on ext2 and I've noticed a couple ext2_warnings already.

I'm unsure of the locking in [4/6] fix_race_with_truncate.patch.  Any help here would greatly be appreciated.

The last patch, [6/6] drop_cap_sys_rawio_for_fibmap.patch, is of course, not to be applied until any remaining issues are fixed :)

Thanks,

Mike Waychison
--
-
To: Mike Waychison <mikew@...>
Cc: <linux-fsdevel@...>, <linux-kernel@...>
Date: Saturday, October 27, 2007 - 8:43 pm

Keep in mind FIBMAP is currently extremely expensive on some 
filesystems, e.g. ext3.  Therefore, additional filesystem-level work 
would have to be done in order for this not become a DoS issue.

	-hpa

-
To: Mike Waychison <mikew@...>
Cc: <linux-fsdevel@...>, <linux-kernel@...>
Date: Saturday, October 27, 2007 - 1:57 pm

Hi,

-&gt;bmap is ugly and horrible!  If you have to do this at the very least  
please cause -&gt;bmap64 to be able to return error values in case the  
file system failed to get the information or indeed such information  
does not exist as is the case for compressed and encrypted files for  
example and also for small files that are inside the on-disk inode  
(NTFS resident files and reiserfs packed tails are examples of this).

And another of my pet peeves with -&gt;bmap is that it uses 0 to mean  
"sparse" which causes a conflict on NTFS at least as block zero is  
part of the $Boot system file so it is a real, valid block...  NTFS  
uses -1 to denote sparse blocks internally.

Best regards,

	Anton



-- 
Anton Altaparmakov &lt;aia21 at cam.ac.uk&gt; (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/

-
To: Anton Altaparmakov <aia21@...>
Cc: Mike Waychison <mikew@...>, <linux-fsdevel@...>, <linux-kernel@...>
Date: Monday, October 29, 2007 - 10:10 am

On Sat, 27 Oct 2007 18:57:06 +0100

Reiserfs and Btrfs also use 0 to mean packed.  It would be nice if there
was a way to indicate your-data-is-here-but-isn't-alone.  But that's
more of a feature for the FIEMAP stuff.

-chris
-
To: Chris Mason <chris.mason@...>
Cc: Anton Altaparmakov <aia21@...>, <linux-fsdevel@...>, <linux-kernel@...>, <adilger@...>
Date: Monday, October 29, 2007 - 3:16 pm

I hadn't heard of FIEMAP, so I went back and read the thread from 
April/May.  It seems that this is a much better approach than 
introducing a FIBMAP64.

What ever happened with this proposal?

Mike Waychison
-
To: Mike Waychison <mikew@...>
Cc: Chris Mason <chris.mason@...>, Anton Altaparmakov <aia21@...>, <linux-fsdevel@...>, <linux-kernel@...>, Kalpak Shah <Kalpak.Shah@...>
Date: Monday, October 29, 2007 - 3:47 pm

We made a patch for ext4 that we need to update and push upstream.  I've
just resent the spec we used in a separate email (attached to old thread)
for reference.

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To: Chris Mason <chris.mason@...>
Cc: Anton Altaparmakov <aia21@...>, Mike Waychison <mikew@...>, <linux-fsdevel@...>, <linux-kernel@...>
Date: Monday, October 29, 2007 - 12:30 pm

And maybe we can step back and see what the callers of FIBMAP are doing
with the results they're getting.

One use is to discover the order in which to read file data that will
result in efficient IO.

If we had an interface specifically for this use case then perhaps a
sparse block would be better reported as the position of the inode
relative to other data blocks.  Maybe the inode block number in ext* land.

This use is interesting to me because it can be useful for any file
system, particularly networked file systems.

- z
-
To: Zach Brown <zab@...>
Cc: Chris Mason <chris.mason@...>, Anton Altaparmakov <aia21@...>, <linux-fsdevel@...>, <linux-kernel@...>
Date: Monday, October 29, 2007 - 3:18 pm

Can you clarify what you mean above with an example?  I don't really follow.

Thanks,

Mike Waychison


-
To: Mike Waychison <mikew@...>
Cc: Chris Mason <chris.mason@...>, Anton Altaparmakov <aia21@...>, <linux-fsdevel@...>, <linux-kernel@...>
Date: Monday, October 29, 2007 - 4:00 pm

Sure, take 'tar' as an example.  It'll read files in the order that
their names are returned from directory listing.  This can produce bad
IO patterns because the order in which the file names are returned
doesn't match the order of the file's blocks on disk.  (htree, I'm
looking at you!)

People have noticed that tar-like loads can be sped up greatly just by
sorting the files by their inode number as returned by stat(), never
mind the file blocks themselves.  One example of this is Chris Mason's
'acp'.

  http://oss.oracle.com/~mason/acp/

The logical extension of that is to use FIBMAP to find the order of file
blocks on disk and then doing IO on blocks in sorted order.  It'd take
work to write an app that does this reliably, sure.

In this use the application doesn't actually care what the absolute
numbers are.  It cares about their ordering.  File systems would be able
to chose whatever scheme they wanted for the actual values of the
results from a FIBMAP-alike as long as the sorting resulted in the right
IO patterns.

Arguing that this use is significant enough to justify an addition to
the file system API is a stretch.  I'm just sharing the observation.

- z
-
To: Zach Brown <zab@...>
Cc: Mike Waychison <mikew@...>, Chris Mason <chris.mason@...>, Anton Altaparmakov <aia21@...>, <linux-fsdevel@...>, <linux-kernel@...>
Date: Wednesday, October 31, 2007 - 7:06 am

I use FIBMAP support for a few different things.

The first is to exactly the case that you describe above where we can 
use the first block of a file extracted by FIBMAP to produce an optimal 
sorting for the read order.  My testing showed that the cost of the 
extra fibmap was not too high compared to the speedup, but it was not a 
huge gain over the speedup gained when the read was done in inode sorted 
order.

The second use case is to look at the physical layout of blocks on disk 
for a specific file, use Mark Lord's write_long patches to inject a disk 
error and then read that file to make sure that we are handling disk IO 
errors correctly.  A bit obscure, but really quite useful.

We have also used FIBMAP a few times to try and map an observed IO error 
back to a file. Really slow and painful to do, but should work on any 
file system when a better method is not supported.


ric
-
To: Ric Wheeler <ric@...>
Cc: Mike Waychison <mikew@...>, Chris Mason <chris.mason@...>, Anton Altaparmakov <aia21@...>, <linux-fsdevel@...>, <linux-kernel@...>
Date: Wednesday, October 31, 2007 - 12:16 pm

We're getting off of this FIBMAP topic, but this interests me.  Can we
explore this a little?  How did you find out about the error without
having a file to associate with it?  Drive scrubbing, or some such?

- z
-
To: Zach Brown <zab@...>
Cc: Mike Waychison <mikew@...>, Chris Mason <chris.mason@...>, Anton Altaparmakov <aia21@...>, <linux-fsdevel@...>, <linux-kernel@...>
Date: Wednesday, October 31, 2007 - 1:17 pm

It would be even better if we could poke holes in metadata, etc, but 

Vladimir extended debugreiserfs to do an optimized reverse mapping scan 
of the disk sector to file/metadata/etc. Definitely worth having that 
ability for any file system.

We also do drive scrubbing, looking for bad sectors. The list of those 
sectors is fed into the reverse mapping code to enable us to gage the 
impact of the IO errors, start recovering the user files, etc.

The scrub code we use takes advantage of the read-verify command (to 
avoid data transfer from the drive to the page cache).

ric
-
To: Mike Waychison <mikew@...>
Cc: Zach Brown <zab@...>, Anton Altaparmakov <aia21@...>, <linux-fsdevel@...>, <linux-kernel@...>
Date: Monday, October 29, 2007 - 3:46 pm

On Mon, 29 Oct 2007 12:18:22 -0700

This is a larger topic of helping userland optimize access to groups of
files.  For example, during a readdir if we knew the next step was to
delete all the files found, we could do one top of readahead (or even
ordering the returned values).

If we knew the next step would be to read all the files found, a
different type of readahead would be useful.

But, we shouldn't inflict all of this on fibmap/fiemap....we'll get
lost trying to make the one true interface for all operations.

For grouping operations on files, I think a read_tree syscall with
hints for what userland will do (read, stat, delete, list
filenames), and a better cookie than readdir should do it.

-chris
-
To: Chris Mason <chris.mason@...>
Cc: Mike Waychison <mikew@...>, Anton Altaparmakov <aia21@...>, <linux-fsdevel@...>, <linux-kernel@...>
Date: Monday, October 29, 2007 - 4:01 pm

Agreed, on both points.

- z
-
To: Anton Altaparmakov <aia21@...>
Cc: Mike Waychison <mikew@...>, <linux-fsdevel@...>, <linux-kernel@...>
Date: Saturday, October 27, 2007 - 5:45 pm

In practice, the meaning of 0 is file system [driver] dependent. For 
example in case of NTFS-3G it means that the block is sparse or the file is 
encrypted or compressed, or resident, or it's the $Boot file, or an error 
happened.

Thankfully the widely used FIBMAP users (swapon and the ever less used 
lilo) are only interested in the non-zero values and they report an error 
if the driver returns 0 for some reason. Which is perfectly ok since both 
swaping and Linux booting would fail using a sparse, encrypted, compressed, 
resident, or the NTFS $Boot file. 

But in real, both swap files and lilo work fine with NTFS if the needed 
files were created the way these softwares expect. If not then swapon or 
lilo will catch and report the file creation error.

Afair, somebody is doing (has done?) an indeed much needed, better 
alternative. Bmap is legacy, thank you Mike for maintaining it.

	Szaka

--
NTFS-3G Lead Developer:  http://ntfs-3g.org
-
Previous thread: [PATCH 2/2 ] Add support LZO in mkcramfs tool by vince kim on Friday, October 26, 2007 - 7:26 pm. (4 messages)

Next thread: [PATCH 1/2 ] Add support LZO in cramfs by vince kim on Friday, October 26, 2007 - 7:25 pm. (2 messages)
speck-geostationary