> On Tue, Oct 30, 2007 at 03:16:06PM +1100, Neil Brown wrote:
>> On Tuesday October 30, gnb@sgi.com wrote:
>>> Of course snapshot cow elements may be part of more generic element
>>> trees. In general there may be more than one consumer of block usage
>>> hints in a given filesystem's element tree, and their locations in that
>>> tree are not predictable. This means the block extents mentioned in
>>> the usage hints need to be subject to the block mapping algorithms
>>> provided by the element tree. As those algorithms are currently
>>> implemented using bio mapping and splitting, the easiest and simplest
>>> way to reuse those algorithms is to add new bio flags.
>> So are you imagining that you might have distinct snapshotable
>> elements, and that some of these might be combined by e.g. RAID0 into
>> a larger device, then a filesystem is created on that?
>
> I was thinking more a concatenation than a stripe, but yes you could
> do such a thing, e.g. to parallelise the COW procedure. We don't do
> any such thing in our product; the COW element is always inserted at
> the top of the logical element tree.
>
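To make the mapping requirement concrete, here is a rough userspace sketch of how a hint's block extent might be remapped and split by a concatenation element, in the same way a real bio would be. The names struct concat, struct extent and concat_map_extent() are made up for illustration and are not XVM or kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One piece of a mapped extent, in the address space of a child device. */
struct extent {
    int dev;              /* index of the child device */
    uint64_t sector;      /* start sector within that child */
    uint64_t nr_sectors;  /* length of this piece */
};

/* Hypothetical concatenation element: children laid end to end. */
struct concat {
    size_t nr_devs;
    const uint64_t *dev_sectors;  /* size of each child, in sectors */
};

/*
 * Map [sector, sector + nr) from the concat's address space onto its
 * children, splitting at child boundaries.  Returns the number of
 * pieces written to out, or -1 if the extent runs past the end of the
 * concat or out[] is too small.
 */
static int concat_map_extent(const struct concat *c, uint64_t sector,
                             uint64_t nr, struct extent *out, size_t max_out)
{
    size_t n = 0;
    uint64_t base = 0;  /* first sector of child i in concat space */

    for (size_t i = 0; i < c->nr_devs && nr > 0; i++) {
        uint64_t len = c->dev_sectors[i];
        uint64_t end = base + len;

        if (sector < end) {
            uint64_t off = sector - base;
            uint64_t piece = len - off;

            if (piece > nr)
                piece = nr;
            if (n == max_out)
                return -1;
            out[n].dev = (int)i;
            out[n].sector = off;
            out[n].nr_sectors = piece;
            n++;
            sector += piece;
            nr -= piece;
        }
        base = end;
    }
    return nr == 0 ? (int)n : -1;
}
```

A hint that straddles two children comes out as two pieces, one per child, so a COW element sitting below the concat would only ever see extents in its own address space.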
>> I ask because my first thought was that the sort of communication you
>> want seems like it would be just between a filesystem and the block
>> device that it talks directly to, and as you are particularly
>> interested in XFS and XVM, you could come up with whatever protocol
>> you want for those two to talk to each other, prototype it, iron out
>> all the issues, then say "We've got this really cool thing to make
>> snapshots much faster - wanna share?" and thus be presenting from a
>> position of more strength (the old 'code talks' mantra).
>
> Indeed, code talks ;-) I was hoping someone else would do that
> talking for me, though.
>
>>> First we need a mechanism to indicate that a bio is a hint rather
>>> than a real IO. Perhaps the easiest way is to add a new flag to
>>> the bi_rw field:
>>>
>>> #define BIO_RW_HINT 5 /* bio is a hint not a real io; no pages */
>> Reminds me of the new approach to issue_flush_fn which is just to have
>> a zero-length barrier bio (is that implemented yet? I lost track).
>> But different as a zero length barrier has zero length, and your hints
>> have a very meaningful length.
>
> Yes.
>
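As a rough illustration of the difference from a zero-length barrier, a hint bio might be modelled in userspace C as below. Only the bit number 5 comes from the proposal; struct hint_bio, bio_is_hint() and bio_make_hint() are hypothetical names, and the struct is a stand-in for the handful of struct bio fields that matter here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bit number proposed in the thread; everything else is hypothetical. */
#define BIO_RW_HINT 5

/* Userspace stand-in for the struct bio fields a hint would use. */
struct hint_bio {
    uint64_t bi_sector;   /* first sector of the extent */
    uint32_t bi_size;     /* extent length in bytes, meaningful for hints */
    unsigned long bi_rw;  /* flag word, indexed by bit number */
    size_t bi_vcnt;       /* number of attached pages, zero for a hint */
};

/* True if the bio carries the hint flag and so has no data pages. */
static bool bio_is_hint(const struct hint_bio *bio)
{
    return (bio->bi_rw & (1UL << BIO_RW_HINT)) != 0;
}

/* Build a hint covering 'bytes' bytes starting at 'sector'. */
static struct hint_bio bio_make_hint(uint64_t sector, uint32_t bytes)
{
    struct hint_bio bio = {
        .bi_sector = sector,
        .bi_size = bytes,
        .bi_rw = 1UL << BIO_RW_HINT,
        .bi_vcnt = 0,
    };
    return bio;
}
```

The point of the sketch is only that the extent (bi_sector and bi_size) stays fully populated while the page vector is empty, so mapping and splitting code can treat a hint like any other bio.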
>>> Next we'll need three bio hint types with the following semantics.
>>>
>>> BIO_HINT_ALLOCATE
>>> The bio's block extent will soon be written by the filesystem
>>> and any COW that may be necessary to achieve that should begin
>>> now. If the COW is going to fail, the bio should fail. Note
>>> that this provides a way for the filesystem to manage when and
>>> how failures to COW are reported.
>> Would it make sense to allow the bi_sector to be changed by the device
>> and to have that change honoured?
>> i.e. "Please allocate 128 blocks, maybe 'here'"
>> "OK, 128 blocks allocated, but they are actually over 'there'".
>
> That wasn't the expectation at all. Perhaps "allocate" is a poor
> name. "I have just allocated, deal with it" might be more appropriate.
> Perhaps BIO_HINT_WILLUSE or something.
>
>> If the device is tracking what space is and isn't used, it might make
>> life easier for it to do the allocation. Maybe even have a variant
>> "Allocate 128 blocks, I don't care where".
>
> That kind of thing might perhaps be useful for flash, but I think
> current filesystems would have conniptions.
>
>> Is this bio supposed to block until the copy has happened? Or only
>> until the space for the copy has been allocated and possibly committed?
>
> The latter. The writes following will block until the COW has
> completed, or might be performed sufficiently later that the COW
> has meanwhile completed (I think this implies an extra state in the
> snapshot metadata to avoid double-COWing). The point of the hint is
> to allow the snapshot code to test for running out of repo space and
> report that failure at a time when the filesystem is able to handle
> it gracefully.
>
>> Or must it return without doing any IO at all?
>
> I would expect it would be a useful optimisation to start the IO but
> not wait for its completion, but that the first implementation would
> just do a space check.
>
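A minimal sketch of the "just do a space check" first implementation, including the extra per-chunk state needed to avoid double-COWing, might look like the following. It is a userspace model under assumed names: struct snap_repo, enum chunk_state, snap_hint_willuse() and snap_write() are hypothetical and do not correspond to any existing snapshot code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* RESERVED is the extra state: space is set aside but not yet copied. */
enum chunk_state { CHUNK_SHARED, CHUNK_RESERVED, CHUNK_COWED };

/* Hypothetical snapshot repository, tracked in whole chunks. */
struct snap_repo {
    size_t free_chunks;       /* unallocated repo space, in chunks */
    size_t nr_chunks;         /* chunks in the origin volume */
    enum chunk_state *state;  /* one entry per origin chunk */
};

/*
 * Handle the hint: reserve repo space for every still-shared chunk in
 * the range, or fail with -ENOSPC and change nothing.  No IO is done.
 */
static int snap_hint_willuse(struct snap_repo *r, size_t first, size_t count)
{
    size_t needed = 0;

    if (first > r->nr_chunks || count > r->nr_chunks - first)
        return -EINVAL;
    for (size_t i = first; i < first + count; i++)
        if (r->state[i] == CHUNK_SHARED)
            needed++;
    if (needed > r->free_chunks)
        return -ENOSPC;
    for (size_t i = first; i < first + count; i++)
        if (r->state[i] == CHUNK_SHARED)
            r->state[i] = CHUNK_RESERVED;
    r->free_chunks -= needed;
    return 0;
}

/* A later write: a reserved chunk is copied without a new allocation. */
static int snap_write(struct snap_repo *r, size_t chunk)
{
    if (chunk >= r->nr_chunks)
        return -EINVAL;
    if (r->state[chunk] == CHUNK_SHARED) {
        if (r->free_chunks == 0)
            return -ENOSPC;
        r->free_chunks--;
    }
    r->state[chunk] = CHUNK_COWED;
    return 0;
}
```

With this split the out-of-space error surfaces from the hint, where the filesystem can still back out, and a write into a reserved range cannot fail for lack of repo space.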
>>> BIO_HINT_RELEASE
>>> The bio's block extent is no longer in use by the filesystem
>>> and will not be read in the future. Any storage used to back
>>> the extent may be released without any threat to filesystem
>>> or data integrity.
>> If the allocation unit of the storage device (e.g. a few MB) does not
>> match the allocation unit of the filesystem (e.g. a few KB) then for
>> this to be useful either the storage device must start recording tiny
>> allocations, or the filesystem should re-release areas as they grow.
>> i.e. when releasing a range of a device, look in the filesystem's usage
>> records for the largest surrounding free space, and release all of that.
>
> Good point. I was planning on ignoring this problem :-/ Given that
> current snapshot implementations waste *all* the blocks in deleted
> files, it would be an improvement to scavenge the blocks in large
> extents. This is especially true for XFS which goes to some effort
> to achieve large linear extents.
>
>> Would this be a burden on the filesystems?
>
> I think so. I would hope the hints could be done in a way which
> minimises the impact on filesystems, so that it would be easier to roll
> out. That implies pushing the responsibility for being smart about
> combining partial deallocations down to the block device/snapshot code.
> Any comments, Roger?
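
One way the block device side could combine partial deallocations is sketched below: remember which filesystem blocks in each chunk have been released and free the chunk only once all of them have. This is a userspace model with invented names (struct release_map, release_blocks()), and the ratio of 8 filesystem blocks per chunk is an arbitrary assumption chosen to keep the bitmask in one byte.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed ratio of filesystem blocks to one storage chunk. */
#define BLOCKS_PER_CHUNK 8
#define CHUNK_FULL_MASK  0xffu

/* Hypothetical per-device record of partially released chunks. */
struct release_map {
    size_t nr_chunks;
    unsigned char *released;  /* per chunk: bit b set = block b released */
    bool *chunk_freed;        /* per chunk: backing storage already freed */
};

/*
 * Record a release hint for nr_blocks blocks starting at first_block.
 * Returns how many chunks became completely unused and were freed.
 * Blocks past the end of the device are ignored.
 */
static size_t release_blocks(struct release_map *m, size_t first_block,
                             size_t nr_blocks)
{
    size_t freed = 0;

    for (size_t b = first_block; b < first_block + nr_blocks; b++) {
        size_t chunk = b / BLOCKS_PER_CHUNK;

        if (chunk >= m->nr_chunks)
            break;
        m->released[chunk] |= (unsigned char)(1u << (b % BLOCKS_PER_CHUNK));
        if (m->released[chunk] == CHUNK_FULL_MASK && !m->chunk_freed[chunk]) {
            m->chunk_freed[chunk] = true;
            freed++;
        }
    }
    return freed;
}
```

The sketch leaves out the reverse path: a write or allocate hint landing in a chunk would have to clear the corresponding bits again. The cost is one bit of extra metadata per filesystem block, which is the "recording tiny allocations" burden Neil describes, moved out of the filesystem.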