UBIFS is described as, "a new flash file system which is designed to work on top of UBI." It has replaced the JFFS3 project, a choice explained on the project webpage, "we have realized that creating a scalable flash file system on top of bare flash is a difficult task, just because the flash media is so problematic (wear-leveling, bad eraseblocks). We have tried this way, and it turned out to be that we solved media problems, instead of concentrating on file system issues. So we decided to split one big and complex tasks into 2 sub-tasks: UBI solves the media problems, like bad eraseblocks and wear-leveling, and UBIFS implements the file system on top. And now finally, we may concentrate on file-system issues: implementing write-back caching, multi-headed journal, garbage collector, indexing information management and so on. There are a lot of FS problems to solve - orphaned files, deletions, recoverability after unclean reboots and so on."
In a recent posting to the lkml [1], Artem Bityutskiy noted that UBIFS has to take into account that a small amount of space at the ends of eraseblocks goes unused, and that pages written to flash are smaller than they are in memory because the filesystem compresses them. "So, if our current liability is X, we do not know exactly how much flash space (Y) it will take. All we can do is to introduce some pessimistic, worst-case function Y = F(X). This pessimistic function assumes that pages won't be compressible, and it assumes worst-case wastage." The calculation is necessary because, even though data is not written immediately to the flash device, the filesystem must be able to tell the application writing the data when there is not enough space left. Since the estimate is so rough, Artem would rather write back some cached data than fail early, and asked, "So my question is: how can we flush _few_ oldest dirty pages/inodes while we are inside UBIFS (e.g., in ->prepare_write(), ->mkdir(), ->link(), etc)?"
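To make the idea of a pessimistic function concrete, here is a minimal sketch in C of one possible F(X). It is not UBIFS's actual formula: the node header size, inode node size, and eraseblock size are hypothetical constants, and `struct liability` and `worst_case_flash_bytes` are names invented for this illustration. It assumes that pages do not compress at all and that every eraseblock wastes the largest tail that cannot hold one more data node.

```c
#include <stdint.h>

/* All constants are illustrative assumptions, not UBIFS's real values. */
#define PAGE_BYTES        4096u            /* in-memory page size */
#define NODE_HEADER_BYTES 64u              /* hypothetical on-flash node header */
#define INODE_NODE_BYTES  160u             /* hypothetical on-flash inode node */
#define LEB_BYTES         (128u * 1024u)   /* hypothetical eraseblock size */

/* Liability X: dirty data that must be writable to flash on demand. */
struct liability {
    uint64_t dirty_pages;
    uint64_t dirty_inodes;
};

/*
 * F(X): pessimistic estimate of the flash bytes needed to write X back.
 * Assumes zero compression, and that each eraseblock loses a tail one
 * byte too small to hold another full data node (worst-case wastage).
 */
static uint64_t worst_case_flash_bytes(const struct liability *x)
{
    const uint64_t data_node = PAGE_BYTES + NODE_HEADER_BYTES;
    const uint64_t max_waste = data_node - 1;        /* per eraseblock */
    const uint64_t usable = LEB_BYTES - max_waste;   /* per eraseblock */

    /* Bytes of nodes to write if nothing compresses. */
    uint64_t raw = x->dirty_pages * data_node +
                   x->dirty_inodes * INODE_NODE_BYTES;

    /* Eraseblocks touched, rounded up, each charged its worst tail. */
    uint64_t lebs = (raw + usable - 1) / usable;

    return raw + lebs * max_waste;
}
```

With these assumed constants, 1000 dirty pages and 10 dirty inodes are budgeted at more than their raw node size even before compression is considered. Data that compresses well would need far less, which is why refusing writes on this estimate alone would leave usable flash space idle.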
Andrew Morton acknowledged, "this is precisely the problem which needs to be solved for delayed allocation on ext2/3/4. This is because it is infeasible to work out how much disk space an ext2 pagecache page will take to write out (it will require zero to three indirect blocks as well)." He added, "I expect that a similar thing was done in the ext4 delayed allocation patches - you should take a look at that and see what can be shared/generalised/etc." After digging deeper into the ext4 code, he concluded, "common ideas need to be found and implemented in the VFS. The ext4 patches do it all in the fs which is just wrong. The tracking of reservations (or worst-case utilisation) is surely common across these two implementations? Quite possibly the ENOSPC-time forced writeback is too."
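The approach Andrew recalls in the thread below, pessimistic in-memory reservations plus a filesystem-wide sync when ENOSPC is about to happen, can be sketched as a small user-space simulation in C. This is only an illustration under stated assumptions, not kernel code: `struct fs_space`, `sync_all`, and `reserve` are hypothetical names, and the `real` argument exists only so the simulation knows what a write-back will actually consume (a real filesystem learns that at write-back time). Callers are assumed to pass `real <= worst`.

```c
#include <errno.h>
#include <stdint.h>

/* Simulated space accounting for one filesystem (hypothetical). */
struct fs_space {
    uint64_t free_bytes;     /* space not yet consumed on the medium */
    uint64_t reserved_bytes; /* sum of worst-case reservations for dirty data */
    uint64_t pending_real;   /* what the dirty data will really consume */
};

/*
 * Write back everything that is dirty. The pessimistic reservations
 * collapse into real usage, releasing the over-estimated part.
 */
static void sync_all(struct fs_space *fs)
{
    fs->free_bytes -= fs->pending_real;
    fs->reserved_bytes = 0;
    fs->pending_real = 0;
}

/*
 * Reserve worst-case space before dirtying data. If the pessimistic
 * total would not fit, force a sync and retry once; only fail with
 * -ENOSPC if the request still does not fit afterwards.
 */
static int reserve(struct fs_space *fs, uint64_t worst, uint64_t real)
{
    if (fs->reserved_bytes + worst > fs->free_bytes) {
        sync_all(fs);
        if (worst > fs->free_bytes)
            return -ENOSPC;
    }
    fs->reserved_bytes += worst;
    fs->pending_real += real;
    return 0;
}
```

Starting from 10000 free bytes, two requests that each reserve 6000 bytes but really need 3000 both succeed, because the second one triggers a sync that releases the first over-estimate. A third identical request fails, since only 4000 bytes remain after the second sync. As the medium fills, syncs become more frequent, which matches the behaviour described in the thread.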
From: Artem Bityutskiy <dedekind@...>
Subject: Write-back from inside FS - need suggestions
Date: Sep 28, 5:16 am 2007
Hi,
we are writing a new flash FS (UBIFS) and need some advice/suggestions.
Brief FS info and the code are available at
http://www.linux-mtd.infradead.org/doc/ubifs.html [2].
At any point of time we may have a plenty of cached stuff which have to
be written back later to the flash media: dirty pages and dirty inodes.
This is what we call "liability" - current set of dirty pages and
inodes UBIFS must be able to write back on demand.
The problem is that we cannot do accurate flash space accounting due
to several reasons:
1. Wastage - some small random amount of flash space at the ends of
eraseblocks cannot be used.
2. Compression - we do not know how well will the pages be compressed,
so we do not know how much flash space will they consume.
So, if our current liability is X, we do not know exactly how much
flash space (Y) it will take. All we can do is to introduce some
pessimistic, worst-case function Y = F(X). This pessimistic function
assumes that pages won't be compressible, and it assumes worst-case
wastage. In real life it is hardly going to happen, but possible.
The function is really bad and may lead to huge over-estimations
like 40%.
So, if we are, say, in ->prepare_write(), we have to decide whether
there is enough flash space to write-back this page later. We do not
want to fail with -ENOSPC when, say, pdflush writes the page back. So
we use our pessimistic function F(X) to decide whether we have enough
space or not. If there is a plenty of flash space, the F(X) says "yes",
and we just proceed. The question is what do we do if F(X) says "no"?
If we just return -ENOSPC, the flash space utilization becomes too
poor, just because F() is really rough. We do have space in most
real-life cases. All we have to do in this case is to lessen our
liability. IOW, we have to flush few dirty inodes/pages, then we'd
be able to proceed.
So my question is: how can we flush _few_ oldest dirty pages/inodes
while we are inside UBIFS (e.g., in ->prepare_write(), ->mkdir(),
->link(), etc)?
I failed to find VFS calls which would do this. Stuff like
sync_sb_inodes() is not exactly what we need. Should we implement
a similar function? Since we have to call it from inside UBIFS, which
means we are holding i_mutex and the inode is locked, the function
has to be smart enough not to wait on this inode, but wait on other
inodes if needed.
A solution like kicking pdflush to do the job and wait on a waitqueue
would probably also work, but I'd prefer to do this from the context
of current task.
Should we have our own list of inodes and call write_inode_now() for
dirty ones? But I'd prefer to let VFS pick oldest victims.
So I'm asking for ideas which would work and be acceptable by the
community later.
Thanks!
--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
-
From: Andrew Morton <akpm@...>
Subject: Re: Write-back from inside FS - need suggestions
Date: Sep 28, 6:29 am 2007
On Fri, 28 Sep 2007 12:16:54 +0300 Artem Bityutskiy <dedekind@yandex.ru> wrote:
> Hi,
>
> we are writing a new flash FS (UBIFS) and need some advice/suggestions.
> Brief FS info and the code are available at
> http://www.linux-mtd.infradead.org/doc/ubifs.html [3].
>
> At any point of time we may have a plenty of cached stuff which have to
> be written back later to the flash media: dirty pages and dirty inodes.
> This is what we call "liability" - current set of dirty pages and
> inodes UBIFS must be able to write back on demand.
>
> The problem is that we cannot do accurate flash space accounting due
> to several reasons:
> 1. Wastage - some small random amount of flash space at the ends of
> eraseblocks cannot be used.
> 2. Compression - we do not know how well will the pages be compressed,
> so we do not know how much flash space will they consume.
>
> So, if our current liability is X, we do not know exactly how much
> flash space (Y) it will take. All we can do is to introduce some
> pessimistic, worst-case function Y = F(X). This pessimistic function
> assumes that pages won't be compressible, and it assumes worst-case
> wastage. In real life it is hardly going to happen, but possible.
> The function is really bad and may lead to huge over-estimations
> like 40%.
>
> So, if we are, say, in ->prepare_write(), we have to decide whether
> there is enough flash space to write-back this page later. We do not
> want to fail with -ENOSPC when, say, pdflush writes the page back. So
> we use our pessimistic function F(X) to decide whether we have enough
> space or not. If there is a plenty of flash space, the F(X) says "yes",
> and we just proceed. The question is what do we do if F(X) says "no"?
>
> If we just return -ENOSPC, the flash space utilization becomes too
> poor, just because F() is really rough. We do have space in most
> real-life cases. All we have to do in this case is to lessen our
> liability. IOW, we have to flush few dirty inodes/pages, then we'd
> be able to proceed.
>
> So my question is: how can we flush _few_ oldest dirty pages/inodes
> while we are inside UBIFS (e.g., in ->prepare_write(), ->mkdir(),
> ->link(), etc)?
>
> I failed to find VFS calls which would do this. Stuff like
> sync_sb_inodes() is not exactly what we need. Should we implement
> a similar function? Since we have to call it from inside UBIFS, which
> means we are holding i_mutex and the inode is locked, the function
> has to be smart enough not to wait on this inode, but wait on other
> inodes if needed.
>
> A solution like kicking pdflush to do the job and wait on a waitqueue
> would probably also work, but I'd prefer to do this from the context
> of current task.
>
> Should we have our own list of inodes and call write_inode_now() for
> dirty ones? But I'd prefer to let VFS pick oldest victims.
>
> So I'm asking for ideas which would work and be acceptable by the
> community later.
>
This is precisely the problem which needs to be solved for delayed
allocation on ext2/3/4. This is because it is infeasible to work out how
much disk space an ext2 pagecache page will take to write out (it will
require zero to three indirect blocks as well).
When I did delalloc-for-ext2, umm, six years ago I did
maximally-pessimistic in-memory space accounting and I think I just ran a
superblock-wide sync operation when ENOSPC was about to happen. That
caused all the pessimistic reservations to be collapsed into real ones,
releasing space. So as the disk neared a real ENOSPC, the syncs became
more frequent. But the overhead was small.
I expect that a similar thing was done in the ext4 delayed allocation
patches - you should take a look at that and see what can be
shared/generalised/etc.
ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/ext4-patches/LATEST/broken-out/ [4]
Although, judging by the comment in here:
ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/ext4-patches/LATEST/broken-out/ext4... [5]
+ * TODO:
+ * MUST:
+ * - flush dirty pages in -ENOSPC case in order to free reserved blocks
things need a bit more work. Hopefully that's a dead comment.
<looks>
omigod, that thing has gone and done a clone-and-own on half the VFS.
Anyway, I doubt if you'll be able to find a design description anyway
but you should spend some time picking it apart. It is the same problem..
-
From: Artem Bityutskiy <dedekind@...>
Subject: Re: Write-back from inside FS - need suggestions
Date: Sep 29, 5:56 am 2007
Andrew Morton wrote:
> This is precisely the problem which needs to be solved for delayed
> allocation on ext2/3/4. This is because it is infeasible to work out how
> much disk space an ext2 pagecache page will take to write out (it will
> require zero to three indirect blocks as well).
>
> When I did delalloc-for-ext2, umm, six years ago I did
> maximally-pessimistic in-memory space accounting and I think I just ran a
> superblock-wide sync operation when ENOSPC was about to happen. That
> caused all the pessimistic reservations to be collapsed into real ones,
> releasing space. So as the disk neared a real ENOSPC, the syncs became
> more frequent. But the overhead was small.
>
> I expect that a similar thing was done in the ext4 delayed allocation
> patches - you should take a look at that and see what can be
> shared/generalised/etc.
>
> ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/ext4-patches/LATEST/broken-out/ [6]
>
> Although, judging by the comment in here:
>
> ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/ext4-patches/LATEST/broken-out/ext4... [7]
>
> + * TODO:
> + * MUST:
> + * - flush dirty pages in -ENOSPC case in order to free reserved blocks
>
> things need a bit more work. Hopefully that's a dead comment.
>
> <looks>
>
> omigod, that thing has gone and done a clone-and-own on half the VFS.
> Anyway, I doubt if you'll be able to find a design description anyway
> but you should spend some time picking it apart. It is the same problem..
(For some reasons I haven't got your answer in my mailbox, found it in
archives)
Thank you for these pointers. I was looking at ext4 code and haven't
found what they do in these cases. I think I need some hints to realize
what's going on there. Our FS is so different from traditional ones
- e.g., we do not use buffer heads, we do not have block device
underneath, etc, so I even doubt I can borrow anything from ext4.
I have impression that I just have to implement my own list of
inodes and my own victim-picking policies. Although I still think it
should better be done on VFS level, because it has all these LRU lists,
and I'd duplicate things.
Nevertheless, I add Ted on CC in a hope he'll give me some pointers.
--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
-
From: Andrew Morton <akpm@...>
Subject: Re: Write-back from inside FS - need suggestions
Date: Sep 29, 6:39 am 2007
On Sat, 29 Sep 2007 12:56:55 +0300 Artem Bityutskiy <dedekind@yandex.ru> wrote:
> Andrew Morton wrote:
> > This is precisely the problem which needs to be solved for delayed
> > allocation on ext2/3/4. This is because it is infeasible to work out how
> > much disk space an ext2 pagecache page will take to write out (it will
> > require zero to three indirect blocks as well).
> >
> > When I did delalloc-for-ext2, umm, six years ago I did
> > maximally-pessimistic in-memory space accounting and I think I just ran a
> > superblock-wide sync operation when ENOSPC was about to happen. That
> > caused all the pessimistic reservations to be collapsed into real ones,
> > releasing space. So as the disk neared a real ENOSPC, the syncs became
> > more frequent. But the overhead was small.
> >
> > I expect that a similar thing was done in the ext4 delayed allocation
> > patches - you should take a look at that and see what can be
> > shared/generalised/etc.
> >
> > ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/ext4-patches/LATEST/broken-out/ [8]
> >
> > Although, judging by the comment in here:
> >
> > ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/ext4-patches/LATEST/broken-out/ext4... [9]
> >
> > + * TODO:
> > + * MUST:
> > + * - flush dirty pages in -ENOSPC case in order to free reserved blocks
> >
> > things need a bit more work. Hopefully that's a dead comment.
> >
> > <looks>
> >
> > omigod, that thing has gone and done a clone-and-own on half the VFS.
> > Anyway, I doubt if you'll be able to find a design description anyway
> > but you should spend some time picking it apart. It is the same problem..
>
> (For some reasons I haven't got your answer in my mailbox, found it in
> archives)
>
> Thank you for these pointers. I was looking at ext4 code and haven't
> found what they do in these cases.
I don't think it's written yet. Not in those patches, at least.
> I think I need some hints to realize
> what's going on there. Our FS is so different from traditional ones
> - e.g., we do not use buffer heads, we do not have block device
> underneath, etc, so I even doubt I can borrow anything from ext4.
Common ideas need to be found and implemented in the VFS. The ext4 patches
do it all in the fs which is just wrong.
The tracking of reservations (or worst-case utilisation) is surely common
across these two implementations? Quite possibly the ENOSPC-time forced
writeback is too.
> I have impression that I just have to implement my own list of
> inodes and my own victim-picking policies. Although I still think it
> should better be done on VFS level, because it has all these LRU lists,
> and I'd duplicate things.
I'd have thought that a suitable wrapper around a suitably-modified
sync_sb_inodes() would be appropriate for both filesystems?
-
Related links:
- Archive of above thread [9]