[Tux3] Page cache emulation now up and running

Previous thread: [Tux3] Four kinds of inode data attributes by Daniel Phillips on Saturday, August 9, 2008 - 3:05 am. (2 messages)

Next thread: [Tux3] Structure of Tux3 by Daniel Phillips on Tuesday, August 12, 2008 - 1:40 am. (5 messages)
To: <tux3@...>
Date: Monday, December 1, 2008 - 10:38 pm

I was randomly reading
http://tux3.org/pipermail/tux3/2008-September/000186.html
for pleasure and noticed a possible latent corruption bug in Daniel
Phillips's post of the atom-refcount-update procedure (below).

If an i/o error occurs while reading the block containing the upper-16
bits of refcount, the procedure nevertheless updates the low-16 bits of
refcount on disk, and then returns an EIO error. The fix probably
requires bread-ing both blocks into separate buffers before modifying
either (only one block if the upper-16 remain zero, of course).

I don't know anything about tux3, so perhaps some higher-level mechanism
un-does the incorrect update of the low-16 refcount after the EIO is
returned. But if not, a flaky disk (or network link to a disk) might
result in the refcount being silently reduced by 65535.

For reference, here is Daniel's code from the post (without endien-ness
stuff):

int use_atom(struct inode *inode, atom_t atom, int use)
{
unsigned shift = inode->sb->blockbits - 1;
unsigned block = inode->sb->atomref_base + 2 * (atom >> shift);
unsigned offset = atom & ~(-1 << shift);
struct buffer *buffer;

if (!(buffer = bread(inode->map, block)))
return -EIO;
int low = ((u16 *)buffer->data)[offset] + use;
((u16 *)buffer->data)[offset] = low;
if ((low & (-1 << 16))) {
brelse_dirty(buffer);
if (!(buffer = bread(inode->map, block + 1)))
return -EIO; // <********************** BUG ?
((u16 *)buffer->data)[offset] += low >> 16;
}
brelse_dirty(buffer);
return 0;
}

_______________________________________________
Tux3 mailing list
Tux3@tux3.org
http://tux3.org/cgi-bin/mailman/listinfo/tux3

To: <tux3@...>
Date: Sunday, August 10, 2008 - 7:14 am

The tux3 code is starting to get a little more interesting now.

Another significant part of the kernel vfs machinery is now emulated:
the page cache. Which in userland does not have to care about pages,
so it works in buffers instead, and instead of a special index field
we just reinterpret the block field as an index. This is the way the
Linux kernel does it anyway (actually I dimly recall that change came
about in Linux as a response to the way I handled directory block
access in htree) but this is obscured by the confusion between block
and page units of granularity. The kernel uses a page cache with a
direct device mapping to implement the buffer cache, while my code
uses the buffer cache to implement the equivalent of the kernel page
cache. The latter is a much cleaner way to do things, but is not an
option for the kernel until we get around to generalizing page size.

Anyway, both in kernel and in my emulation, the block interface is
just the vererable BSD standard getblk and bread, with writing done
by marking buffers dirty and flushing them out en mass periodically.
Easy to understand and get right.

This page cache emulation effort is about making directory operations
happen. We want to scan sequentially through an entire directory file
looking for an entry or a place to create a new one. So we need a
notion of what sequentially means. One way is to walk an inode btree,
but that is not the way the kernel does things, and that would be a
problem when it comes time to port. What the kernel does is read each
block of a directory file into a page cache the first time somebody
accesses it, and after that, the cached block can be accessed very
efficiently without needing to read it again or having to go messing
around in the filesystem metadata. This is really, really powerful.
You do not want to go trying to reimplement that kind of caching at
the filesystem level, because you will just end up with a lot of code
that does not do nearly as good a job as the kernel page...

To: <tux3@...>, <james_avera@...>
Date: Tuesday, December 2, 2008 - 1:04 am

Looking at your post, it just occurred to me, we should change all our
breads from:

if (!(buffer = bread(inode->map, block)))
return -EIO;

to something like:

if (!(buffer = blockread(inode->map, block, &err)))
return err;

This way we can stop propagating the traditional silliness of inventing
our own error code when bread returns NULL. Our motto: "Tux3, we are
not part of the problem".

Regards,

Daniel

_______________________________________________
Tux3 mailing list
Tux3@tux3.org
http://tux3.org/cgi-bin/mailman/listinfo/tux3

To: <tux3@...>
Cc: <james_avera@...>
Date: Tuesday, December 2, 2008 - 1:19 am

...or:

if (IS_ERR(buffer = blockread(inode->map, block)))
return PTR_ERR(buffer);

Anybody have an opinion on whether IS_ERR works just as well in user
space as in kernel?

Daniel

_______________________________________________
Tux3 mailing list
Tux3@tux3.org
http://tux3.org/cgi-bin/mailman/listinfo/tux3

To: Daniel Phillips <phillips@...>
Cc: <tux3@...>, <james_avera@...>
Date: Tuesday, December 2, 2008 - 5:44 am

It would work if kernel reserved that range as kernel space like linux.
Well, if we want to make sure it, I think we can reserve the address
range for it.

e.g. maybe

error_base = mmap(,MAX_ERRNO + 1,PROT_NONE,,,);

ERR_PTR(err)
assert(err < 0);
return error_base + -err;

PTR_ERR(ptr)
return -(long)(ptr - error_base);

IS_ERR(ptr)
return (error_base < ptr && ptr <= error_base + MAX_ERRNO);
--
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>

_______________________________________________
Tux3 mailing list
Tux3@tux3.org
http://tux3.org/cgi-bin/mailman/listinfo/tux3

To: OGAWA Hirofumi <hirofumi@...>
Cc: <tux3@...>, <james_avera@...>
Date: Tuesday, December 2, 2008 - 4:07 pm

Nice! I think we should do it, and get rid of some of those horrible
assumptions about NULL pointer returns, even if we can't fix every one
right now (in kernel we are stuck with NULL return for sb_bread, for
example).

Daniel

_______________________________________________
Tux3 mailing list
Tux3@tux3.org
http://tux3.org/cgi-bin/mailman/listinfo/tux3

To: <tux3@...>, <james_avera@...>
Date: Tuesday, December 2, 2008 - 12:56 am

Exactly. If we get EIO on a buffer update we will not commit the
transaction containing the atom update at all, and put the volume into
read-only mode just like Ext3 does. That is our first line of defense
anyway, we will refine that as we go. Ultimately, we want to be able to
"power through" a nasty like this and keep going, picking up the pieces
as we can. If that means walking all the inode table blocks in the
system to recount the xattr atoms, then so be it. And we can run for
quite some time without tracking atom counts at all, which is nice for
this particular case. We would map the refcount somewhere else (to a
different physical device hopefully) and continue, counting only the
_new_ reference counts, while gathering up the pre-existing ones from
the atom table. (To be precise, we track only new updates to the

Thanks for the sharp eyes :-)

It's not actually a bug for the reason above, but I did not explained
that before I think
, so until I did it was a bug.

Daniel

_______________________________________________
Tux3 mailing list
Tux3@tux3.org
http://tux3.org/cgi-bin/mailman/listinfo/tux3

Previous thread: [Tux3] Four kinds of inode data attributes by Daniel Phillips on Saturday, August 9, 2008 - 3:05 am. (2 messages)

Next thread: [Tux3] Structure of Tux3 by Daniel Phillips on Tuesday, August 12, 2008 - 1:40 am. (5 messages)