way. There was a lot of obsessing over the wisdom of the xattr atom
idea vs storing the literal ascii xattr names. A nasty problem with
atoms is, how can we know the filesystem will never fill up with atom
say, just let the on-disk atom table grow as new attribute types arrive
traditional method of storing literal ascii strings. It looks like
need to apply are almost the same as for file atimes. So there is some
design synergy there, which helps ameliorate the pain.I have an argument for going with solution 1 or 3, maybe depending on
filesystem size: we basically have to keep a copy of the atom table in
memory at all times, otherwise we have to fetch a bunch of blocks
everytime someone calls setxattr(). I see two general cases for
filesystems:
1. filesystem size >> RAM: this is the common case for everything except
USB thumb drives, and if the atom table is large enough to matter in terms
of disk usage, it's already getting cramped by main memory
2. filesystem size <= RAM: for all modern media, this is small enough
that garbage collection should feasible in a reasonable amount of timeSo if we accept that caching constraints for the atom table reasonably
bound it by main memory size, it seems to me that there is no reason to
bother with reference counting. (Plus, that means that the atom table
takes up one less word per atom.)All that said, this may simply be an argument against having a global atom
table at all: the atom table constitutes an unprivileged DoS. Especially
since I can't think of a sane way to charge quota for shared xattr names.Maybe the solution is to have the xattr atom table be per-directory or
per-user? (A per-user atom table could then be integrated with the
per-user quota information and also makes it crystal clear who gets
charged for the atom table blocks. But then it might make sense to
refcount or GC atoms again, so that users can reclaim their quota blocks.
Hmm. Or maybe 1-bit refcounting: any name used more than once never goes
a...
Hi,
The free bitmap is called from fill_segs(). The page for free bitmap
may be used recursively. We have to care this point.lock_page(bitmap page (e.g. index == 3))
->writepage(bitmap page)
tux3_get_block()
balloc() (allocate blocks for that bitmap page)
/* we are starting to find free block on bitmap */
while (end of page)
blockread(bitmap page)
lock_page(bitmap page (index == 3))Current blockread() have the above deadlock. This patchset is including
the hack to avoid this. Later, we should revisit blockread() to fix
properly.Second one is dleaf_merge() fix. Now, it don't care about
MAX_GROUP_ENTRIES at all. This fixes it.static-http://userweb.kernel.org/~hirofumi/tux3/
Please review those.
--
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>_______________________________________________
Tux3 mailing list
Tux3@tux3.org
http://mailman.tux3.org/cgi-bin/mailman/listinfo/tux3
Done. A complete rewrite of dleaf_merge, saving two or three lines
from my original code :-)OK, announcement time: with these changes, Hirofumi has got Tux3
running stable through multiple runs of the fsx-linux filesystem
exerciser, so this is a big day. And that would be Hirofumi 'Hero'
Ogawa to me.I will prepare a new patch, and a post to invite others to try
testing this code. Next, when we have a basic atomic commit working
it will be about time to offer Tux3 up for review.Regards,
Daniel
_______________________________________________
Tux3 mailing list
Tux3@tux3.org
http://mailman.tux3.org/cgi-bin/mailman/listinfo/tux3
Actually, copying group/entry/extent are your original version. Those
are just tweaked for new local values.I modified code to keep vfrom leaf valid.
static-http://userweb.kernel.org/~hirofumi/tux3/
Please review and pull it again.
--
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>_______________________________________________
Tux3 mailing list
Tux3@tux3.org
http://mailman.tux3.org/cgi-bin/mailman/listinfo/tux3
It is not as bad as that, it is just a probe or two into the buffer
cache, where each probe is just a lookup in a very small radix tree,
normally a single level. We could put a small hash in front of that
if it matters, which would save not only the radix tree probe but the
iteration through the dentry block, and replace it with iterating down
a hash chain. Not a hugely obvious difference I expect, but also
pretty easy to write, so if/when it gets to be the most annoying thingBut you can't place any bound on the size of the atom table unless
Without reference counting the atom table is a DoS, that is why
reference counting or something equivalent is not an option. And it is
indeed an argument against atoms, not because of efficiency but because
of complexity. So long as the complexity does not blow up then I think
atoms are a win. Maybe a big win if you think about the compressionThat is a good idea that had not occured to me. I will regard it as a
fallback if reference counting does not work out. But I think reference
counting will work out just fine, and it is the cleanest solution imho.Regards,
Daniel
_______________________________________________
Tux3 mailing list
Tux3@tux3.org
http://tux3.org/cgi-bin/mailman/listinfo/tux3
Hi Benjamin,
So now we have an actual implementation of the xattr atom idea in
place, which makes the annoying little questions much less theoretical
and more pressing. My thinking has evolved a little:* Now we know that the incremental amount of code needed for
xattr name deduplication is really tiny, just 20 or 30 lines.* But we also need reference counting to make it nice, which is
going to cost more code and some performance.* The only immediately measurable benefits to xattr atoms are space
reduction and cache pressure reduction. So this is just a space
optimization, against which we have additional code complexity
and performance impact.These thoughts seem to point directly at an inescapable conclusion:
make it optional as has been suggested. There are of course infinite
variations on optional. I will touch on a couple that immediately
come to mind.One option is to trust root as you suggested, to create only the xattrs
the system actually needs, for example, acls. Other users might be
similarly trusted. How would we know? A side channel message to Tux3,
like an ioctl or a ddlink message? Rely on the uid? Other?Another option would be to have a global on/off mount option for xattr
name deduplication. Kind of an obscure option, maybe. We could
instead cast that as a generic "try to compress" level, so Tux3 could
have a compression level option. 0 -> none, 1 -> compress xattr names,
2 -> compress xattr bodies too, 6 -> use gzip all over the place, 9 ->
go nuts, or something like that.Whatever we settle on, I do not think the effort that went into xattr
name deduplication is in any way wasted. It only cost an hour or two
of development - the time to write emails about it was actually a lot
more, and that was fun. Even better, the code to skirt around the atom
optimization and directly store the xattr names in inodes looks like it
will be really easy and short. We do everything just as it is done for
at...
OK, I have decided to take a run at implementing reference counting for
atoms: the correct, no compromise approach, and maybe also efficient
enough that we do not have to make it optional.The plan is this:
* We have an atom reference count table which is broken into two
parts, a table of the low 16 bits of the atom count, and a
table of the high 48 bits that receives carries and borrows from
the low 16 bit table.* Both tables are mapped into the atom table at a high logical
offset. Allowing 32 bits worth of atom numbers, and with at most
256 atom entries per 4K dirent block, we need at most
(32 << 8) = 1 TB dirent bytes for the atom dictionary, so the
count tables start at block number 2^40 >> 12 = 2^28.* The low end count table needs 2^33 bytes at most, or 2^21 blocks,
so the high count table starts just above it at 2^28 + 2^21
blocks.* This all works for blocksize down to 2^9, and breaks at 2^8, but
that is ok I think.These blocks sit in the block cache (->map in Tux3 fake page cache,
->mapping in kernel) in a sparse file of course, so we do not actually
use huge amounts of space, just take advantage of the logical address
mapping to avoid having several files that which would have to be
created and synced separately.So to update an atom count we need to bread the right count block,
add to the count, carry over into the high order count map if necessary
(very rare) then we are done. If we care about the cost of hashing
into the map to find the block then we hold a use count on and a
pointer to the last count block accessed. To start with we don't care
about that.This leaves us with a dirty buffer or two in the atable, and we keep
doing that for a while until it comes time to sync to disk. We will
mostly likely just have one block dirty, the count block, or two if
somebody created new xattrs. We will not actually write out the count
block(s) that changed, but instead a list of [atom, +-c...
The 1 TB size is correct, but the reason is woefully wrong. It should
be: Allowing 32 bits worth of atom numbers, and with a 255 byte name
for every xattr...This could actually use a little more than 1 TB, which I am not going
to sweat about: we will fail with ENOSPC for any attempt to expand the
dirent part of the atom table past 1 TB, which if it ever happens is
because somebody had permission to (and thus no quota) and did it
intentionally, which falls into the same category as just filling up
the entire fs with a big file.The other necessary component of the atom table I forgot to mention is
the reverse map, which maps atom numbers back to dirents so we can
implement xattr listing efficiently. When a new atom dirent is
created, we also set the reverse map for the dirent's atom number to
the file offset at which the dirent was created. This will be 64 bits
if we want to be lazy, which I do here, so that is 2^32 atoms * 8 bytes
= 2^35 revmap bytes = 2^35 >> 12 blocks = 2^23 blocks. We locate this
just above the count table (low + high part), which puts it at logical
offset 2^28 + 2^23, since the refcount table is also (by coincidence)
2^23 bytes in size.An alternative to the reverse map approach would be to use the dirent
offset directly as the atom number, which gives a less dense mapping of
atom numbers and thus somewhat less compression, with slightly simpler
code in return and less work to do the first time an xattr is set on
any inode. But since the code and efficiency difference verges on
negligible, I will err on the side of better compression and go with the
simple minded reverse map. Keep in mind that the limits I'm designing
for are way, way, way higher than what will be used in practice. I
just do not want this design being attacked on the basis of something
ASCII strings can do that atoms cannot. I guess we are about there at
this point, but of course the proof will be in the implementation,
which ought to land in the not too far distant future....
| david | Re: Dual-Licensing Linux Kernel with GPL V2 and GPL V3 |
| Greg Kroah-Hartman | [PATCH 009/196] Chinese: add translation of sparse.txt |
| Andrew Morton | Re: -mm merge plans for 2.6.23 -- sys_fallocate |
| Stephen Rothwell | Announce: Linux-next (Or Andrew's dream :-)) |
git: | |
| Gerrit Renker | [PATCH 15/37] dccp: Set per-connection CCIDs via socket options |
| David Miller | [GIT]: Networking |
| David Miller | Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
| Wenji Wu | A Linux TCP SACK Question |
