The biggest bitmap What is the largest volume we can map with a bitmap? On a 32 bit system, it is 2^32 page cache blocks * 2^12 bytes per block * 2^3 bits per byte * 2^12 block size = 2^59. One bit short of the promised exabyte maximum Tux3 volume size. The limitation is the size of the page cache index on a 32 bit system; on a 64 bit system there is no such limitation. Even if a filesystem is created on a 64 bit system it still should be readable on a 32 bit system. We could just say for now that bitmap allocation is avoided in favor of the free extent tree as described in the previous free tree post, only for the part of a volume beyond half an exabyte. Not that anybody is likely to test that in the near future! Another fix is to allocate two kernel mappings, one for the lower half of the bitmap table and the other for the upper half. Or we could be lame and just declare that Tux3 is limited to half an exabyte on 32 bit systems. There is plenty of time to solve this little issue, I just thought I would mention it now for completeness. More on bitmap recursion Putting the bitmap into a file is kind of an exciting thing to do, not necessarily in the good sense of exciting. Flushing the bitmap to disk will occasionally cause allocations either of physical blocks to hold newly instantiated bitmap blocks (first bit set in a block that was previously a hole in the bitmap file) or more rarely, nodes and leaf block for the file index btree. So after flushing the bitmap to disk we may find that some buffered blocks of the bitmap are dirty again as a result of allocations performed during the flush. Hmm. One could possibly construct a case where arbitrarily many flushes are required to get all of the bitmap buffers into a clean state. Preallocating the entire bitmap to avoid this bizarre effect is not an option if I intend to go through with my plan to combine a sparse bitmap with an extent tree to get the best of both allocation methods. So what I am going to do is think about this issue a little before doing anything design wise or code wise. When I first started talking about Tux3, Maciej (lurking here on the list) waxed poetic about the virtues of using actual files to represent filesystem objects that happen to resemble files, and I expressed a general fear of strange recursions down that path. Well here we have a nice example of one of those. And I am going to press on and deal with it. So I guess Maciej wins that point... provided this self dirtying behaviour can be tamed elegantly. Filesystem bootstrap Creating a new Tux3 filesystem requires allocating a number of objects, including objects involved in allocation. Another nice recursion there: you have to allocate space for objects, but the block allocator is not initialized yet. This is typically handled by laying out all those initial allocations by hand, using ad hoc fully written out code in a special mkfs.<filesystem> utility. Ddsnap does this, and though we tried hard, that initial create code looks ugly and is unpleasant to work on. I do not doubt that most filesystems have this issue. Tux3 has a nice solution now: because of the nice separation between cached buffer mappings and on disk representation, the allocator can be initialized just by creating a new inode, which itself does not require any block allocations. If you look at the main routine in inode.c you see: * Create the device, device buffer map and superblock descriptors * Initialize the kernel buffer cache emulation * Create the bitmap inode and store it in the superblock * Block allocator is now functional Initialization continues by allocating the root of the inode table, at which point it is possible to create the root directory as a normal file create. The test code then goes on to create a file in the root directory, flush various maps to disk and so on. This is satisfactory indeed, and is likely to strongly resemble the production code. Allocation strategy The allocator in balloc.c currently implements a simple, cyclic strategy where the next allocation goal is always immediately after the last one, with wraparound to the bottom of the volume. This is a horribly deficient way to do things. For one thing, it could take a very long time to find a free block if all blocks in a region happen to be allocated or the filesystem is nearly full. For another, mixing together deletes and creates causes extreme fragmentation with this strategy. Not as bad as completely random allocation, but close to it. A production version of Tux3 needs a reasonably sophisticated allocation strategy in order to be at all usable with rotating media. I am going to be returning to this question over and over again as Tux3 matures. For today, Tux3 has enough allocation machinery to show itself as a prototype, but not to survive head to head benchmarking in good style. Regards, Daniel _______________________________________________ Tux3 mailing list Tux3@tux3.org http://tux3.org/cgi-bin/mailman/listinfo/tux3
