Hello everyone, I wasn't planning on releasing v0.12 yet, and it was supposed to have some initial support for multiple devices. But, I have made a number of performance fixes and small bug fixes, and I wanted to get them out there before the (destabilizing) work on multiple-devices took over. So, here's v0.12. It comes with a shiny new disk format (sorry), but the gain is dramatically better random writes to existing files. In testing here, the random write phase of tiobench went from 1MB/s to 30MB/s. The fix was to change the way back references for file extents were hashed. Other changes: Insert and delete multiple items at once in the btree where possible. Back references added more tree balances, and it showed up in a few benchmarks. With v0.12, backrefs have no real impact on performance. Optimize bio end_io routines. Btrfs was spending way too much CPU time in the bio end_io routines, leading to lock contention and other problems. Optimize read ahead during transaction commit. The old code was trying to read far too much at once, which made the end_io problems really stand out. mount -o ssd option, which clusters file data writes together regardless of the directory the files belong to. There are a number of other performance tweaks for SSD, aimed at clustering metadata and data writes to better take advantage of the hardware. mount -o max_inline=size option, to override the default max inline file data size (default is 8k). Any value up to the leaf size is allowed (default 16k). Simple -ENOSPC handling. Emphasis on simple, but it prevents accidentally filling the disk most of the time. With enough threads/procs banging on things, you can still easily crash the box. -chris -
From: Chris Mason <chris.mason@oracle.com>
Any page size larger than 4K will not work with btrfs. All of the
extent stuff assumes that PAGE_SIZE <= sectorsize.
I confirmed this by forcing mkfs.btrfs to use an 8K sectorsize on
sparc64 and I was finally able to successfully mount a partition.
With 4K there are zero's in the root tree node header, because it's
extent's location on disk is at a sub-PAGE_SIZE multiple and the
extent code doesn't handle that.
You really need to start validating this stuff on other platforms.
Something that isn't little endian and something that doesn't use 4K
pages. I'm sure you have some powerpc parts around somewhere. :)
Anyways, here is a patch for the kernel bits which fixes most of the
unaligned accesses on sparc64.
diff -u --recursive --new-file vanilla/btrfs-0.12/ctree.h btrfs-0.12/ctree.h
--- vanilla/btrfs-0.12/ctree.h 2008-02-06 08:37:39.000000000 -0800
+++ btrfs-0.12/ctree.h 2008-02-10 17:17:49.000000000 -0800
@@ -495,22 +495,17 @@
#define BTRFS_SETGET_HEADER_FUNCS(name, type, member, bits) \
static inline u##bits btrfs_##name(struct extent_buffer *eb) \
{ \
- char *kaddr = kmap_atomic(eb->first_page, KM_USER0); \
- unsigned long offset = offsetof(type, member); \
- u##bits res; \
- __le##bits *tmp = (__le##bits *)(kaddr + offset); \
- res = le##bits##_to_cpu(*tmp); \
- kunmap_atomic(kaddr, KM_USER0); \
+ type *p = kmap_atomic(eb->first_page, KM_USER0); \
+ u##bits res = le##bits##_to_cpu(p->member); \
+ kunmap_atomic(p, KM_USER0); \
return res; \
} \
static inline void btrfs_set_##name(struct extent_buffer *eb, \
u##bits val) \
{ \
- char *kaddr = kmap_atomic(eb->first_page, KM_USER0); \
- unsigned long offset = offsetof(type, member); \
- __le##bits *tmp = (__le##bits *)(kaddr + offset); \
- *tmp = cpu_to_le##bits(val); \
- kunmap_atomic(kaddr, KM_USER0); \
+ type *p = kmap_atomic(eb->first_page, KM_USER0); \
...Grin, I think around v0.4 I grabbed a ppc box for a day and got things working. There has been some churn since then... My first prio is the newest set of disk format changes, and then I'll sit down Many thanks, I'll try these out here and push them into the tree. -chris -
The CRC32C implementation in the btrfs progs is different from the one
in the kernel, so obviously nothing can possibly work on big-endian.
This is getting less and less fun by the minute, I simply wanted to
test btrfs on Niagara :-/
Here is a patch to fix that:
--- vanilla/btrfs-progs-0.12/crc32c.c 2008-02-06 08:37:45.000000000 -0800
+++ btrfs-progs-0.12/crc32c.c 2008-02-12 01:19:33.000000000 -0800
@@ -91,13 +91,11 @@ static const u32 crc32c_table[256] = {
* crc using table.
*/
-u32 crc32c_le(u32 seed, unsigned char const *data, size_t length)
+u32 crc32c_le(u32 crc, unsigned char const *data, size_t length)
{
- u32 crc = (__force __u32)(cpu_to_le32(seed));
-
while (length--)
crc =
crc32c_table[(crc ^ *data++) & 0xFFL] ^ (crc >> 8);
- return le32_to_cpu((__force __le32)crc);
+ return crc;
}
-Filesystems like ext2 put their superblock 1 block into the partition in order to avoid overwriting disk labels and other uglies. UFS does this too, as do several others. One of the few exceptions I've been able to find is XFS. This is a real issue on sparc where the default sun disk labels created use an initial partition where block zero aliases the disk label. It took me a few iterations before I figured out why every btrfs make would zero out my disk label :-/ -
From: David Miller <davem@davemloft.net>
Actually it seems this is only a problem with mkfs.btrfs, it clears
out the first 64 4K chunks of the disk for whatever reason.
The following patch cures the disk label spamming problem for me:
--- vanilla/btrfs-progs-0.12/mkfs.c 2008-02-06 08:37:45.000000000 -0800
+++ btrfs-progs-0.12/mkfs.c 2008-02-12 00:07:43.000000000 -0800
@@ -210,7 +210,8 @@ int main(int ac, char **av)
exit(1);
}
memset(buf, 0, sectorsize);
- for(i = 0; i < 64; i++) {
+ lseek(fd, BTRFS_SUPER_INFO_OFFSET, SEEK_SET);
+ for(i = BTRFS_SUPER_INFO_OFFSET / sectorsize; i < 64; i++) {
ret = write(fd, buf, sectorsize);
if (ret != sectorsize) {
fprintf(stderr, "unable to zero fill device\n");
-It is a good idea to remove supers from other filesystems. I also need to add zeroing at the end of the device as well. Looks like I misread the e2fs zeroing code. It zeros the whole external log device, and I assumed it also zero'd out the start of the main FS. So, if Btrfs starts zeroing at 1k, will that be acceptable for you? -chris -
From: Chris Mason <chris.mason@oracle.com> Sure. -
Something looks wrong here. Why would btrfs need to zero at all? Superblock at 0, and done. Just like xfs. (Yes, I had xfs on sparc before, so it's not like you NEED the whitespace at the start of a partition.) -
From: Jan Engelhardt <jengelh@computergmbh.de> So that existing superblocks on the partition won't be interpreted as correct by other filesystems. It's No, we won't do stupid things like that and make an entire cylinder of our disks unusable. See my other reply. -
The reason why we don't put the superblock at 0 is not because it screws over the sparc, but because on many systems (including x86) the bootsector is stored at 0. It's not hard for mke2fs to zap the boot sector which we do on all architectures *except* sparc, to avoid nuking the disk label. (Chris just missed the "#ifndef __sparc // #define ZAP_BOOTBLOCK // #endif" at the beginning of mke2fs.c) This is the best of all words; it makes sparc happy; it allows boot loaders to put the x86 standard initial stage 0 boot loader in the first 446 bytes of the disk; and by zapping sector 0 on all architectures except the sparc, it solves the previous filesystem "ghost traces" detection problem for filesystems like xfs that put the superblock at 0. - Ted -
From: Jan Engelhardt <jengelh@computergmbh.de> You actully do unless you want to lose significant chunks of your disk space. The Sun disk label only allows you to specify the start of a partition in cylinders, so if you want to use a filesystem like XFS you have to start the partition on cylinder 1 which can be many blocks into the disk. That entire first cylinder is completely wasted. What XFS does by putting the superblock at zero is simply does not take these kinds of issues into consideration. -
I don't believe a cylinder of wasted disk space is significant. I don't believe a cylinder of disk is worth adding the complexity of sharing a partition to a filesystem. That complexity translates into engineering time and mistakes. Your other arguments for making a hole in the filesystem, based on tradition, are more convincing. -- Bryan Henderson IBM Almaden Research Center San Jose CA Filesystems -
Ok you do have a point there. The GPT users win of course, since it uses LBA, not cyls, so the number of lost bytes is generally below a cyl. On the other hand, the H and S of CHS could be lowered and S increased, e.g. divide H by 2, divide S by 2, multiply S by 4. This gives a finer Well it was designed for a different system initially with a different style of booting. -
From: Jan Engelhardt <jengelh@computergmbh.de> That's really not an option when you only have 16-bits in the disk label for some of these values, which is the case for Sun disk labels and some others (f.e. BSD). Jan, please get past this, it's a real issue and we therefore have to skip the initial 1K or so in order to provide something even approaching sane. No modern filesystem should put it's superblock at position zero. -
I've had requests to move the super down to 64k to make room for bootloaders, which may not matter for sparc, but I don't really plan on different locations for different arches. 4k aligned is important given that sector sizes are growing. -chris -
From: Chris Mason <chris.mason@oracle.com> The Sun disk label sits in the first 512 bytes and the boot loader block sits in the second 512 bytes. I think leaving even more space is a good idea for several reasons. -
Yep. I chose 32K unused space in the prototype filesystem I wrote [1, 2.4 era]. I'm pretty sure I got that number from some other filesystem, maybe even some NTFS incarnation. It's just good practice to avoid the first and last "chunks" of a partition, FSVO chunk. Jeff [1] http://kernel.org/pub/linux/kernel/people/jgarzik/ibu/ -
NTFS superblock (and the partial mirror copy) can be anywhere except in the first blocks. That space is where the $BOOT file is placed which contains the bootstrap code and the BIOS Paramether Block which includes the NTFS signature and describes various filesystem information needed to locate the superblock, etc. Unlike mkfs.xfs which warns since at least 2002 and requires the -f option to override Sun disklabels, at the moment mkfs.ntfs will indeed destroy them. Thank you for the bug report and let's hope the next generation of Sun hardwares won't scatter the firmware too into random places inside a partition encoded by a fictitious size of disk cylinder. Szaka -
In x86, there is even more space for a bootloader (some 28k or so) even if your partition table is as closely packed as possible, from 0 to 7e00 IIRC. For sparc you could have something like startlba endlba type sda1 0 2 1 Boot sda2 2 58 3 Whole disk sda3 58 90000 83 Linux and slap the bootloader into "MBR", just like on x86. Or I am missing something.. -
From: Jan Engelhardt <jengelh@computergmbh.de> You cannot specify partitions on LBA boundaries on sparc, the Sun disk label specifies the partition start points in cylinders. -
It was a request from hpa, and he clearly had something in mind. He kindly offered to review the disk format for bootloaders and other lower level issues but I asked him to wait until I firm it up a bit. From my point of view, 0 is a bad idea because it is very likely to conflict with other things. There are lots of things in the FS that need deep thought,and the perfect system to fully use the first 64k of a 1TB filesystem isn't quite at the top of my list right now ;) Regardless of offset, it is a good idea to mop up previous filesystems where possible, and a very good idea to align things on some sector boundary. Even going 1MB in wouldn't be a horrible idea to align with erasure blocks on SSD. -chris -
From: Chris Mason <chris.mason@oracle.com> Starting at 0 is a bad idea because otherwise you'll waste significant chunks of your disk on Sparc because of reasons I've outlined in other replies. What XFS does is really unfortunate, let's learn from it's mistake. -
I'd rather say what Sun did with their disklabels was rather unfortunate :) But yeah, new filesystem should cater for it's braindamage because it doesn't have any kind of runtime cost at all. XFS was designed for IRIX back then which had disklabels just like the SUN ones, just without the braindamage of having the disklabel inside the partition.. -
> I still don't like the idea of btrfs trying to be smarter than a user who can partition up his system according to (a) his likes (b) system or hardware requirements or recommendations to align the superblock to a specific location. 1MB alignment does not always mean 1MB alignment. Sector 1 begins at 0x7e00 on x86. And with the maximum CHS geometry (255/63), partitions begin at 0x7e00+n*8225280 bytes, so the SB is unlikely to ever be on a 1048576 boundary. -
From: Jan Engelhardt <jengelh@computergmbh.de> All of your beliefs are unfortunately without the understanding of restrictions that exist in several partition layouts such as the Sun disk label one. You have to start the superblock somewhere other than zero or else you lose a huge chunk of your disk, and furthermore a zero based partition is what all of the Sun disk label creating programs make by default. "I make a default disk label, I put btrfs or XFS on there, my disk label is gone." Real intuitive. -
x86 MSDOS partition table layout starts counting with sector 1, which is (not so intuitively) starting at 0x7e00 (and there's no sector 0, probably for safety). Well, each ptable format with its own quirks. -
I haven't followed this thread, but in case it matters -- this sounds fairly confused. Not sure what you're saying, but the MSDOS partition table has its root table in the very first sector on the disk, at offset 0x1be = 0x200 - 4 * sizeof(struct partition) - 2 (that is, 4 entries at the end of that first sector, followed by a 2-byte signature). That 0x7e00 that you are speaking of sounds somewhat like the _memory_ address the BIOS loads that first sector to: 0x7c00. It then jumps there to start the ball rolling but 0x7c00 is not an on-disk reality or anything. MS-DOS partition tables are furthermore fully outside the actual partitions themselves and as such I believe not all together relevant to the issue? (as said, not following along though...) Rene -
From: Jan Engelhardt <jengelh@computergmbh.de> You can go update every sparc machine's firmware then, you can't boot off of a device that doesn't use a sun disk label. Your expectations are unrealistic, and it's just simpler to do something sane and follow the majority of existing practice by putting the superblock a few blocks into the partition. -
Will all the users in the world who think about super block location when they partition their disks please raise their hands? The location of the super block needs to be very simple in order for mount and friends to find and detect it. It needs a simple algorithm to try multiple locations in case a given copy of the super is corrupt. Design in this case is a bunch of compromises around other users of the hardware, ease of programming, and the benefits in performance or usability IO is already aligned on sectors, sometimes we'll have a perfect erasure block alignment and sometimes not. When the location of the super is my biggest bottleneck, I'll be a very happy boy. -chris -
From: Chris Mason <chris.mason@oracle.com>
I couldn't even make a filesystem on sparc64 without the following
patch.
The first problem is that these SETGET macros lose typing information,
and therefore can't see the 'packed' attribute and therefore take
unaligned access SIGBUS signals on sparc64 when trying to derefernce
the member.
The next problem is a similar issue in btrfs_name_hash(). This gets
passed things like &key.offset which is a member of a packed
structure, losing this packed'ness information btrfs_name_hash()
performs a potentially unaligned memory access, again resulting in a
SIGBUS.
This function never returns an error, so the simplest fix was to
return the hash value which avoids all of the issues. In attempting
other schemes to fix this, I found it very difficult to give gcc
a packed attribute for that "u64 *" argument other than to create
some new pseudo structure which would have been ugly.
Similar code lives in the btrfs kernel code too, I'll try to get a
partition at least mounted and working minimally and if successful
I'll send you patches for that too.
diff -up --recursive --new-file vanilla/btrfs-progs-0.12/ctree.h btrfs-progs-0.12/ctree.h
--- vanilla/btrfs-progs-0.12/ctree.h 2008-02-06 08:37:45.000000000 -0800
+++ btrfs-progs-0.12/ctree.h 2008-02-10 16:53:24.000000000 -0800
@@ -451,18 +451,16 @@ static inline void btrfs_set_##name(stru
static inline u##bits btrfs_##name(struct extent_buffer *eb, \
type *s) \
{ \
- unsigned long offset = (unsigned long)s + \
- offsetof(type, member); \
- __le##bits *tmp = (__le##bits *)(eb->data + offset); \
- return le##bits##_to_cpu(*tmp); \
+ unsigned long offset = (unsigned long)s; \
+ type *p = (type *) (eb->data + offset); \
+ return le##bits##_to_cpu(p->member); \
} \
static inline void btrfs_set_##name(struct extent_buffer *eb, \
type *s, u##bits val) \
{ \
- unsigned long offset = (unsigned long)s + \
...Many thanks, I clearly didn't put enough thought into the unaligned access The kernel is actually worse, because the set/get macros are more complex. Some live in ctree.h like in the progs, but the nasty ones live in struct-funcs.c -chris -
From: Chris Mason <chris.mason@oracle.com> This is really problematic, because you've got these things called "btrfs_item_ptr()" which really isn't a pointer, it's a relative 'unsigned long' offset cast to a pointer. The source of this seems to be btrfs_leaf_data(). And then those things get passed down into the SETGET functions! Then deeper down we have terribly inconsistent things like btrfs_item_nr_offset() and btrfs_item_offset_nr(). Sigh... I'll see what I can do. -
Explaining it won't make it pretty, but at least I can tell you what the code does. This is all part of the btrfs code that supports tree block sizes larger than a page. The extent_buffer code (extent_io.c) provides a read/write api into an extent_buffer based on offsets from the start of the multi-page buffer. That's where the relative unsigned long comes from. The part where I cast it to pointers is me trying to maintain type checking throughout the code. The pointers themselves are useless, they need to be matched with an extent_buffer to actually get to the bytes. There are a few parts where the SETGET funcs are open coded, mostly in very Btree blocks have the offset of the item header from the start of the block Thanks -chris -
| david | Re: Dual-Licensing Linux Kernel with GPL V2 and GPL V3 |
| Natalie Protasevich | [BUG] New Kernel Bugs |
| Brice Goglin | Re: Linux 2.6.21-rc2 |
| Christian Volkmann | 2.6.23-rc6: usb 1-1: device not accepting address 2, error -62 |
git: | |
| Jon Smirl | Re: VCS comparison table |
| Nicolas Pitre | Re: git to libgit2 code relicensing |
| Giuseppe Bilotta | [PATCHv3 0/4] gitweb: remote heads feature |
| Mark Burton | Git commit won't add an untracked file given on the command line |
| Nick Guenther | Re: Real men don't attack straw men |
| Tanvir | Re: Adobe Flash on OpenBSD |
| Marius ROMAN | 1440x900 resolution problem |
| Gilles Chehade | Re: wpi (4) and 4.4 |
| Johann Baudy | Re: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING |
| Volker Armin Hemmann | build error with 2.6.27.6+reiser4+ehci-hub patch. ERROR: "mii_ethtool_gset" [drive... |
| Patrick Ohly | [RFC PATCH 00/13] hardware time stamping + igb example implementation |
| Laszlo Attila Toth | [PATCHv7 3/5] Interface group: core (netlink) part |
