CRC32C big endian bugs...

Previous thread: Re: [2.6.24 REGRESSION] BUG: Soft lockup - with VFS by Oliver Pinter on Tuesday, February 5, 2008 - 2:48 pm. (8 messages)

Next thread: Re: [RFC] ext3 freeze feature by Takashi Sato on Wednesday, February 6, 2008 - 6:05 pm. (1 message)
From: Chris Mason
Date: Wednesday, February 6, 2008 - 10:00 am

Hello everyone,

I wasn't planning on releasing v0.12 yet, and it was supposed to have some 
initial support for multiple devices.  But, I have made a number of 
performance fixes and small bug fixes, and I wanted to get them out there 
before the (destabilizing) work on multiple-devices took over.

So, here's v0.12.  It comes with a shiny new disk format (sorry), but the gain 
is dramatically better random writes to existing files.  In testing here, the 
random write phase of tiobench went from 1MB/s to 30MB/s.  The fix was to 
change the way back references for file extents were hashed.

Other changes:

Insert and delete multiple items at once in the btree where possible.  Back 
references added more tree balances, and it showed up in a few benchmarks.  
With v0.12, backrefs have no real impact on performance.

Optimize bio end_io routines.  Btrfs was spending way too much CPU time in the 
bio end_io routines, leading to lock contention and other problems.

Optimize read ahead during transaction commit.  The old code was trying to 
read far too much at once, which made the end_io problems really stand out.

mount -o ssd option, which clusters file data writes together regardless of 
the directory the files belong to.  There are a number of other performance 
tweaks for SSD, aimed at clustering metadata and data writes to better take 
advantage of the hardware.

mount -o max_inline=size option, to override the default max inline file data 
size (default is 8k).  Any value up to the leaf size is allowed (default 
16k).

Simple -ENOSPC handling.  Emphasis on simple, but it prevents accidentally 
filling the disk most of the time.  With enough threads/procs banging on 
things, you can still easily crash the box.

-chris
-

From: David Miller
Date: Sunday, February 10, 2008 - 6:12 pm

From: Chris Mason <chris.mason@oracle.com>

I couldn't even make a filesystem on sparc64 without the following
patch.

The first problem is that these SETGET macros lose typing information,
and therefore can't see the 'packed' attribute and therefore take
unaligned access SIGBUS signals on sparc64 when trying to derefernce
the member.

The next problem is a similar issue in btrfs_name_hash().  This gets
passed things like &key.offset which is a member of a packed
structure, losing this packed'ness information btrfs_name_hash()
performs a potentially unaligned memory access, again resulting in a
SIGBUS.

This function never returns an error, so the simplest fix was to
return the hash value which avoids all of the issues.  In attempting
other schemes to fix this, I found it very difficult to give gcc
a packed attribute for that "u64 *" argument other than to create
some new pseudo structure which would have been ugly.

Similar code lives in the btrfs kernel code too, I'll try to get a
partition at least mounted and working minimally and if successful
I'll send you patches for that too.

diff -up --recursive --new-file vanilla/btrfs-progs-0.12/ctree.h btrfs-progs-0.12/ctree.h
--- vanilla/btrfs-progs-0.12/ctree.h	2008-02-06 08:37:45.000000000 -0800
+++ btrfs-progs-0.12/ctree.h	2008-02-10 16:53:24.000000000 -0800
@@ -451,18 +451,16 @@ static inline void btrfs_set_##name(stru
 static inline u##bits btrfs_##name(struct extent_buffer *eb,		\
 				   type *s)				\
 {									\
-	unsigned long offset = (unsigned long)s +			\
-				offsetof(type, member);			\
-	__le##bits *tmp = (__le##bits *)(eb->data + offset);		\
-	return le##bits##_to_cpu(*tmp);					\
+	unsigned long offset = (unsigned long)s;			\
+	type *p = (type *) (eb->data + offset);				\
+	return le##bits##_to_cpu(p->member);				\
 }									\
 static inline void btrfs_set_##name(struct extent_buffer *eb,		\
 				    type *s, u##bits val)		\
 {									\
-	unsigned long offset = (unsigned long)s ...
From: Chris Mason
Date: Monday, February 11, 2008 - 6:42 am

Many thanks, I clearly didn't put enough thought into the unaligned access 

The kernel is actually worse, because the set/get macros are more complex.  
Some live in ctree.h like in the progs, but the nasty ones live in 
struct-funcs.c

-chris
-

From: David Miller
Date: Monday, February 11, 2008 - 11:43 pm

From: Chris Mason <chris.mason@oracle.com>

This is really problematic, because you've got these things called
"btrfs_item_ptr()" which really isn't a pointer, it's a relative
'unsigned long' offset cast to a pointer.  The source of this
seems to be btrfs_leaf_data().

And then those things get passed down into the SETGET functions!

Then deeper down we have terribly inconsistent things like
btrfs_item_nr_offset() and
btrfs_item_offset_nr().

Sigh... I'll see what I can do.
-

From: Chris Mason
Date: Tuesday, February 12, 2008 - 6:43 am

Explaining it won't make it pretty, but at least I can tell you what the code 
does.

This is all part of the btrfs code that supports tree block sizes larger than 
a page.  The extent_buffer code (extent_io.c) provides a read/write api into 
an extent_buffer based on offsets from the start of the multi-page buffer.  
That's where the relative unsigned long comes from.

The part where I cast it to pointers is me trying to maintain type checking 
throughout the code.  The pointers themselves are useless, they need to be 
matched with an extent_buffer to actually get to the bytes.

There are a few parts where the SETGET funcs are open coded, mostly in very 

Btree blocks have the offset of the item header from the start of the block 

Thanks

-chris
-

From: David Miller
Date: Tuesday, February 12, 2008 - 12:21 am

Filesystems like ext2 put their superblock 1 block into the partition
in order to avoid overwriting disk labels and other uglies.  UFS does
this too, as do several others.  One of the few exceptions I've been
able to find is XFS.

This is a real issue on sparc where the default sun disk labels
created use an initial partition where block zero aliases the disk
label.  It took me a few iterations before I figured out why every
btrfs make would zero out my disk label :-/

-

From: David Miller
Date: Tuesday, February 12, 2008 - 1:11 am

From: David Miller <davem@davemloft.net>

Actually it seems this is only a problem with mkfs.btrfs, it clears
out the first 64 4K chunks of the disk for whatever reason.

The following patch cures the disk label spamming problem for me:

--- vanilla/btrfs-progs-0.12/mkfs.c	2008-02-06 08:37:45.000000000 -0800
+++ btrfs-progs-0.12/mkfs.c	2008-02-12 00:07:43.000000000 -0800
@@ -210,7 +210,8 @@ int main(int ac, char **av)
 		exit(1);
 	}
 	memset(buf, 0, sectorsize);
-	for(i = 0; i < 64; i++) {
+	lseek(fd, BTRFS_SUPER_INFO_OFFSET, SEEK_SET);
+	for(i = BTRFS_SUPER_INFO_OFFSET / sectorsize; i < 64; i++) {
 		ret = write(fd, buf, sectorsize);
 		if (ret != sectorsize) {
 			fprintf(stderr, "unable to zero fill device\n");
-

From: Chris Mason
Date: Tuesday, February 12, 2008 - 6:49 am

It is a good idea to remove supers from other filesystems.  I also need to add 
zeroing at the end of the device as well.

Looks like I misread the e2fs zeroing code.  It zeros the whole external log 
device, and I assumed it also zero'd out the start of the main FS.

So, if Btrfs starts zeroing at 1k, will that be acceptable for you?

-chris
-

From: Jan Engelhardt
Date: Tuesday, February 12, 2008 - 7:00 am

Something looks wrong here. Why would btrfs need to zero at all?
Superblock at 0, and done. Just like xfs.
(Yes, I had xfs on sparc before, so it's not like you NEED the 
whitespace at the start of a partition.)

-

From: Chris Mason
Date: Tuesday, February 12, 2008 - 7:08 am

I've had requests to move the super down to 64k to make room for bootloaders, 
which may not matter for sparc, but I don't really plan on different 
locations for different arches.

4k aligned is important given that sector sizes are growing.

-chris
-

From: Jan Engelhardt
Date: Tuesday, February 12, 2008 - 7:21 am

In x86, there is even more space for a bootloader (some 28k or so)
even if your partition table is as closely packed as possible,
from 0 to 7e00 IIRC.

For sparc you could have something like

			startlba	endlba	type
	sda1		0		2	1 Boot
	sda2		2		58	3 Whole disk
	sda3		58		90000	83 Linux

and slap the bootloader into "MBR", just like on x86.
Or I am missing something..
-

From: Chris Mason
Date: Tuesday, February 12, 2008 - 7:35 am

It was a request from hpa, and he clearly had something in mind.  He kindly 
offered to review the disk format for bootloaders and other lower level 
issues but I asked him to wait until I firm it up a bit.

From my point of view, 0 is a bad idea because it is very likely to conflict 
with other things.  There are lots of things in the FS that need deep 
thought,and the perfect system to fully use the first 64k of a 1TB filesystem 
isn't quite at the top of my list right now ;)

Regardless of offset, it is a good idea to mop up previous filesystems where 
possible, and a very good idea to align things on some sector boundary.  Even 
going 1MB in wouldn't be a horrible idea to align with erasure blocks on SSD.

-chris
-

From: Jan Engelhardt
Date: Tuesday, February 12, 2008 - 8:04 am

>

I still don't like the idea of btrfs trying to be smarter than a user
who can partition up his system according to
	(a) his likes
	(b) system or hardware requirements or recommendations
to align the superblock to a specific location.

1MB alignment does not always mean 1MB alignment.
Sector 1 begins at 0x7e00 on x86.
And with the maximum CHS geometry (255/63), partitions begin
at 0x7e00+n*8225280 bytes, so the SB is unlikely to ever be on
a 1048576 boundary.
-

From: Chris Mason
Date: Tuesday, February 12, 2008 - 9:17 am

Will all the users in the world who think about super block location when they 
partition their disks please raise their hands?

The location of the super block needs to be very simple in order for mount and 
friends to find and detect it.  It needs a simple algorithm to try multiple 
locations in case a given copy of the super is corrupt.

Design in this case is a bunch of compromises around other users of the 
hardware, ease of programming, and the benefits in performance or usability 

IO is already aligned on sectors, sometimes we'll have a perfect erasure block 
alignment and sometimes not.  When the location of the super is my biggest 
bottleneck, I'll be a very happy boy.

-chris
-

From: David Miller
Date: Tuesday, February 12, 2008 - 4:38 pm

From: Jan Engelhardt <jengelh@computergmbh.de>

All of your beliefs are unfortunately without the understanding
of restrictions that exist in several partition layouts such
as the Sun disk label one.

You have to start the superblock somewhere other than zero or
else you lose a huge chunk of your disk, and furthermore a
zero based partition is what all of the Sun disk label
creating programs make by default.

"I make a default disk label, I put btrfs or XFS on there, my disk
 label is gone."

Real intuitive.
-

From: Jan Engelhardt
Date: Tuesday, February 12, 2008 - 4:42 pm

x86 MSDOS partition table layout starts counting with sector 1, which
is (not so intuitively) starting at 0x7e00 (and there's no sector 0,
probably for safety). Well, each ptable format with its own quirks.
-

From: David Miller
Date: Tuesday, February 12, 2008 - 6:09 pm

From: Jan Engelhardt <jengelh@computergmbh.de>

You can go update every sparc machine's firmware then, you
can't boot off of a device that doesn't use a sun disk
label.

Your expectations are unrealistic, and it's just simpler to
do something sane and follow the majority of existing practice
by putting the superblock a few blocks into the partition.
-

From: Rene Herman
Date: Tuesday, February 12, 2008 - 6:22 pm

I haven't followed this thread, but in case it matters -- this sounds fairly 
confused.

Not sure what you're saying, but the MSDOS partition table has its root 
table in the very first sector on the disk, at offset 0x1be = 0x200 - 4 * 
sizeof(struct partition) - 2 (that is, 4 entries at the end of that first 
sector, followed by a 2-byte signature).

That 0x7e00 that you are speaking of sounds somewhat like the _memory_ 
address the BIOS loads that first sector to: 0x7c00. It then jumps there to 
start the ball rolling but 0x7c00 is not an on-disk reality or anything.

MS-DOS partition tables are furthermore fully outside the actual partitions 
themselves and as such I believe not all together relevant to the issue? (as 
said, not following along though...)

Rene
-

From: David Miller
Date: Tuesday, February 12, 2008 - 4:35 pm

From: Chris Mason <chris.mason@oracle.com>

Starting at 0 is a bad idea because otherwise you'll waste
significant chunks of your disk on Sparc because of reasons
I've outlined in other replies.

What XFS does is really unfortunate, let's learn from it's
mistake.
-

From: Christoph Hellwig
Date: Wednesday, February 13, 2008 - 12:02 am

I'd rather say what Sun did with their disklabels was rather unfortunate :)
But yeah, new filesystem should cater for it's braindamage because it
doesn't have any kind of runtime cost at all.   XFS was designed for
IRIX back then which had disklabels just like the SUN ones, just without
the braindamage of having the disklabel inside the partition..

-

From: David Miller
Date: Tuesday, February 12, 2008 - 4:34 pm

From: Jan Engelhardt <jengelh@computergmbh.de>

You cannot specify partitions on LBA boundaries on sparc, the Sun disk
label specifies the partition start points in cylinders.
-

From: David Miller
Date: Tuesday, February 12, 2008 - 4:33 pm

From: Chris Mason <chris.mason@oracle.com>

The Sun disk label sits in the first 512 bytes and the boot loader
block sits in the second 512 bytes.

I think leaving even more space is a good idea for several reasons.
-

From: Jeff Garzik
Date: Tuesday, February 12, 2008 - 7:10 pm

Yep.  I chose 32K unused space in the prototype filesystem I wrote [1, 
2.4 era].  I'm pretty sure I got that number from some other filesystem, 
maybe even some NTFS incarnation.  It's just good practice to avoid the 
first and last "chunks" of a partition, FSVO chunk.

	Jeff


[1] http://kernel.org/pub/linux/kernel/people/jgarzik/ibu/
-

From: Szabolcs Szakacsits
Date: Wednesday, February 13, 2008 - 5:51 pm

NTFS superblock (and the partial mirror copy) can be anywhere except in the 
first blocks. That space is where the $BOOT file is placed which contains 
the bootstrap code and the BIOS Paramether Block which includes the NTFS 
signature and describes various filesystem information needed to locate the 
superblock, etc.

Unlike mkfs.xfs which warns since at least 2002 and requires the -f option 
to override Sun disklabels, at the moment mkfs.ntfs will indeed destroy 
them. 

Thank you for the bug report and let's hope the next generation of Sun 
hardwares won't scatter the firmware too into random places inside a 
partition encoded by a fictitious size of disk cylinder.

	Szaka
-

From: David Miller
Date: Tuesday, February 12, 2008 - 4:26 pm

From: Jan Engelhardt <jengelh@computergmbh.de>

You actully do unless you want to lose significant chunks of your disk
space.

The Sun disk label only allows you to specify the start of a partition
in cylinders, so if you want to use a filesystem like XFS you have to
start the partition on cylinder 1 which can be many blocks into the
disk.  That entire first cylinder is completely wasted.

What XFS does by putting the superblock at zero is simply does not
take these kinds of issues into consideration.
-

From: Jan Engelhardt
Date: Tuesday, February 12, 2008 - 4:39 pm

Ok you do have a point there. The GPT users win of course, since it
uses LBA, not cyls, so the number of lost bytes is generally below
a cyl.
On the other hand, the H and S of CHS could be lowered and S increased,
e.g. divide H by 2, divide S by 2, multiply S by 4. This gives a finer

Well it was designed for a different system initially with
a different style of booting.
-

From: David Miller
Date: Tuesday, February 12, 2008 - 6:08 pm

From: Jan Engelhardt <jengelh@computergmbh.de>

That's really not an option when you only have 16-bits in the disk
label for some of these values, which is the case for Sun disk
labels and some others (f.e. BSD).

Jan, please get past this, it's a real issue and we therefore have to
skip the initial 1K or so in order to provide something even
approaching sane.

No modern filesystem should put it's superblock at position zero.

-

From: Bryan Henderson
Date: Tuesday, February 12, 2008 - 6:25 pm

I don't believe a cylinder of wasted disk space is significant.  I don't 
believe a cylinder of disk is worth adding the complexity of sharing a 
partition to a filesystem.  That complexity translates into engineering 
time and mistakes.

Your other arguments for making a hole in the filesystem, based on 
tradition, are more convincing.

--
Bryan Henderson                     IBM Almaden Research Center
San Jose CA                         Filesystems

-

From: David Miller
Date: Tuesday, February 12, 2008 - 4:28 pm

From: Jan Engelhardt <jengelh@computergmbh.de>

So that existing superblocks on the partition won't
be interpreted as correct by other filesystems.  It's

No, we won't do stupid things like that and make an entire
cylinder of our disks unusable.  See my other reply.

-

From: Theodore Tso
Date: Tuesday, February 12, 2008 - 5:45 pm

The reason why we don't put the superblock at 0 is not because it
screws over the sparc, but because on many systems (including x86) the
bootsector is stored at 0.  It's not hard for mke2fs to zap the boot
sector which we do on all architectures *except* sparc, to avoid
nuking the disk label.  (Chris just missed the "#ifndef __sparc //
#define ZAP_BOOTBLOCK // #endif" at the beginning of mke2fs.c)

This is the best of all words; it makes sparc happy; it allows boot
loaders to put the x86 standard initial stage 0 boot loader in the
first 446 bytes of the disk; and by zapping sector 0 on all
architectures except the sparc, it solves the previous filesystem
"ghost traces" detection problem for filesystems like xfs that put the
superblock at 0.

						- Ted
-

From: David Miller
Date: Tuesday, February 12, 2008 - 1:50 pm

From: Chris Mason <chris.mason@oracle.com>

Sure.
-

From: David Miller
Date: Tuesday, February 12, 2008 - 2:23 am

The CRC32C implementation in the btrfs progs is different from the one
in the kernel, so obviously nothing can possibly work on big-endian.

This is getting less and less fun by the minute, I simply wanted to
test btrfs on Niagara :-/

Here is a patch to fix that:

--- vanilla/btrfs-progs-0.12/crc32c.c	2008-02-06 08:37:45.000000000 -0800
+++ btrfs-progs-0.12/crc32c.c	2008-02-12 01:19:33.000000000 -0800
@@ -91,13 +91,11 @@ static const u32 crc32c_table[256] = {
  * crc using table.
  */
 
-u32 crc32c_le(u32 seed, unsigned char const *data, size_t length)
+u32 crc32c_le(u32 crc, unsigned char const *data, size_t length)
 {
-	u32 crc = (__force __u32)(cpu_to_le32(seed));
-	
 	while (length--)
 		crc =
 		    crc32c_table[(crc ^ *data++) & 0xFFL] ^ (crc >> 8);
 
-	return le32_to_cpu((__force __le32)crc);
+	return crc;
 }
-

From: David Miller
Date: Tuesday, February 12, 2008 - 2:55 pm

From: Chris Mason <chris.mason@oracle.com>

Any page size larger than 4K will not work with btrfs.  All of the
extent stuff assumes that PAGE_SIZE <= sectorsize.

I confirmed this by forcing mkfs.btrfs to use an 8K sectorsize on
sparc64 and I was finally able to successfully mount a partition.

With 4K there are zero's in the root tree node header, because it's
extent's location on disk is at a sub-PAGE_SIZE multiple and the
extent code doesn't handle that.

You really need to start validating this stuff on other platforms.
Something that isn't little endian and something that doesn't use 4K
pages.  I'm sure you have some powerpc parts around somewhere. :)

Anyways, here is a patch for the kernel bits which fixes most of the
unaligned accesses on sparc64.

diff -u --recursive --new-file vanilla/btrfs-0.12/ctree.h btrfs-0.12/ctree.h
--- vanilla/btrfs-0.12/ctree.h	2008-02-06 08:37:39.000000000 -0800
+++ btrfs-0.12/ctree.h	2008-02-10 17:17:49.000000000 -0800
@@ -495,22 +495,17 @@
 #define BTRFS_SETGET_HEADER_FUNCS(name, type, member, bits)		\
 static inline u##bits btrfs_##name(struct extent_buffer *eb)		\
 {									\
-	char *kaddr = kmap_atomic(eb->first_page, KM_USER0);		\
-	unsigned long offset = offsetof(type, member);			\
-	u##bits res;							\
-	__le##bits *tmp = (__le##bits *)(kaddr + offset);		\
-	res = le##bits##_to_cpu(*tmp);					\
-	kunmap_atomic(kaddr, KM_USER0);					\
+	type *p = kmap_atomic(eb->first_page, KM_USER0);		\
+	u##bits res = le##bits##_to_cpu(p->member);			\
+	kunmap_atomic(p, KM_USER0);					\
 	return res;							\
 }									\
 static inline void btrfs_set_##name(struct extent_buffer *eb,		\
 				    u##bits val)			\
 {									\
-	char *kaddr = kmap_atomic(eb->first_page, KM_USER0);		\
-	unsigned long offset = offsetof(type, member);			\
-	__le##bits *tmp = (__le##bits *)(kaddr + offset);		\
-	*tmp = cpu_to_le##bits(val);					\
-	kunmap_atomic(kaddr, KM_USER0);					\
+	type *p = kmap_atomic(eb->first_page, KM_USER0);		\
+	p->member = ...
From: Chris Mason
Date: Tuesday, February 12, 2008 - 3:03 pm

Grin, I think around v0.4 I grabbed a ppc box for a day and got things 
working.  There has been some churn since then...

My first prio is the newest set of disk format changes, and then I'll sit down 

Many thanks, I'll try these out here and push them into the tree.

-chris
-

Previous thread: Re: [2.6.24 REGRESSION] BUG: Soft lockup - with VFS by Oliver Pinter on Tuesday, February 5, 2008 - 2:48 pm. (8 messages)

Next thread: Re: [RFC] ext3 freeze feature by Takashi Sato on Wednesday, February 6, 2008 - 6:05 pm. (1 message)