I'm working on an SQL database which sustains inserts of over 1 million rows per second with one index. The problem is that I'm filling 2TB in about 3 hours and I don't want to create a lot of individual little 2TB files. :-) What options exist in the current CentOS or RedHat release, which I believe is still 2.6.9 based? I've heard of ext4, but it is still experimental and only in 2.6.19, which I think I'd have to build myself. Selling a system with my hacked OS vs. an off the shelf commercial RedHat raises issues with many customers. Besides, oprofile seems to be broken in 2.6.19 and I need to profile to get to 1.5 million or more inserts per second. Another feature I'm looking for in any of the endless competing "and available" filesystems is extent based allocation. Dropping a 2TB file on ext2 takes half an hour. Do any of the current ones deal with that? While I know that some consider raw disk to be deprecated, can a raw device be as large as the largest supported partition size for a raid device?
Thanks.
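For readers who want the numbers behind the question, here is a rough back-of-envelope sketch. It assumes "2TB" means 2 * 10**12 bytes and that the 1 million rows per second rate holds for the full 3 hours; the function name is made up for illustration.

```python
# Back-of-envelope check of the figures quoted in the post above.
# Assumption: "2TB" means 2 * 10**12 bytes (decimal terabytes).
TB = 10**12


def ingest_stats(bytes_written, hours, rows_per_second):
    """Return (bytes per second, average bytes per row) for a steady ingest."""
    seconds = hours * 3600
    bytes_per_second = bytes_written / seconds
    bytes_per_row = bytes_per_second / rows_per_second
    return bytes_per_second, bytes_per_row


rate, row_size = ingest_stats(2 * TB, 3, 1_000_000)
print(f"sustained write rate: {rate / 1e6:.0f} MB/s")
print(f"average bytes per row, data plus index: {row_size:.0f}")
print(f"2TB files filled per day: {24 // 3}")
```

Under those assumptions the sustained rate works out to roughly 185 MB/s, or about 185 bytes per row including the index, and a new 2TB file every 3 hours means 8 of them per day.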
google helps
It's fairly easy to get the specs for a particular FS and magazines like Linux Journal have printed FS comparisons. Of course if you want to play Russian roulette you can look at the Wikipedia entry:
http://en.wikipedia.org/wiki/Comparison_of_file_systems
At least you can get the names of prospective file systems and then check the facts. Since your files are multi-terabyte you want a filesystem with an exabyte (EiB) or petabyte (PiB) file size limit (peta is smaller than exa). Since you're already playing in the TiB range I just assume you have LVM or something similar to make a logical volume span several discs or RAIDs.
Try XFS and / or wait for RHES5
XFS was built and debugged for very large filesystems. It has a write rate of > 1GB/Sec
onto 16 current sata drives (SAS or sata controllers, raid0). 2.6.9 is older than Moses
and about as spry. You can wait a few months and use RH5 which has a kernel from this century
at least. Deleting a 4TB file on XFS is nearly a zero time operation (not hours with ext3).
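One likely reason for the difference in delete times is that ext2/ext3 keep a pointer for every data block, while an extent based filesystem such as XFS records runs of contiguous blocks. The sketch below counts the indirect blocks a classic ext2/ext3 block map would need for a file of about 2 TiB. It assumes 4 KiB blocks, 4 byte block pointers and 12 direct pointers in the inode; the function name is hypothetical, and real delete times also depend on seeks, journaling and caching.

```python
def indirect_blocks(file_bytes, block_size=4096, pointer_size=4):
    """Count indirect blocks in a classic ext2/ext3 block map
    (12 direct pointers, then single, double and triple indirection)."""
    ptrs = block_size // pointer_size          # pointers per indirect block
    data_blocks = -(-file_bytes // block_size)  # ceiling division
    remaining = max(0, data_blocks - 12)        # blocks not covered by direct pointers
    level1 = -(-remaining // ptrs)              # blocks that point at data blocks
    level2 = -(-max(0, level1 - 1) // ptrs)     # blocks that point at level1 blocks
    level3 = -(-max(0, level2 - 1) // ptrs)     # block that points at level2 blocks
    return level1 + level2 + level3


count = indirect_blocks(2 * 2**40)
print(f"indirect blocks to walk: {count}")
print(f"block-map metadata: {count * 4096 / 2**30:.2f} GiB")
```

So freeing such a file means reading and releasing on the order of half a million metadata blocks, about 2 GiB of pointers, whereas a well laid out extent based file could need only a handful of extent records.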
I would not run a kernel under 2.6.18.6. 2.6.19 has issues, 2.6.19.* work well. 2.6.20-rc5 has performance on par with 2.6.18.6 as far as XFS goes. The rpms are freely available (mkfs.xfs, xfsdump, dmapi ...).
Of course I use a reasonably configured 4 cpu opteron, with 16GB or more... If you are
doing DB work at those insert rates you likely are doing better than a Xeon with its 600MB/Sec northbridge and 32 bit address (oops, 36 bit address - you're not playing with a full DEC :-)
Just my $0.02
Thanks
Thanks. I'll look into xfs on the web. It doesn't seem to be there on my CentOS using 2.6.9 and "yum" doesn't show it as one of the yum'able packages either. I think I've seen a download for it on the web.
Do you have any personal experience of high end PCI Express controllers and disk vendors for maximum sustained write rates? We are using a 3ware 16 port controller. The 8 way hardware raid 0 I created on it is doing 550MB per second.
We are running a 4 CPU/8 core AMD box with 32GB of memory.
I don't want to create a lot
Why not?
The why
At the point I add the feature to fragment/partition the table by some key or round robin the data I'd consider this. If this were a simple 'C' program I was using to write a lot of data I'd do that. But the project to add data partitioning of a table to a commercial SQL database engine is at least a 6 man month project. I wasn't expecting the general limit of a file's size on off the shelf versions of Linux to be only 2X the size of a single currently available disk drive.
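For the "simple program" case mentioned above, the rollover idea can be sketched in a few lines. This is only an illustration of splitting an append-only stream across files that each stay under a size limit, not how any SQL engine partitions a table; the class name, file naming scheme and parameters are all made up.

```python
import os


class SegmentedWriter:
    """Append records to numbered segment files, starting a new segment
    whenever the next record would push the current one past the limit."""

    def __init__(self, directory, prefix, max_segment_bytes):
        self.directory = directory
        self.prefix = prefix
        self.max_segment_bytes = max_segment_bytes
        self.paths = []      # segment files created so far, in order
        self._file = None
        self._size = 0

    def _roll(self):
        # Close the current segment (if any) and open the next numbered one.
        if self._file is not None:
            self._file.close()
        name = f"{self.prefix}.{len(self.paths):06d}"
        path = os.path.join(self.directory, name)
        self._file = open(path, "wb")
        self._size = 0
        self.paths.append(path)

    def write(self, record):
        if len(record) > self.max_segment_bytes:
            raise ValueError("record larger than a whole segment")
        if self._file is None or self._size + len(record) > self.max_segment_bytes:
            self._roll()
        self._file.write(record)
        self._size += len(record)

    def close(self):
        if self._file is not None:
            self._file.close()
            self._file = None
```

A caller would create it with a limit a little under the filesystem's maximum file size, call write() per record and close() at the end. The hard part in a database is everything this leaves out: indexes, transactions and queries that span segments.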
on ext2/3 partition, XFS and
on ext2/3 partition, XFS and ReiserFS support bigger files than this.
increase sector size
mke2fs has a -T option for large files. Maybe you could break the 2TB limit that way.
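Whether that helps can be estimated from the on-disk format. The sketch below computes the maximum file size for classic ext2/ext3 at each block size, assuming my understanding of the format of that era is right: 12 direct pointers plus single, double and triple indirect blocks with 4 byte pointers, and a 32 bit per-file count of 512 byte sectors. The function name is hypothetical and the results are approximate.

```python
def ext3_max_file_bytes(block_size):
    """Approximate max file size for classic ext2/ext3 at a given block size."""
    ptrs = block_size // 4                       # 4 byte block pointers
    mappable_blocks = 12 + ptrs + ptrs**2 + ptrs**3
    block_map_limit = mappable_blocks * block_size
    sector_count_limit = 2**32 * 512             # 32 bit count of 512 byte sectors
    return min(block_map_limit, sector_count_limit)


for bs in (1024, 2048, 4096):
    print(f"{bs} byte blocks: {ext3_max_file_bytes(bs) / 2**30:.0f} GiB")
```

If those assumptions hold, 4 KiB blocks already hit the sector count cap at about 2 TiB, so a different mke2fs usage type would probably not lift the limit by itself.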
XFS rpms, performance #s
-rw-r--r-- 1 bshands dssi 43765 May 16 2006 dmapi-2.2.1-1.x86_64.rpm
-rw-r--r-- 1 bshands dssi 23456 May 16 2006 dmapi-devel-2.2.1-1.x86_64.rpm
-rw-r--r-- 1 bshands dssi 348147 May 16 2006 xfsdump-2.2.30-1.x86_64.rpm
-rw-r--r-- 1 bshands dssi 2355200 Oct 17 12:41 xfsdump_2.2.42-1.tar
-rw-r--r-- 1 bshands dssi 1016392 May 16 2006 xfsprogs-2.7.3-1.x86_64.rpm
-rw-r--r-- 1 bshands dssi 4495360 Oct 17 12:42 xfsprogs_2.8.11-1.tar
-rw-r--r-- 1 bshands dssi 277710 May 16 2006 xfsprogs-devel-2.7.3-1.x86_64.rpm
913238abb3da18109a29783aeff170cc dmapi-2.2.1-1.x86_64.rpm
7885e1785b237ee7d3fea84a3126caed dmapi-devel-2.2.1-1.x86_64.rpm
83b138168237196a6715c86bd409e4df xfsdump-2.2.30-1.x86_64.rpm
f3d26b84f3a81fbc668af1beaaa3ddb1 xfsprogs-2.7.3-1.x86_64.rpm
fbead358ab08fc1259f015e5d43f261c xfsprogs-devel-2.7.3-1.x86_64.rpm
I have used *MOST* sata and SAS controllers, 8 and 16 port units, pci-x and pci-e
RR2240, RR2220, RR2320 from highpoint. See the linuxmafia web page on sata controllers,
and sas controllers
http://linuxmafia.com/faq/Hardware/sata.html#hp
I found the RR2320 was the best sata controller, though they have a binary blob driver :-(
this makes it harder to use current kernels.
For SAS, which tunnels SATA-II, I've tried the Adaptec 48300 (broken, don't use),
and the LSI 8408E/8480E - works great, except on super micro H8DC8, where you MUST
turn off read ahead or the pci-e locks up.
The LSI8888ELP is currently under test :-)
If you are a little smarter about writing - see me offline - you can sustain 1GB/Sec over the first 2 TB. Then it is down to 900MB/Sec. Use better drives (7200.10 500GB or 750GB)
and the write rate goes up. Read rates are > 1.2GB/Sec.
Watch the posix_fadvise() note on the linuxmafia site.
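I can't vouch for what that note says, but one common use of posix_fadvise() in bulk-write workloads is to tell the kernel that already-written data will not be read again, so the page cache does not fill up with it. A minimal sketch of that pattern, assuming Linux and Python 3; the function name and the drop_every parameter are made up for illustration:

```python
import os


def stream_write(path, chunks, drop_every=64 * 1024 * 1024):
    """Write chunks sequentially; every drop_every bytes, flush to disk and
    advise the kernel that the flushed range is no longer needed in cache."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    written = 0
    dropped_upto = 0
    try:
        for chunk in chunks:
            view = memoryview(chunk)
            while view:                      # os.write may write less than asked
                n = os.write(fd, view)
                view = view[n:]
                written += n
            if written - dropped_upto >= drop_every:
                os.fsync(fd)                 # only clean pages can be dropped
                os.posix_fadvise(fd, dropped_upto, written - dropped_upto,
                                 os.POSIX_FADV_DONTNEED)
                dropped_upto = written
        os.fsync(fd)
    finally:
        os.close(fd)
    return written
```

The advice is only a hint, so the kernel is free to ignore it, and whether it helps at these write rates is something to measure rather than assume.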
Beware of controllers that use the Intel IOP. They have small buffer sizes, and limit throughput.
You might consider a TSTCOM ESR-316 chassis or the ESR-324 chassis. Native SAS/Sata support.
Berkley
I cannot believe people are still recommending XFS
If you have a power failure with XFS, the open file becomes binary zeroes (Google will confirm this). I guess 2TB file will go bye bye.
SGI, which went bankrupt, "supports" XFS file system. When you report binary zero problem -- "should not happen!".
The solution I like, Reiserfs, currently has Hans Reiser on trial for murdering his wife. Still it is much better than XFS.
The Linux community is very strange, that is why no progress is made. Probably 75% think Linux is only good for a web server, something that any idiot can setup in a few minutes ("apt-get install apache2 php5 mysql" -- done!). Only a small fraction use Linux as a desktop, preferring M$ for that use. This same kind of "community" will be recommending XFS 20 years from now -- they are frozen in time, always recommending same thing as in 1998. However, average age of Linux user is probably about 60, maybe they will be dead rather than recommending XFS in twenty years.
Sorry for not posting in Linux flames.
you're strange ...
If you have a very large disk array and are susceptible to power failures I don't know what the hell you're doing in the computer industry. You are saying that XFS will fail in conditions which shouldn't happen (unless the system designer is a complete moron). I have no idea what you mean by "Google will confirm this" - do you mean Google techs will say that this is a current problem or do you mean that a Google search will claim it is a problem? Either way, it is not possible to make that amount of data disappear from disk. If, after taking all reasonable measures, you somehow managed to lose power while doing a write to a database, you can certainly recover all the previous data even though a bug in a filesystem (assuming this bug exists) may have caused incorrect data to be written and makes the file seem to disappear. It really takes an idiot to lose the data.
Performance or a larger rock to crawl under?
The original poster wanted performance that ext3 could not deliver. The larger cluster file systems deliver performance. If you don't invest in your infrastructure, even with ext3, you risk losing what you have not backed up at least 3 times. This is much like FAT-16 -> FAT-32 -> NTFS... Who would have thought 10MB was not going to be enough disk space? You should look at the challenges that the government poses to file system creators. 1 Million files in a directory over NFS? An fsck run of billions of files? How long does it take to *delete* those files?
You really have to pick your filesystem to suit your use. Multi-terabyte files don't work on 20 year old FAT file systems. Have you ever seen a 2TB file under WinDoze?
Come on, quit flaming because it is fun! More signal, less noise. I can buy 1TB Sata disks now. Put 16 of them into a sata or sas raid. How long does mkfs take?
Nap time again.
If you are working with such
If you are working with such large files, why not look at a parallel FS such as GPFS or Polyserve which shine with large block I/O workloads.
what's wrong with small files
I'm working on an SQL database which sustains inserts of over 1 million rows per second with one index. The problem is that I'm filling 2TB in about 3 hours and I don't want to create a lot of individual little 2TB files.
I'll admit to ignorance about the world of high-performance databases.
But I have to ask-- what's so bad about small files? It's a little bit of an administrative headache on your end, but surely not that bad.
It's probably a bad idea to depend on any bleeding-edge features in the kernel or filesystem.
No doubt you're already running something like RAID 5 or LVM to get a volume large enough to even bring up this issue. Isn't that enough adventure for now?