2.6.1 I/O Very poor

Submitted by Curt
on January 29, 2004 - 12:29am

Not trying to be inflammatory, it's just what I was horrified to observe: the filesystem performance under my newly installed 2.6.1 kernel is, well, horrific.

Long story short- I have a simple test program you can try, to see for yourself: http://northarc.com/~curt/bigfile_test.C compile with: gcc -o bf -lpthread bigfile_test.C

Test platform is 512M of RAM, an Athlon 1.33GHz and some generic IBM drive, ext2 (no journaling); results are at the bottom. It's a vanilla installation of RedHat 7.2.

Long story slightly longer. HELP!!!!!!! I am the lead developer at one of the "enterprise software vendors who have been clamoring for the new threading model" (highwinds-software, UseNet server software).

I have been on pins and needles waiting for a stable release so I could test/certify our software on it; our customers are screaming for it and I want to give it to them, but the performance of our software on this kernel was pathetic. Stalls, halts, terrible.

I finally narrowed it down to the disk subsystem, and the test program shows the meat of it. When there is massive contention for a file, or just heavy (VERY heavy) volume, the 2.6.1 kernel (presumably the filesystem portion) falls over dead. The test program doesn't show death, but could by just upping the thrasher count a bit.

Where do I go with this? Anyone have any idea who I can take this test program to? I have been telling our customers for over a year now "Don't worry, Linux will be able to rock with the new threading model" and then.. this.. I want to be constructive here. Any advice would be appreciated, I'm new to the Linux community per se, though I've been developing on it for years.

RESULTS:

Changing ONLY the kernel and rebooting, I ran the program twice to make sure any buffers were flushed. This had a dramatic effect: the second run (and all subsequent ones; these results are representative) was consistently better, although the 2.6.1 implementation was still worse.

This test program accurately models the largest job our UseNet software does, randomly accessing ENORMOUS files. It creates a 2G file and then accesses it with and without contention.
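
For anyone who can't fetch the source, the shape of the test can be sketched roughly as below. This is not the original bigfile_test.C: it is a Python approximation that assumes the "accesses" are reads, it expects the big file to exist already, and every name in it (timed_random_reads, thrasher, run_benchmark) is made up for illustration.

```python
import os
import random
import threading
import time


def timed_random_reads(path, count, size, seed=0):
    """Time `count` reads of `size` bytes at random offsets in `path`."""
    rng = random.Random(seed)
    limit = os.path.getsize(path) - size  # file must hold at least `size` bytes
    fd = os.open(path, os.O_RDONLY)
    try:
        start = time.monotonic()
        for _ in range(count):
            os.pread(fd, size, rng.randint(0, limit))
        return time.monotonic() - start
    finally:
        os.close(fd)


def thrasher(path, size, stop, seed):
    """Issue random reads against the same file until `stop` is set."""
    rng = random.Random(seed)
    limit = os.path.getsize(path) - size
    fd = os.open(path, os.O_RDONLY)
    try:
        while not stop.is_set():
            os.pread(fd, size, rng.randint(0, limit))
    finally:
        os.close(fd)


def run_benchmark(path, count=1000, sizes=(8192, 2048), thrashers=4):
    """Run the timed reads alone, then again with thrasher threads running."""
    results = {}
    for size in sizes:
        results[("single", size)] = timed_random_reads(path, count, size)

    stop = threading.Event()
    workers = [
        threading.Thread(target=thrasher, args=(path, 8192, stop, i + 1))
        for i in range(thrashers)
    ]
    for w in workers:
        w.start()
    try:
        for size in sizes:
            results[("thrashers", size)] = timed_random_reads(path, count, size)
    finally:
        stop.set()
        for w in workers:
            w.join()
    return results
```

Calling run_benchmark("bigfile") returns elapsed seconds keyed by (mode, access size). The file needs to be much larger than RAM, as with the 2G file on a 512M machine here, or the reads are mostly served from the page cache and say little about the disk.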

[root|/usr/local/tornado_be/bin]$ uname -a
Linux professor.highwinds-software.com 2.6.1 #0 SMP Wed Jan 28 01:24:07 EST 2004 i686 unknown

bytes to write[2000027648]
time [227]
1000 random 8192-byte accesses (single threaded)
time [12]
1000 random 2048-byte accesses (single threaded)
time [11]
Now spawning 4 threads to kill the same file
1000 random 8192-byte accesses (with thrashers)
time [65]
1000 random 2048-byte accesses (with thrashers)
time [56]
[curt|/usr/local/test]$ ./bf bigfile
bytes to write[0]
time [0]
1000 random 8192-byte accesses (single threaded)
time [10]
1000 random 2048-byte accesses (single threaded)
time [10]
Now spawning 4 threads to kill the same file
1000 random 8192-byte accesses (with thrashers)
time [57]
1000 random 2048-byte accesses (with thrashers)
time [48]

[curt|/usr/local/test]$ uname -a
Linux professor.highwinds-software.com 2.4.7-10 #1 Thu Sep 6 16:46:36 EDT 2001 i686 unknown

bytes to write[2000027648]
time[139]
1000 random 8192-byte accesses (single threaded)
time[33]
1000 random 2048-byte accesses (single threaded)
time[13]
Now spawning 4 threads to kill the same file
1000 random 8192-byte accesses (with thrashers)
time[42]
1000 random 2048-byte accesses (with thrashers)
time[50]
[curt|/usr/local/test]$ ./bf bigfile
bytes to write[0]
time[0]
1000 random 8192-byte accesses (single threaded)
time[10]
1000 random 2048-byte accesses (single threaded)
time[9]
Now spawning 4 threads to kill the same file
1000 random 8192-byte accesses (with thrashers)
time[44]
1000 random 2048-byte accesses (with thrashers)
time[40]

Elevator

Con Kolivas
on
January 29, 2004 - 12:38am

Yours is probably the sort of load that might benefit from the deadline elevator. Try booting with the option "elevator=deadline". Oh and make sure you really do have full support for the chipset properly compiled in and dma working (some chipset names have changed so you may have missed it).
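
After rebooting, one way to confirm the option really reached the kernel is to look at the boot command line. A minimal sketch, assuming /proc/cmdline is readable on the machine; the helper names are hypothetical and the sample string in the usage note is invented.

```python
def elevator_from_cmdline(cmdline):
    """Return the value of the last elevator= boot parameter, or None."""
    value = None
    for token in cmdline.split():
        if token.startswith("elevator="):
            value = token[len("elevator="):]
    return value


def booted_elevator(path="/proc/cmdline"):
    """Read the running kernel's command line and extract elevator=."""
    with open(path) as f:
        return elevator_from_cmdline(f.read())
```

For example, elevator_from_cmdline("ro root=/dev/hda1 elevator=deadline") gives "deadline", and a command line without the option gives None, meaning the kernel is using whatever default it was built with.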

Curt
on
January 29, 2004 - 6:02am

Okay, I'll make sure the DMA works and I'll try the kernel argument, but you didn't mention whether or not the test program shows the same problem on your system.

-Curt

Indeed

Con Kolivas
on
January 29, 2004 - 6:30am

No I didn't because I didn't test it :-P. I do have a problem with streaming writes from capturing media dropping lots of frames despite very capable hardware that only goes away with elevator=deadline, and akpm has often said that for seeky loads anticipatory will never be as good as deadline; you seem to be doing both.

Curt
on
January 29, 2004 - 9:24am

I've done a bit of googling but can't find a good source for this. Is there a good place to read up on the Linux elevator code and 2.6.1 options/tweaks in general?

I have a feeling that the 2.6 kernel comes optimized/set for a different kind of application, I just need to know how to get it back to at least as good as it was in 2.4, then (presumably) with the threading "fixed" we should see an improvement.

Thanks for the help, anyone else?

-Curt

Documentation Lags

Con Kolivas
on
January 29, 2004 - 5:04pm

Documentation will lag behind development a great deal. All the information I use comes from reading the linux kernel mailing list itself and searching that will give you more info than any documentation online or printed. This site itself gives a good summary of interesting developments with links to lkml archives, and there are numerous lkml archives online worth searching.

2.6.1 I/O

Anonymous
on
January 29, 2004 - 8:25pm

I tried it. I didn't see the big difference in performance that you did. I have a 1GHz Athlon and a normal IDE drive with an ext3 file system. The kernels were 2.4.20-8 and 2.6.1. Both were compiled without SMP support. It looks like your 2.6.1 had SMP; I don't know how that would affect it, but it would add overhead.

2.6.1 kernel
bytes to write[2000027648]
time[60]
1000 random 8192-byte accesses (single threaded)
time[14]
1000 random 2048-byte accesses (single threaded)
time [11]
Now spawning 4 threads to kill the same file
1000 random 8192-byte accesses (with thrashers)
time [49]
1000 random 2048-byte accesses (with thrashers)
time [49]
bytes to write[0]
time[0]
1000 random 8192-byte accesses (single threaded)
time[10]
1000 random 2048-byte accesses (single threaded)
time [10]
Now spawning 4 threads to kill the same file
1000 random 8192-byte accesses (with thrashers)
time [51]
1000 random 2048-byte accesses (with thrashers)
time [54]

2.4.20-8 kernel
bytes to write[2000027648]
time[70]
1000 random 8192-byte accesses (single threaded)
time[13]
1000 random 2048-byte accesses (single threaded)
time [10]
Now spawning 4 threads to kill the same file
1000 random 8192-byte accesses (with thrashers)
time [48]
1000 random 2048-byte accesses (with thrashers)
time [48]
bytes to write[0]
time[0]
1000 random 8192-byte accesses (single threaded)
time[10]
1000 random 2048-byte accesses (single threaded)
time [11]
Now spawning 4 threads to kill the same file
1000 random 8192-byte accesses (with thrashers)
time [50]
1000 random 2048-byte accesses (with thrashers)
time [50]

Further comments

Con Kolivas
on
January 30, 2004 - 12:54am

I read your thread on lkml. I see you noticed it was probably mostly a DMA problem after all. This is a _very_ common cause of regressions when going to 2.6.

Good to see you spoke to the authorities too. Andrew & Nick are both good guys and will give you the attention you need.

Curt
on
January 30, 2004 - 9:50am

yup.

I'm just a poor software vendor trying to make sure our stuff works on 2.6.x so we can pass on the GoodNews(tm) to our customers.

After some tweaking and discussing I think we're not quite ready to open the floodgates, but it's a lot warmer-fuzzier than it was two days ago.

Thanks for the help and direction!

-Curt

I/O 2.6/2.4

hackus
on
February 1, 2004 - 10:21am

I have given your comments some thought, and I find that, first of all, the code you listed was C++ code, not C code.

For those of you following with limited interest in the topic matter, the correct compilation sequence for the c++ code in question requires a small modification from:

gcc -o bf -lpthread bigfile_test.C

to

g++ -o bf -lpthread bigfile_test.C

To which you give bf a file name to gain access to your local file system, and it happily creates all sorts of thread anarchy.

:-)

Now then, with regards to the following statement you made:

"I have been on pins and needles waiting for a stable release so I could test/certify our software on it; our customers are screaming for it and I want to give it to them, but the performance of our software on this kernel was pathetic. Stalls, halts, terrible."

First of all, you do realize that it is not the job of the OS to compensate for poorly written software, which is what your product is if it cannot detect a deadlock situation with either CPU/IO/Memory contention, causing bad user mojo.

Ok, well I ran your "test" program on two different systems.

2.4.23: (Server)
SCSI Subsystem

bytes to write[2000027648]
time[82]
1000 random 8192-byte accesses (single threaded)
time[34]
1000 random 2048-byte accesses (single threaded)
time [3]
Now spawning 4 threads to kill the same file
1000 random 8192-byte accesses (with thrashers)
time [7]
1000 random 2048-byte accesses (with thrashers)
time [6]

2.4.23:
IDE Bad Subsystem (Laptop)
Machine Profile:

bytes to write[2000027648]
time[110]
1000 random 8192-byte accesses (single threaded)
time[22]
1000 random 2048-byte accesses (single threaded)
time [9]
Now spawning 4 threads to kill the same file
1000 random 8192-byte accesses (with thrashers)
time [47]
1000 random 2048-byte accesses (with thrashers)
time [45]

2.6.1:

IDE Bad Subsystem (Laptop)

bytes to write[2000027648]
time[99]
1000 random 8192-byte accesses (single threaded)
time[9]
1000 random 2048-byte accesses (single threaded)
time [11]
Now spawning 4 threads to kill the same file
1000 random 8192-byte accesses (with thrashers)
time [58]
1000 random 2048-byte accesses (with thrashers)
time [57]

I could only run the test program on my laptop with 2.6.1. But the results are similar to yours, showing an increased time for random access on 2.6.1 kernels using an IDE subsystem.

You didn't mention what sort of IO subsystem you were using, but I will bet you a steak dinner at the Roadhouse here in Green Bay, Wisconsin that it was an IDE system.

However, the increase in time, around 15%, could push a deadlock issue with your threading design.
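
For reference, the relative change can be worked out directly from the laptop runs listed above. A small sketch using only the "with thrashers" figures from those two runs; pct_change is a hypothetical helper, and the time units are whatever the test program prints.

```python
def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# Laptop figures copied from the runs above: (2.4.23, 2.6.1)
laptop_with_thrashers = {
    "8192-byte": (47, 58),
    "2048-byte": (45, 57),
}

for label, (old, new) in laptop_with_thrashers.items():
    print(f"{label}: {pct_change(old, new):+.1f}%")
# prints:
# 8192-byte: +23.4%
# 2048-byte: +26.7%
```

These are single runs on one machine, so the percentages are only a rough indication.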

So what to do about it? Here is what I recommend:

1) I would try to ensure that the recommended hardware IO includes a SCSI subsystem. The server machine "test" run clearly shows that there is a dramatic difference in IO between SCSI and IDE components. The server machine in question easily outpaced my state of the art modern IDE machine, as well as yours too. Oh, and by the way, the server machine was a dual processor PIII 600 system on an Intel BX chipset with a PCI Adaptec 2940 board. (i.e. the machine is 4 years old.)

This advice is not something that is limited to Linux either.

IDE will never scale in IO's like SCSI can using multi channel I/O.

2) Your test program has flaws. I hope it is not representative of your application. The reason why I say this is because even the most disk intensive applications, such as Oracle/MySQL, are designed to properly handle multi-channel I/O.

In other words, if you expect an operating system kernel to resolve issues like this, you're living on Fantasy Island and the plane just arrived to take you back to reality...Tattoo, will you help him aboard? Thank You...

The reason why your test program raises red flags in this area is because your threads were spread across a single I/O channel in your test program. (i.e. all the threads were using the same file handle.)

You never do this in the real world, and even the Oracle admin who doesn't know a single line of C/C++ systems threading code knows that you put your database files with high contention on separate drives, which are on separate device channels.

Otherwise, surprise, as you ran your program you probably found top listing each thread as having around 80% idle time waiting for IO to finish.

I suggest rewriting the test program to spread the load over more than 1 file, or perhaps rewriting your application afterwards to fit the test program!

3) Finally, everyone knows 2.6.1 and IDE subsystem in general has its problems. It will get better, but not by much. This was the what, 3rd complete and total rewrite of the IDE system layers with 2.6? (Jens Axboe) Most of the big issues have been addressed, but it never ceases to amaze me that people think IDE is the way to go and SCSI is just over priced and not really needed.

I can assure you, that is not the case. Whether it be a Windows server or a Linux server, you have to properly design your server systems under the umbrella of a modern software architecture.

In your particular case, that means it is a good idea to have system programmers that understand how to build networks and server systems.
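
The suggestion in point 2, spreading the load over more than one file, could be tried with something along these lines. This is only a sketch under assumptions: it is Python rather than the original C++, it assumes the accesses are reads, each thread opens its own descriptor on its own file, and read_worker and spread_reads are invented names. Whether the files sit on separate drives or channels is up to whoever supplies the paths.

```python
import os
import random
import threading
import time


def read_worker(path, count, size, seed, out):
    """Do `count` random reads on a private descriptor; record elapsed time."""
    rng = random.Random(seed)
    limit = os.path.getsize(path) - size  # file must hold at least `size` bytes
    fd = os.open(path, os.O_RDONLY)       # descriptor not shared with other threads
    try:
        start = time.monotonic()
        for _ in range(count):
            os.pread(fd, size, rng.randint(0, limit))
        out[path] = time.monotonic() - start
    finally:
        os.close(fd)


def spread_reads(paths, count=1000, size=8192):
    """Run one reader thread per file and return elapsed seconds per path."""
    out = {}
    threads = [
        threading.Thread(target=read_worker, args=(p, count, size, i, out))
        for i, p in enumerate(paths)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out
```

Comparing the per-file times against a run where every thread hammers one file would show whether the single-file contention is really what hurts.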

---

Now my diatribe. I love my diatribes, and I hope you do too, so here it comes...

These problems and others result from software engineers that think just because they have degrees, they can write software for any application area. My advice to anyone, who wants to write Linux server/Kernel code or even end user application database software of any kind in Java, C etc: Do everyone a favor and spend 3-4 years being a network/server application administrator on a non trivial network (i.e. > 500 nodes)....

BEFORE YOU EVEN WRITE A SINGLE LINE OF CODE.

When it comes time to write code, you will KNOW what to look for, and what the problems are before you begin writing code.

Why does this help? Simply because you will spend a lot of time administrating and taking user complaints and tweaking hardware to get the most out of your badly written database application your company depends on, which you administrate.

:-)

-Hack

PS: Did I mention you spent WAY too much money for such application software? Oh, sorry.

Anonymous
on
July 15, 2004 - 7:55am

In certain cases, like in the example test, it's pretty ok to compile C++ code with gcc. g++ links with libstdc++ while gcc won't. Less dependency.
