Looking for ideas on stability research for Sun X4100 M2

To: misc@openbsd.org <misc@...>
Date: Saturday, November 24, 2007 - 11:05 pm

Hi,

You may have seen a few posts from me about this box. I continue to try
to isolate the problem as much as possible, and it is now narrowed to a
more specific setup in the current kernel, but this box still WILL NOT
be stable whatsoever if used with the amd64.mp kernel.

I am running out of ideas to narrow it down further, so if anyone has
suggestions as to what I could look for, I would appreciate it.

I am starting to think that it might be running out of buffer space when
writing to the drives. I am not sure if that's logical or not.

However, I found a few ways to make it more stable, but not crash free.

All may be 100% related to the writing speed to the SAS drives.

To prove this point, or to discard it, I would like to find a way to
really control the writing speed and at the same time be able to monitor
the system variables, buffers, or whatever makes sense to isolate it more.
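
One way to control the writing speed without relying on slow machines
is a small rate-limited writer. The following is only a minimal sketch
under assumptions: the function name throttled_write, the chunk size,
and the use of an fsync after every chunk are illustrative choices, not
something taken from the tests described here, and Python would need to
be installed on the box.

```python
import os
import time


def throttled_write(path, total_bytes, bytes_per_sec, chunk_size=64 * 1024):
    """Write total_bytes of zeros to path at roughly bytes_per_sec.

    Hypothetical helper: returns the achieved average rate in bytes/sec.
    """
    chunk = b"\0" * chunk_size
    written = 0
    start = time.monotonic()
    with open(path, "wb") as f:
        while written < total_bytes:
            n = min(chunk_size, total_bytes - written)
            f.write(chunk[:n])
            f.flush()
            # Ask the kernel to push this chunk to the drive instead of
            # leaving it in the cache (an assumption about what to test).
            os.fsync(f.fileno())
            written += n
            # Sleep until the wall clock catches up with the schedule
            # implied by the target rate.
            delay = written / bytes_per_sec - (time.monotonic() - start)
            if delay > 0:
                time.sleep(delay)
    elapsed = time.monotonic() - start
    return written / elapsed if elapsed > 0 else float("inf")
```

For example, throttled_write("/var/test", 10 * 1024**3, 4 * 1024**2)
would aim for about 4MB/sec over a 10GB file. Repeating the run with a
higher rate each time, and noting the last rate that survived somewhere
off the box, would give a rough threshold.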

I say this because if I transfer slowly, using slow old servers over
the network, I can transfer big files to that Sun box, but as soon as I
increase the writing speed to the drive, I reach the point where it
crashes.

It is ALWAYS writing, and only writing, that crashes it. Reading as
heavily as you want, so far anyway, just doesn't trigger the problem.

I proved that by mounting the partitions RO and noatime in fstab. Yes, I
need both; noatime is important in my tests anyway.

I can create a 10GB file and then do cp /var/test /dev/null, and it will
transfer that file at 32MB/sec without crashing. I can do multiple
reads, etc.

However, as soon as ANY write is done, as small as it might be, while
the drive is very busy (or, I guess, while the driver buffer, or control,
or whatever it is that I am trying to isolate is full or loaded), a
simple small write to the drive like echo 'test' >/tmp/test will crash
that box every time. Even a simple ssh login to that box will make it go
south when it tries to update /var/log/authlog.
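
The small-write-under-load case can be scripted so it is repeatable.
This is a rough sketch only: heavy_writer and probe_small_writes are
made-up names, the load file is capped at an arbitrary size, and the
paths are whatever the tester passes in. It keeps a background thread
writing to one file while the main thread does tiny synced writes to
another, recording how long each one takes.

```python
import os
import threading
import time


def heavy_writer(path, stop_event, max_bytes, chunk_size=1024 * 1024):
    """Keep writing zeros to path until stop_event is set.

    The file is rewritten from the start once max_bytes is reached, so
    it never grows past that cap.
    """
    chunk = b"\0" * chunk_size
    written = 0
    with open(path, "wb") as f:
        while not stop_event.is_set():
            if written + chunk_size > max_bytes:
                f.seek(0)
                written = 0
            f.write(chunk)
            f.flush()
            os.fsync(f.fileno())
            written += chunk_size


def probe_small_writes(load_path, probe_path, count=5, interval=0.1,
                       max_load_bytes=64 * 1024 * 1024):
    """Do `count` tiny synced writes while heavy_writer is running.

    Returns the latency of each small write in seconds.
    """
    stop = threading.Event()
    worker = threading.Thread(
        target=heavy_writer, args=(load_path, stop, max_load_bytes))
    worker.start()
    latencies = []
    try:
        for i in range(count):
            t0 = time.monotonic()
            with open(probe_path, "a") as f:
                f.write("test %d\n" % i)
                f.flush()
                os.fsync(f.fileno())
            latencies.append(time.monotonic() - t0)
            time.sleep(interval)
    finally:
        # Always stop the background load, even if a probe write fails.
        stop.set()
        worker.join()
    return latencies
```

Something like probe_small_writes("/var/test", "/tmp/test") would mimic
the echo 'test' >/tmp/test case while the drive is busy. If the box
goes down, the last line that made it into the probe file shows how far
the run got.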

I was also able to increase the writing speed to the drive before it
crashes, or the size I can transfer before it crashes, if I also disable
the USB virtual support for the SCSI cdrom that is provided by the BIOS
on that box.

So, I am looking for ideas on where I could look to come up with more
details and possibly fix that box. It's much better than it was with the
4.2 release, by far, as many issues were fixed, including auto
negotiation of the network card, etc. on that box. The short of it is
that the box is not usable at all when you load the amd64.mp kernel on
it, but it is now finally stable, or for sure more resistant anyway,
when using the single processor amd64 kernel, and so far I haven't been
able to crash it in more than two months of testing if I use the i386
single or mp kernel.

But as far as I can see, there are still bug(s) to be found in the
amd64.mp kernel and I am looking for ideas on how to narrow it down more.

I am running out of tricks so far and need more ideas.

Maybe some sysctl variable can be tried to test my theory, maybe not.

But based on my tests, it looks like it might be some kind of buffer
that runs out, given that slower writing doesn't crash it. I want to
prove or disprove that, but do not know how, other than doing it the way
I did, using slow computers, or setting the port speed to 10Mb as an
example, etc. That may also be a very stupid idea, but it looks
possible.

Most likely there is something in the driver that crashes it. However,
my understanding is that the driver here is the same regardless of
whether it is used in the single processor or the mp kernel.

I tried to see what might be different in the kernel between the two
that might affect this, but I have to admit that at this point I am in
over my head trying to find which part is the most logical one to look
at.

So, any suggestions for testing that anyone might have would be welcome.

I have totally given up on using the amd64.mp kernel on these boxes and
I am happily using i386.mp, but I would still love to find the final
answers, as almost all bugs for that box in the last 6 months have been
resolved. It's much better than it was, but not home free yet.

Best,

Daniel

Messages in current thread:
Looking for ideas on stability research for Sun X4100 M2, Daniel Ouellet, (Sat Nov 24, 11:05 pm)
Re: Looking for ideas on stability research for Sun X4100 M2, Daniel Ouellet, (Sun Nov 25, 12:12 am)