Re: Processes spinning forever, apparently in lock_timer_base()?

From: Matthias Hensler
Date: Thursday, August 9, 2007 - 10:37 am

On Thu, Aug 09, 2007 at 09:55:34AM -0700, Andrew Morton wrote:

Not so far, sorry. The problem has not recurred since I started running
the kernel with that patch.


All affected systems have two disks which are cross-mounted, e.g. a
system has /home on disk 1 and /var/spool/imap on disk 2.

Each disk also holds the mirror partition for the other one (the mirror
of /home is on disk 2 and the mirror of /var/spool/imap is on disk 1).

Normally the system hangs when running rsync from /var/spool/imap to its
corresponding mirror partition.

We have around 10 different partitions (all in LVM), but so far the hang
has mostly started while running rsync on the IMAP spool (which holds by
far the most files on the system).

_However_: we also had this hang while running a backup to an external
FTP server (so only reading the filesystems apart from the usual system
activity).

And: the third system we had this problem on has no IMAP spool at all.
The hang occurred while running a backup together with "yum update".

It might be relevant that all the systems use these cross-mounts over
two different disks. I was not able to reproduce the hang on my home
system with only one disk, either because the activity pattern was
different or because it really needs two disks to run into the issue.


Well, whenever I was still connected to one of those machines it was
possible to clear the situation by killing a lot of processes. That
mostly means killing all httpd, smtpd, imapd, amavisd, crond,
spamassassin and rsync processes. Eventually the system responded
normally again (and I could reboot it cleanly). However, which process
had to be killed last to resolve the hang has been non-deterministic so
far.

The workload is different on all the servers:

Server 1 processes around 10-20k mails per day but also serves a lot
         of HTTP requests.
Server 2 is only a mailserver processing around 5-10k mails per day.
Server 3 just serves HTTP requests (and a bit of DNS).


I have to admit that I never tried to sync in such a case, since mostly
I had only one open SSH session and tried to find the root cause.

The problem first occurred on server 2 around March or April, without
changes to the machine (we keep a changelog for every server: there was
a single kernel update at that time, but after the first hang we
reverted to a kernel which had been stable for several weeks before, and
encountered the problem again). In the beginning we hit the problem
maybe twice a week, and it got worse over the following weeks. Several
major updates (kernel and distribution) were made without resolving the
problem.

From the end of April server 1 began showing the same problem (running a
different Fedora version and kernel at that time), first rarely (once a
week or so), then more regularly.

Around July we had a hang nearly every day.

Server 3 has had the problem only once so far, but that server does not
have a high workload.

We spent a lot of time investigating the issue, but since all servers
use different hardware, different setups (apart from the cross-mounts
with noatime) and even different base systems (Fedora Core 5+6, Fedora
7, different kernels: 2.6.18, .19, .20, .21 and .22), I think that we
can rule out hardware problems. I think the issue may have been present
for some time, but is hit more often now since the workload has
increased a lot over the last months.

We ruled a lot out over the months (e.g. syslog was replaced, many less
important services were stopped, the scheduler was changed), without
any change in behaviour. Only reverting from "noatime" to "defaults" in
the fstab has fixed it reliably so far.
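
For anyone who wants to check which filesystems on a box are currently
mounted with noatime, here is a minimal Python sketch. It is not
something from our servers; the function name noatime_mounts is made up
for illustration, and it assumes the usual /proc/mounts layout (device,
mount point, filesystem type, comma-separated options).

```python
def noatime_mounts(mounts_text):
    """Return the mount points whose option list contains 'noatime'.

    mounts_text is the content of /proc/mounts (assumed layout:
    device, mount point, fs type, comma-separated options, ...).
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point = fields[1]
        options = fields[3].split(",")
        # Exact match, so 'nodiratime' alone does not count.
        if "noatime" in options:
            result.append(mount_point)
    return result
```

On a live system it would be called with the content of /proc/mounts,
e.g. noatime_mounts(open("/proc/mounts").read()).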

As said, I am still running "vmstat 1" and periodically catting
/proc/meminfo just in case. If there is anything else I can do to
clarify the problem I will try to help.
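
The /proc/meminfo logging could look roughly like the following Python
sketch. This is only an illustration, not the exact script in use: the
names parse_meminfo and log_meminfo, the one-second interval and the
choice of fields (MemFree, Dirty, Writeback) are all assumptions.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo content into a {field: integer value} dict."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Lines look like "Dirty:   120 kB"; some have no unit.
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def log_meminfo(path="/proc/meminfo", interval=1.0, samples=5,
                fields=("MemFree", "Dirty", "Writeback")):
    """Print a timestamped line with the chosen fields, once per interval."""
    for _ in range(samples):
        with open(path) as f:
            info = parse_meminfo(f.read())
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        line = " ".join("%s=%s" % (k, info.get(k)) for k in fields)
        # flush so the last sample is on disk/terminal if the box hangs
        print(stamp, line, flush=True)
        time.sleep(interval)
```

Redirecting the output of log_meminfo() to a file on a filesystem that
is not involved in the hang would keep the last samples readable after a
forced reboot.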

Regards,
Matthias