Re: Kernel hangs in SMP + VMware environment.

Previous thread: Re: performance "regression" in cfq compared to anticipatory, deadline and noop by Jens Axboe on Tuesday, May 13, 2008 - 11:03 am. (23 messages)

Next thread: Build error: ARM badge4_defconfig by Russell King on Tuesday, May 13, 2008 - 11:58 am. (1 message)
From: Roland
Date: Tuesday, May 13, 2008 - 11:05 am

maybe related to http://bugzilla.kernel.org/show_bug.cgi?id=9834  ?

you say "recent", so does this happen from 2.6.21 through 2.6.26-rc2?
does it happen only on one particular vmware box, or on different ones?
are vmware-tools active? if so, does stopping them make a difference?

could you provide some more information about your hardware/vmware environment?
does that happen on esx or on hosted products (workstation, server, player..)?

regards
roland




List:       linux-kernel
Subject:    Kernel hangs in SMP + VMware environment.
From:       Tetsuo Handa <penguin-kernel () I-love ! SAKURA ! ne ! jp>
Date:       2008-05-12 21:41:34
Message-ID: 200805130641.CDG56299.QStFOLVOOJFHMF () I-love ! SAKURA ! ne ! jp

I'm experiencing a hang-up problem with recent kernels in a VMware environment.

Here are two examples.

http://I-love.SAKURA.ne.jp/tmp/messages.1 (203kB)
http://I-love.SAKURA.ne.jp/tmp/messages.2 (1.7MB)

messages.1 is a log taken when "tar" stopped making progress
while extracting a .tar.bz2 file during
"rpmbuild -bb --target i586 --with baseonly kernel.spec".
I got this log in runlevel 3 of Fedora 8.

messages.2 is a log taken when the compiler processes (e.g. "cc1")
seemed to hang (no compiler messages appeared for minutes,
which is unlikely to happen normally).
I got this log in runlevel 1 of Fedora 8
with "rpmbuild -bb --target i586 --with baseonly kernel.spec"
after starting rsyslog and stopping anacron.

I experience this problem in many distros (e.g. Fedora, Ubuntu, SuSE)
that use kernels (I think) around 2.6.21 and later.

I experience this problem only in VMware.
I have never experienced this problem in a native environment.

I experience this problem only when I assign 2 CPUs to the VMware guest.
I have never experienced this problem with 1 CPU.

Maybe it is something scheduler-related in an SMP + VMware environment.

The a.out process in the log files is http://lkml.org/lkml/2008/5/12/130
What other information should I dump to identify the location of the hang?

Regards.

--

From: Tetsuo Handa
Date: Wednesday, May 14, 2008 - 4:00 am

Hello.

Thank you for the URL.
I don't know the exact version, but I don't experience this problem

Hardware: ThinkPad X60 (Intel Core 2 Duo, 2048MB RAM, No swap partition)
VMware host environment: CentOS 5.1 (x86_64)
VMware version: VMware Workstation 6.0.2 (x86_64)
VMware guest environment: many distros using recent kernels (all i386)

I don't have an ESX server environment.



Today, I tried to reproduce this problem using the 2.6.24.5-85.fc8 kernel and
I got 2 patterns.


http://I-love.SAKURA.ne.jp/tmp/hangup-3.png      (10kB)
http://I-love.SAKURA.ne.jp/tmp/messages-3.txt   (174kB)

hangup-3.png is a screenshot of the hang and messages-3.txt contains the sysrq logs.
The funny thing is that the "tar" process sleeps for minutes at blk_remove_plug()
(while "tar" finishes within a minute with 1 CPU).


http://I-love.SAKURA.ne.jp/tmp/dmesg-4.txt      (120kB)

dmesg-4.txt is a partial output of "dmesg".
Since rsyslog sometimes cannot save logs to /var/log/messages for some reason,
I tried to save them directly from /proc/kmsg using "a.out",
but "a.out" couldn't save the logs either.
The funny thing is that the "a.out" process sleeps for minutes at getnstimeofday().
The source code of "a.out" is

  #include <stdio.h>
  #include <unistd.h>
  #include <time.h>
  #include <sys/types.h>
  #include <sys/stat.h>
  #include <fcntl.h>
  #include <string.h>
  
  int main(int argc, char *argv[]) {
  	FILE *fp = fopen("/proc/sys/kernel/sysrq", "w");
  	if (!fp) return 1;
  	fprintf(fp, "1\n");
  	fclose(fp);
  	fp = fopen("/proc/sysrq-trigger", "w");
  	if (!fp) return 1;
  	if (fork() == 0) {
  		int fd_r = open("/proc/kmsg", O_RDONLY);
  		int fd_w = open("/root/messages", O_WRONLY | O_TRUNC | O_CREAT, 0600);
  		char buffer[4096];
  		char timebuf[80];
  		memset(timebuf, 0, sizeof(timebuf));
  		memset(buffer, 0, sizeof(buffer));
  		while (1) {
  			const int len = read(fd_r, buffer, sizeof(buffer));
  			static time_t prev = 0;
  			const time_t now = time(NULL);
  			if (now != prev) {
  				static int ...
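
(The listing above is cut off in the archive. The following is not the
original "a.out". It is only a minimal sketch of the kind of loop the visible
part suggests: copy whatever can be read from one file descriptor to another,
and write a time marker whenever the wall-clock second has changed since the
previous read. The function names write_all() and copy_with_timestamps() and
the marker format are hypothetical, chosen for illustration.)

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Write all of buf to fd, retrying on short writes.
 * Returns 0 on success, -1 on error. */
static int write_all(int fd, const char *buf, size_t len)
{
	while (len > 0) {
		ssize_t n = write(fd, buf, len);
		if (n < 0)
			return -1;
		buf += n;
		len -= (size_t)n;
	}
	return 0;
}

/* Copy data from fd_r to fd_w until EOF or error.
 * Before a chunk is copied, a marker line with the current time is
 * written if the second has changed since the previous chunk
 * (hypothetical format; the original format is not known).
 * Returns the number of payload bytes copied, or -1 on error. */
long copy_with_timestamps(int fd_r, int fd_w)
{
	char buffer[4096];
	char timebuf[80];
	time_t prev = 0;
	long total = 0;

	assert(fd_r >= 0 && fd_w >= 0);
	for (;;) {
		ssize_t len = read(fd_r, buffer, sizeof(buffer));
		if (len < 0)
			return -1;
		if (len == 0)
			break;	/* EOF; a read on /proc/kmsg would block instead */
		time_t now = time(NULL);
		if (now != prev) {
			struct tm *tm = localtime(&now);
			if (!tm)
				return -1;
			size_t n = strftime(timebuf, sizeof(timebuf),
					    "--- %Y-%m-%d %H:%M:%S ---\n", tm);
			if (write_all(fd_w, timebuf, n) < 0)
				return -1;
			prev = now;
		}
		if (write_all(fd_w, buffer, (size_t)len) < 0)
			return -1;
		total += len;
	}
	return total;
}
```

In the setting described above, fd_r would be a descriptor opened on
/proc/kmsg with O_RDONLY (which needs root) and fd_w a log file such as
/root/messages. Taking descriptors as parameters lets the loop be tried on
pipes or ordinary files first.
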