Hi,
(I'm not subscribed to the list, please CC me.)
In our software, three processes use several pthread mutexes.
Sometimes a process hangs inside pthread_mutex_lock even though the mutex is
not locked. I can tell it's not locked because another process is still
running and locking and unlocking the mutex.
When I attach gdb to the hanging process, I get this backtrace:
#0 0xffffe430 in __kernel_vsyscall ()
#1 0xf7d5e7a9 in __lll_lock_wait () from /lib/libpthread.so.0
#2 0xf7d59c75 in _L_lock_288 () from /lib/libpthread.so.0
#3 0xf7d596c5 in pthread_mutex_lock () from /lib/libpthread.so.0
#4 0xf7e444b6 in pthread_mutex_lock () from /lib/libc.so.6
#5 0x080fb338 in <myfunction> (mutexP=0x70f4165c)
And the mutexP looks like this:
$1 = {__data = {__lock = 0, __count = 0, __owner = 0, __kind = 1,
    __nusers = 0, {__spins = 0, __list = {__next = 0x0}}},
  __size = '\0' <repeats 12 times>, "\001\000\000\000\000\000\000\000\000\000\000",
  __align = 0}
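For reference, here is a minimal sketch of how a mutex used by several processes is commonly set up. It is an illustration under assumptions, not the code from the software described here: the post does not show how the mutexes are created or initialized. The helper names create_shared_mutex and destroy_shared_mutex are hypothetical, the anonymous shared mapping assumes the processes are related by fork(), and the recursive type is chosen only because __kind = 1 in the dump above appears to correspond to a recursive mutex in glibc.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: place one mutex in memory that stays shared with
 * child processes created later by fork(). Returns NULL on failure. */
pthread_mutex_t *create_shared_mutex(void)
{
    pthread_mutex_t *m = mmap(NULL, sizeof *m, PROT_READ | PROT_WRITE,
                              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (m == MAP_FAILED)
        return NULL;

    pthread_mutexattr_t attr;
    if (pthread_mutexattr_init(&attr) != 0) {
        munmap(m, sizeof *m);
        return NULL;
    }

    /* Mark the mutex as usable from more than one process. */
    int rc = pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);

    /* Assumption: recursive type, to mirror __kind = 1 in the dump. */
    if (rc == 0)
        rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    if (rc == 0)
        rc = pthread_mutex_init(m, &attr);

    pthread_mutexattr_destroy(&attr);
    if (rc != 0) {
        munmap(m, sizeof *m);
        return NULL;
    }
    return m;
}

/* Hypothetical helper: destroy the mutex and unmap its memory.
 * Returns 0 on success, otherwise an error number or -1. */
int destroy_shared_mutex(pthread_mutex_t *m)
{
    int rc = pthread_mutex_destroy(m);
    if (rc != 0)
        return rc;
    return munmap(m, sizeof *m);
}
```

A process would call create_shared_mutex() once before fork(), and parent and children would then use pthread_mutex_lock() and pthread_mutex_unlock() on the returned pointer. Unrelated processes would need a named mapping (for example via shm_open) instead of an anonymous one.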
My guess is that the process had to wait for the mutex, but when the mutex
was unlocked the wakeup for the waiting process got lost.
Our software was built on a SUSE 10.1
kernel 2.6.16.13-4-default
glibc-32bit-2.4-27
system as a 32-bit binary. On this system the problem does not occur.
The system showing the problem is a dual processor SUSE 11.0
kernel 2.6.25.18-0.2-default
glibc-32bit-2.8-14.1
system. The problem also occurs when the second core is disabled in the
BIOS. Some more details of the system are listed below.
The hang does not resolve on its own. But if I send a
kill -STOP <pid>; kill -CONT <pid> sequence to the hanging process,
it continues.
Attaching gdb to the process and detaching again has the same effect.
Is there any way, either in configuring the system or pthread, to prevent
this problem?
Cheers,
Michael
openSUSE 11.0 (X86-64)
Linux 2.6.25.18-0.2-default #1 SMP 2008-10-21 16:30:26 +0200 x86_64 x86_64 x86_64 GNU/Linux
glibc-32bit-2.8-14.1
processor : ...