Discussion:
Segmentation fault in sigsuspend via pthread_mutex_lock
Joshua
2003-07-11 06:44:49 UTC
For some reason, about once a day my program crashes and generates a core
dump like this one.

I am using glibc 2.3.1 and gcc 3.2.2, and the program is in straight C.

*Any* ideas at all, no matter how simple or vague, would be greatly
appreciated, as I am totally out of ideas.

Thanks, Josh. core below.

backtrace
#0 0x40044447 in pthread_handle_sigrestart () from /lib/libpthread.so.0
#1 <signal handler called>
#2 0x400d8bf9 in sigsuspend () from /lib/libc.so.6
#3 0x40044838 in __pthread_wait_for_restart_signal () from /lib/libpthread.so.0
#4 0x40046070 in __pthread_alt_lock () from /lib/libpthread.so.0
#5 0x40043037 in pthread_mutex_lock () from /lib/libpthread.so.0
#6 0x08053a69 in fmm_alloc (fmm_t=0x8079398) at fmm_alloc.c:8
#7 0x08054cf8 in fmm_avl_insert (t=0x80a877c, item=0x489d39e3) at fmm_avl.c:215
#8 0x0804d679 in avl_add_email (email=0x88f6cf9 "peaches24o", file_offset=768380, status=1 '\001') at avl_add_email.c:43
#9 0x0804d84f in avl_add_email_from_file_info (f_info=0x8125d08) at avl_add_email_from_file_info.c:21
#10 0x080560ae in load_list (thread_data=0x40a1b990) at load_list.c:47
#11 0x40042160 in pthread_start_thread () from /lib/libpthread.so.0
Joshua
2003-07-11 22:58:18 UTC
From the trace, one can deduce that you are using Linux. Which kernel?
Furthermore, it looks like you are using LinuxThreads (LT), but
I'd like to have confirmation. Or are you using the Native POSIX
Thread Library (NPTL)?
2.4.20, I do not know how to check whether I am using LT or NPTL. How would
I go about doing that?
Post by Joshua
*Any* ideas at all, no matter how simple or vague, would be greatly
appreciated, as I am totally out of ideas.
Let's see if I can help.
#12 A new thread is created.
#11 Your thread routine starts.
...
#5 Your thread locks a mutex, probably initialized with the default
attributes.
#4 Internal Pthread function is called for the mutex locking.
#3 The mutex has been already locked by another thread.
#2 Your thread gets suspended.
#1 Your thread is woken up: it acquires the mutex.
#0 Your thread executes the signal handler corresponding to the
"restart signal"
Assuming LT, the only thing that pthread_handle_sigrestart does in
this case, is an instruction similar to thr_descr->p_signal = sig.
IOW, it sets a thread local storage variable.
The fact that this instruction generates a SIGSEGV suggests that
1- either there is a BUG in the Pthreads lib,
Possibly
2- or your program writes somewhere in the stack where it shouldn't
(for instance, an array bounds overflow).
I have extensively valgrind'd my program, and it works perfectly, e.g. no
warnings, no use of uninitialised data, etc.

Josh
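
The suspend/restart sequence described for frames #3 to #0 (block in sigsuspend(), get woken by a "restart signal", run its handler) can be reproduced in a small standalone program. This is only a simplified sketch of the general pattern, not the LinuxThreads source: SIGUSR1 stands in for the restart signal, the global last_signal stands in for thr_descr->p_signal, and the names handle_restart, waiter and demo_suspend_restart are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Written by the handler; stands in for thr_descr->p_signal (assumption). */
static volatile sig_atomic_t last_signal = 0;

/* Stand-in for pthread_handle_sigrestart: just record the signal number. */
static void handle_restart(int sig)
{
    last_signal = sig;
}

/* Stand-in for a thread blocked in __pthread_wait_for_restart_signal. */
static void *waiter(void *arg)
{
    sigset_t wait_mask;
    (void)arg;

    /* SIGUSR1 is blocked here (mask inherited from the creating thread).
       Fetch the current mask and remove SIGUSR1 from the copy, so that
       the signal can only be delivered while inside sigsuspend(). */
    pthread_sigmask(SIG_SETMASK, NULL, &wait_mask);
    sigdelset(&wait_mask, SIGUSR1);

    while (last_signal == 0)
        sigsuspend(&wait_mask);   /* frame #2 in the backtrace */
    return NULL;
}

/* Returns the signal number seen by the handler, or -1 on setup failure. */
int demo_suspend_restart(void)
{
    struct sigaction sa;
    sigset_t block, old;
    pthread_t tid;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = handle_restart;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    /* Block SIGUSR1 before creating the thread so it starts out blocked. */
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    pthread_sigmask(SIG_BLOCK, &block, &old);

    last_signal = 0;
    if (pthread_create(&tid, NULL, waiter, NULL) != 0) {
        pthread_sigmask(SIG_SETMASK, &old, NULL);
        return -1;
    }
    pthread_kill(tid, SIGUSR1);   /* the "restart" */
    pthread_join(tid, NULL);

    pthread_sigmask(SIG_SETMASK, &old, NULL);
    return last_signal;
}
```

Calling demo_suspend_restart() from main() (built with "gcc -pthread") should return SIGUSR1. A debugger attached while the waiter is blocked would show it inside sigsuspend(), much like frame #2 above.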
Joshua
2003-07-13 01:59:02 UTC
Hi Josh!
Post by Joshua
From the trace, one can deduce that you are using Linux. Which kernel?
Furthermore, it looks like you are using LinuxThreads (LT), but
I'd like to have confirmation. Or are you using the Native POSIX
Thread Library (NPTL)?
2.4.20, I do not know how to check whether I am using LT or NPTL. How
would I go about doing that?
AFAIK, NPTL is only present on the Red-Hat v9.0 distro. Other distros
are using LT.
There has been a post from Paul on this newsgroup on how to find that
out. Basically, you run ldd on your binary to see which libc
your program is linked against. Then you execute that libc to retrieve the
information. Assuming that the libc is /lib/i686/libc.so.6, just run:
$ /lib/i686/libc.so.6
GNU C Library stable
[...]
linuxthreads-0.9 by Xavier Leroy
[...]
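
On glibc 2.3.2 and later, the threading library can also be queried from inside a program with confstr() and _CS_GNU_LIBPTHREAD_VERSION (the shell equivalent is "getconf GNU_LIBPTHREAD_VERSION"). A minimal sketch follows; the helper name threading_library is made up for illustration, and on a C library that does not define the constant (glibc 2.3.1, for instance) the helper simply reports failure.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: copy the threading library's name and version
   (e.g. a string starting with "linuxthreads" or "NPTL") into buf.
   Returns 0 on success, -1 if the information is not available or
   does not fit into buf. */
int threading_library(char *buf, size_t len)
{
#ifdef _CS_GNU_LIBPTHREAD_VERSION
    /* confstr() returns the size needed including the trailing NUL,
       or 0 if the name is not supported. */
    size_t n = confstr(_CS_GNU_LIBPTHREAD_VERSION, buf, len);
    return (n > 0 && n <= len) ? 0 : -1;
#else
    (void)buf;
    (void)len;
    return -1;
#endif
}
```

Printing buf after a successful call gives the same kind of answer as the "linuxthreads-0.9" line in the libc banner above.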
Post by Joshua
I have extensively valgrind'd my program, and it works perfectly, e.g. no
warnings, no use of uninitialised data, etc.
Did you use the latest version (1.9.6)? If not, please use that
one. Did you "valgrind" your program until the SIGSEGV popped up? If
not, then please do this test first.
There has been a major architectural change in glibc 2.3.x that has
impacted the Pthreads lib: the introduction of the so-called Thread
Local Storage (TLS) using thread registers. It means that the LT code
has been reworked in glibc 2.3.x.
If valgrind fails to detect any problem, the next thing I would try is
to check your code on a system with a glibc version < 2.3 (e.g.
2.2.5). The point is that on 2.2.x, the "internal variables" of the
Pthreads lib are stored differently. It doesn't use thread registers,
but a fixed relationship between the stack pointer and the position
of the thread descriptor.
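
To make the "fixed relationship" idea concrete, here is a rough sketch of how a descriptor address could be derived from the stack pointer, assuming each thread's stack sits in its own region aligned to a power-of-two size with the descriptor placed at the very top of that region. The 2 MB region size, the descriptor size passed in, and the name descr_from_sp are assumptions for illustration, not values taken from glibc.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed per-thread stack region size; must be a power of two. */
#define ASSUMED_STACK_SIZE ((uintptr_t)2 * 1024 * 1024)

/* Hypothetical helper: round sp up to the top of its aligned stack
   region, then step back by the size of the descriptor stored there. */
uintptr_t descr_from_sp(uintptr_t sp, size_t descr_size)
{
    uintptr_t top = (sp | (ASSUMED_STACK_SIZE - 1)) + 1;
    return top - descr_size;
}
```

Under such a layout the descriptor lives in the same memory region as the thread's own stack, which is one way a stray write in the program could end up damaging the library's internal variables.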
If you fail to reproduce the problem, we might suspect the Pthread
implementation changes in glibc 2.3.x. If you can reproduce the
problem, then we shall see...
Loic.
I think I have solved the problem... I updated from a 2.4.20 kernel to a
2.4.21 kernel and the program has not crashed in > 24 hours (it was crashing
hourly before this). And as for the threads, I believe Gentoo has a more
recent version.

Available extensions:
GNU libio by Per Bothner
crypt add-on version 2.1 by Michael Glad and others
linuxthreads-0.10 by Xavier Leroy
BIND-8.2.3-T5B
libthread_db work sponsored by Alpha Processor Inc
NIS(YP)/NIS+ NSS modules 0.19 by Thorsten Kukuk


Thanks to all of you for your help (Paul too). I will keep your
responses in case I have problems in the future.

Paul Pluzhnikov
2003-07-12 18:42:38 UTC
Post by Joshua
For some reason, about once a day my program crashes and generates a core
dump like this one.
#0 0x40044447 in pthread_handle_sigrestart () from /lib/libpthread.so.0
#1 <signal handler called>
#2 0x400d8bf9 in sigsuspend () from /lib/libc.so.6
#3 0x40044838 in __pthread_wait_for_restart_signal () from
I have seen quite a few core dumps on Linux threads, where the
stack trace has absolutely nothing to do with the place where
the program actually SIGSEGVd.

[I had just reproduced such behaviour on a trivial example (attached)
on an RH-7.0 using kernel 2.2.17, and the stack matches the one
above exactly].

Run the example on your system (a couple of times), analyze its
core, and see if you get stack traces that end with sigsuspend().

If you do, run '/sbin/sysctl -w kernel.core_uses_pid=1' and try
again. You may find the results much better then.

Cheers,
--
In order to understand recursion you must first understand recursion.

---- cut --- thr-crash.c ---
/* compile with "gcc -g -pthread thr-crash.c" */
#include <pthread.h>
#include <unistd.h> /* sleep() */

pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

void *func(void *p)
{
    pthread_mutex_lock(&mtx); /* blocks until main() unlocks mtx */
    *(int *)p = 1; /* crash here: p is NULL */

    return p;
}

int main(void)
{
    pthread_t tid1, tid2;
    pthread_mutex_lock(&mtx);
    pthread_create(&tid1, 0, func, 0);
    pthread_create(&tid2, 0, func, 0);

    sleep(1); /* wait for 2 threads to block on the mtx */
    pthread_mutex_unlock(&mtx); /* let them run */
    sleep(1);

    return 0; /* unreached */
}
---- cut --- thr-crash.c ---
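
To check whether kernel.core_uses_pid is already enabled without calling sysctl, a program can read the corresponding procfs entry, /proc/sys/kernel/core_uses_pid (sysctl names map to paths under /proc/sys). A minimal sketch; the helper name core_uses_pid is made up for illustration.

```c
#include <stdio.h>

/* Hypothetical helper: returns 1 if kernel.core_uses_pid is enabled,
   0 if it is disabled, and -1 if the setting cannot be read
   (e.g. /proc is not mounted). */
int core_uses_pid(void)
{
    FILE *f = fopen("/proc/sys/kernel/core_uses_pid", "r");
    int value = 0;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &value) != 1) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return value != 0;
}
```

A result of 0 means core files are written without a PID suffix, which is the situation the sysctl command above is meant to change.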