Bugzilla – Bug 11
Failing/Non-terminating regression tests on OpenBSD.
Last modified: 2012-02-20 09:06:53 CET
Whilst updating the OpenBSD port of swi to the latest stable, I notice the following failing/non-terminating (script) regression tests: thread/agc - non-terminating thread/agc2 - non-terminating thread/queue_resource - a thread segfaults, non-terminating thread/test_agc_callback - non-terminating thread/test_signal - non-terminating thread/thread_agc_findall - non-terminating thread/thread_agc_queue - non-terminating In addition, the following non-script regression tests fail: Test unicode_file(absfile-1) failed Test unicode_file(ext-1) failed I am unsure if these bugs are in SWI or in OpenBSD, but they have been failing for some time (<= 5.10.4). My prolog know-how is limited, but I am interested in helping fix these.
Hi Edd, The unicode file-name tests require that the system is configured properly, notably runs a UTF-8 based locale. The other issue is more worrying. This would be interesting to know when this has started. Considering that so many of these tests fail indicates that that one of the thread-synchronization primitives in use to make AGC work doesn't work as expected on OpenBSD. I'd suspect the implementation of forThreadLocalData() in pl-thread.c. If you are an experienced C system hacker you should be able to figure out what happens there. You can run swipl -d 1 <args> to get output from the DEBUG statement in this file. If this fails, I guess that the best solution would be for me to install OpenBSD in a VirtualBox. Please provide version info in that case. Cheers --- Jan
OK, well, here is the debug output of the agc test: ?- agc. [2] Executing thread signal [2]: Prolog backtrace: [2]: end Prolog backtrace: Calling forThreadLocalData() from 2 Signalling 1 Signalled 1 threads. Waiting ... Done work on 1 (ok) All done! Calling forThreadLocalData() from 2 Signalling 1 Signalled 1 threads. Waiting ... Done work on 1 (ok) All done! Calling forThreadLocalData() from 2 Signalling 1 Signalled 1 threads. Waiting ... Done work on 1 (ok) All done! Calling forThreadLocalData() from 2 Signalling 1 Signalled 1 threads. Waiting ... Done work on 1 (ok) All done! Calling forThreadLocalData() from 2 Signalling 1 Signalled 1 threads. Waiting ... Done work on 1 (ok) All done! Calling forThreadLocalData() from 2 Signalling 1 Signalled 1 threads. Waiting ... I am building a swipl with debug symbols, so that I can interrupt the program at this point. Will keep you updated. I am running OpenBSD-current (aka. developer snapshots). If you want to experiment, I can show you how to get set up. I would reccommend you use the ports tree, as there are a few patches we apply to make swi build/package on OpenBSD. See here: http://www.openbsd.org/cgi-bin/cvsweb/ports/lang/swi-prolog/patches/ Cheers
(In reply to comment #2) > OK, well, here is the debug output of the agc test: > > > ?- agc. > [2] Executing thread signal > [2]: Prolog backtrace: > [2]: end Prolog backtrace: > Calling forThreadLocalData() from 2 > Signalling 1 > Signalled 1 threads. Waiting ... > Done work on 1 (ok) All done! > Calling forThreadLocalData() from 2 > Signalling 1 > Signalled 1 threads. Waiting ... > Done work on 1 (ok) All done! > Calling forThreadLocalData() from 2 > Signalling 1 > Signalled 1 threads. Waiting ... > Done work on 1 (ok) All done! > Calling forThreadLocalData() from 2 > Signalling 1 > Signalled 1 threads. Waiting ... > Done work on 1 (ok) All done! > Calling forThreadLocalData() from 2 > Signalling 1 > Signalled 1 threads. Waiting ... > Done work on 1 (ok) All done! > Calling forThreadLocalData() from 2 > Signalling 1 > Signalled 1 threads. Waiting ... Does it end here or does it block here? I have the impression that the thread_signal/2 doesn't arrive at the signalled thread. This would be a bit odd, as that code is fairly portable. > I am building a swipl with debug symbols, so that I can interrupt the program > at this point. Will keep you updated. > > I am running OpenBSD-current (aka. developer snapshots). If you want to > experiment, I can show you how to get set up. I would reccommend you use the > ports tree, as there are a few patches we apply to make swi build/package on > OpenBSD. See here: If you can give me a quick list of steps to setup the stuff on a virtual machine, this might be the quickest way out. Host is AMD64, so I can run either 64-bit or 32-bit VMs. > > http://www.openbsd.org/cgi-bin/cvsweb/ports/lang/swi-prolog/patches/ I'll walk through these and include stuff in my version if it appears to be generally useful. Cheers --- Jan
> > Calling forThreadLocalData() from 2 > > Signalling 1 > > Signalled 1 threads. Waiting ... > > Does it end here or does it block here? I have the impression that > the thread_signal/2 doesn't arrive at the signalled thread. This would > be a bit odd, as that code is fairly portable. It blocks. And a ctrl+c can not kill it. > > I am running OpenBSD-current (aka. developer snapshots). If you want to > > experiment, I can show you how to get set up. I would reccommend you use the > > ports tree, as there are a few patches we apply to make swi build/package on > > OpenBSD. See here: > > If you can give me a quick list of steps to setup the stuff on a virtual > machine, this might be the quickest way out. Host is AMD64, so I can run > either 64-bit or 32-bit VMs. Easiest way is like so: Get this (amd64): ftp://ftp.fr.openbsd.org/pub/OpenBSD/snapshots/amd64/install50.iso Boot and follow the install procedure. Login and do the following (we can do this all as root if we are just testing): % cd /usr % cvs -danoncvs@anoncvs.fr.openbsd.org:/cvs co ports ... takes a while... % cd lang/swi-prolog % export DEBUG=-g # <- makes with debug symbols % make build Once that is done, you will find the sources and binaries in: /usr/ports/pobj/swi-prolog-5.10.5 Hope this helps. Is there an irc channel where we can discuss swi?
Hrm, looks like teh debug symbols got stripped, annoying, but anyway, here is teh backtrace of a hang: (gdb) run -d 1 Starting program: /usr/local/bin/swipl -d 1 Starting Prolog Part of initialisation % library(swi_hooks) compiled into pce_swi_hooks 0.00 sec, 2,204 bytes Welcome to SWI-Prolog (Multi-threaded, 32 bits, Version 5.10.5) Copyright (c) 1990-2011 University of Amsterdam, VU Amsterdam SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software, and you are welcome to redistribute it under certain conditions. Please visit http://www.swi-prolog.org for details. For help, use ?- help(Topic). or ?- apropos(Word). ?- [agc]. % agc compiled into agc 0.00 sec, 2,284 bytes true. ?- agc. [2] Executing thread signal [2]: Prolog backtrace: [2]: end Prolog backtrace: Calling forThreadLocalData() from 2 Signalling 1 Signalled 1 threads. Waiting ... [New process 26219] ^C Program received signal SIGINT, Interrupt. 0x05fb2339 in poll () from /usr/lib/libc.so.60.1 (gdb) bt #0 0x05fb2339 in poll () from /usr/lib/libc.so.60.1 #1 0x019df365 in _thread_kern_poll (wait_reqd=1) at /usr/src/lib/libpthread/uthread/uthread_kern.c:782 #2 0x019e0233 in _thread_kern_sched (scp=0x0) at /usr/src/lib/libpthread/uthread/uthread_kern.c:383 #3 0x019e075f in _thread_kern_sched_state_unlock (state=PS_COND_WAIT, lock=0x80362258, fname=0x219cde14 "/usr/src/lib/libpthread/uthread/uthread_cond.c", lineno=432) at /usr/src/lib/libpthread/uthread/uthread_kern.c:602 #4 0x019dcae7 in pthread_cond_timedwait (cond=0x219cf120, mutex=0x219cf11c, abstime=0x899b0fc0) at /usr/src/lib/libpthread/uthread/uthread_cond.c:431 #5 0x019d7bae in _thread_gc (arg=0x0) at /usr/src/lib/libpthread/uthread/uthread_gc.c:181 #6 0x019d75ce in _thread_start () at /usr/src/lib/libpthread/uthread/uthread_create.c:242 #7 0x0000002b in ?? () #8 0x00000000 in ?? () Perhaps not as useful as we would like... Notice how only one thread is signalled this time, is this a race condition? The next run looked like this: ?- sgc. Correct to: "agc"? yes [2] Executing thread signal [2]: Prolog backtrace: [2]: end Prolog backtrace: Calling forThreadLocalData() from 2 Signalling 1 Signalled 1 threads. Waiting ... Done work on 1 (ok) All done! Calling forThreadLocalData() from 2 Signalling 1 Signalled 1 threads. Waiting ... Again, a different number of signals... Could be a pthread bug in OpenBSD ofcourse... Not sure.
(In reply to comment #4) > Get this (amd64): > ftp://ftp.fr.openbsd.org/pub/OpenBSD/snapshots/amd64/install50.iso I think I'll try that > Is there an irc channel where we can discuss swi? Not as far as I know. At least I'm not connected. --- Jan
> ?- sgc. > Correct to: "agc"? yes > [2] Executing thread signal > [2]: Prolog backtrace: > [2]: end Prolog backtrace: > Calling forThreadLocalData() from 2 > Signalling 1 > Signalled 1 threads. Waiting ... > Done work on 1 (ok) All done! > Calling forThreadLocalData() from 2 > Signalling 1 > Signalled 1 threads. Waiting ... > > Again, a different number of signals... I do not see the different number. There are two threads. One running agc_loop/0, which will be thread-id 2 and the main one. The main one creates atoms while the agc one collects them. This goes on for a while. The way this is supposed to work is that thread 2 initiates an atom-GC. As atoms are shared, it will signal all other threads (i.e. thread main or 1). These threads will mark the atoms they reference and signal a semaphore. If the AGC thread got a reply from all other threads it will collect the unreferenced atoms and continue execution. This doesn't seem to happen. Its been a long time ago that anything changed to this code and it has proven work on many platforms. One possibility is that the used signal (SIGUSR1 by default) is reversed for something else on OpenBSD. This was a problem on old Linux versions. What would be interesting to know is whether this bug is present on older OpenBSD versions. If this is new for 5.0, a bug in 5.0 becomes likely. I've downloaded OpenBSD and it is now installing the ports ... Cheers --- Jan
Edd, Got it working and reproduced. Things look bad. It turns out that OpenBSD doesn't handle signals while waiting for a mutex. This is essential for the operation of the atom and clause garbage collectors. There is one possible way-out: Windows also lacks signals and fixes the problem by suspending the other threads. OpenBSD seems to have pthread_suspend_np(), but it looks like this is built on top of signals. This might not be good enough: I must have the guarantee that the suspended thread is really suspended before pthread_suspend() returns, no matter what this thread is doing. Attached you find the test program. This should print "Called handler". What do you think? Is this a bug? Is there some setting that changes this behavior? Is thread-suspension usable? Note that disabling threading is another option. As we have recently seen on the list though, the single-threaded version no longer compiles out of the box. I have no plans to fix this because simply too much of the system and applications out there assume threads these days. Cheers --- Jan
Created attachment 7 [details] Test signal handling while waiting for a mutex Should print that it handled the signal. Nothing is printed on OpenBSD (5.0)
Ah, I just found a regression test (which succeeds) that is supposed to show that signals are delivered to threads correctly: http://www.openbsd.org/cgi-bin/cvsweb/src/regress/lib/libpthread/pthread_kill/pthread_kill.c?rev=1.4 Here, however, not mutex is involved. I notice that the manual page for pthread_mutex_unlock() does not suggest that this function can return EINTR (interrupted system call), so perhaps it is the case that signals are not handled during pthread_mutex_unlock(). Maybe this is OpenBSD specific. I don't know if that helps. I am going to try and find a thread hacker to ask.
Actually, it would seem anything like EINTR is irrelevant. On a linux machine, pthread_mutex_unlock() returns success (0).
(In reply to comment #10) > Ah, > > I just found a regression test (which succeeds) that is supposed to show that > signals are delivered to threads correctly: > http://www.openbsd.org/cgi-bin/cvsweb/src/regress/lib/libpthread/pthread_kill/pthread_kill.c?rev=1.4 Well, it shows some aspects of pthread_kill(), but not all we need. We need to make sure that the signal delivery is not blocked indefinitely, no matter what the thread is doing. The only exception is that the thread explicitly blocks the signal. Most notably, we need signal delivery under all blocking system calls, *including* pthread_mutex_lock() (and all the other blocking pthread functions). If that cannot be provided, we need a way to suspend a thread that will doesn't return before the thread is suspended and (again) doesn't block indefinitely. Having to go that way would make the code a lot dirtier :-( > Here, however, not mutex is involved. I notice that the manual page for > pthread_mutex_unlock() does not suggest that this function can return EINTR > (interrupted system call), so perhaps it is the case that signals are not > handled during pthread_mutex_unlock(). Maybe this is OpenBSD specific. I assume you mean *lock*. The unlock is non-blocking anyway :-). I don't know whether the standard has anything to say here (searched a bit, but the only thing I can find is that signal handlers are not allowed to call pthread_*; that is a different issue though). Surely, signals are handled while waiting for a lock on on most other POSIX systems, including freeBSD. Note that pthread_mutex_lock() doesn't return with EINTR, but restarts after the signal is handled. > I don't know if that helps. I am going to try and find a thread hacker to ask. I hope there is a solution, otherwise it means end of SWI-Prolog for OpenBSD :-( Cheers --- Jan
(In reply to comment #11) > Actually, it would seem anything like EINTR is irrelevant. On a linux machine, > pthread_mutex_unlock() returns success (0). I found this on http://pubs.opengroup.org/onlinepubs/007908799/xsh/pthread_mutex_lock.html (The Single UNIX ® Specification, Version 2): "If a signal is delivered to a thread waiting for a mutex, upon return from the signal handler the thread resumes waiting for the mutex as if it was not interrupted." Although this isn't really explicit, it suggests that the signal must be delivered (if not blocked). Cheers --- Jan
Hi Jan, Sorry to drag up the old conversation, but OpenBSD's new thread model has been enabled in developer snapshots. I decided to re-try your regression tests. This was the state of things when I started this thread: thread/agc - non-terminating thread/agc2 - non-terminating thread/queue_resource - a thread segfaults, non-terminating thread/test_agc_callback - non-terminating thread/test_signal - non-terminating thread/thread_agc_findall - non-terminating thread/thread_agc_queue - non-terminating And now: thread/agc - pass thread/agc2 - pass thread/queue_resource - pass thread/test_agc_callback - pass thread/test_signal - bus error thread/thread_agc_findall - pass thread/thread_agc_queue - seg / non-termination ---8<--- ?- test_signal. SWI-Prolog [thread 1]: received fatal signal 11 (segv) Stack trace labeled "crash": [0] 0xf3b9dec <backtrace+44> at /usr/local/lib/libexecinfo.so.1.0 [1] 0xa0713d4 <PL_strtod+9796> at /usr/local/lib/swipl-5.10.5/lib/i386-openbsd5.1/libswipl.so [2] 0xa071494 <PL_strtod+9988> at /usr/local/lib/swipl-5.10.5/lib/i386-openbsd5.1/libswipl.so [3] 0xa022085 <PL_install_readline+4789> at /usr/local/lib/swipl-5.10.5/lib/i386-openbsd5.1/libswipl.so [4] 0x6495004 [5] 0xa038a70 <resumeThreads+2736> at /usr/local/lib/swipl-5.10.5/lib/i386-openbsd5.1/libswipl.so [6] 0xa038d95 <resumeThreads+3541> at /usr/local/lib/swipl-5.10.5/lib/i386-openbsd5.1/libswipl.so [7] 0xa03a841 <PL_thread_raise+913> at /usr/local/lib/swipl-5.10.5/lib/i386-openbsd5.1/libswipl.so [8] 0x9fd38f7 <PL_next_solution+15831> at /usr/local/lib/swipl-5.10.5/lib/i386-openbsd5.1/libswipl.so [9] 0xa009d48 <PL_is_ground+3320> at /usr/local/lib/swipl-5.10.5/lib/i386-openbsd5.1/libswipl.so [10] 0x9fc69b2 <PL_toplevel+50> at /usr/local/lib/swipl-5.10.5/lib/i386-openbsd5.1/libswipl.so [11] 0x1c0009b9 <main+105> at swipl [12] 0x1c000787 <___start+119> at swipl ---8<--- ---8<--- thread_agc_queue: ?- thread_agc_findall. sem_destroy on semaphore with waiters! [FATAL ERROR: Received signal 11 (segv) while in -538976289-th garbage collection] ---8<--- All other tests pass. This is good news, but do you have any comments on the remaining errors? Ofcourse there may be bugs in our thread implementation still. Cheers
By the way, don't read into this in too much detail, as there is a high chance that it's our fault and not yours. The new thread system was enabled literally a few days ago.
Hi Edd, (In reply to comment #15) > By the way, don't read into this in too much detail, as there is a high chance > that it's our fault and not yours. The new thread system was enabled literally > a few days ago. Progress! The first one is obviously jumping into the dark after the resumeThreads(). The second gives no context at all. Negative GC count suggests a memory corruption. What might be interesting is that both involve semaphores. Semaphores are used in some places because they are the only signal-safe synchronisation primitive. Could it be that the OpenBSD ones do not satisfy that? I'd recompile using -DO_DEBUG (using recent source) and run swipl -d msg_thread <options> That might gives some hints on what is going on. See comments and DEBUG in pl-thread.c Cheers --- Jan