Bug 11 - Failing/Non-terminating regression tests on OpenBSD.
: Failing/Non-terminating regression tests on OpenBSD.
Status: CONFIRMED
Product: SWI-Prolog
Classification: Unclassified
Component: core
: 5.10.x
: PC Other
: Highest normal
Assigned To: Jan Wielemaker
:
:
:
  Show dependency treegraph
 
Reported: 2011-09-17 17:46 CEST by Edd Barrett
Modified: 2012-02-20 09:06 CET (History)
1 user (show)

See Also:


Attachments
Test signal handling while waiting for a mutex (559 bytes, text/x-csrc)
2011-09-19 15:06 CEST, Jan Wielemaker
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Edd Barrett 2011-09-17 17:46:37 CEST
Whilst updating the OpenBSD port of swi to the latest stable, I notice the
following failing/non-terminating (script) regression tests:

thread/agc - non-terminating
thread/agc2 - non-terminating
thread/queue_resource - a thread segfaults, non-terminating
thread/test_agc_callback - non-terminating
thread/test_signal - non-terminating
thread/thread_agc_findall - non-terminating
thread/thread_agc_queue - non-terminating

In addition, the following non-script regression tests fail:

Test unicode_file(absfile-1) failed
Test unicode_file(ext-1) failed

I am unsure if these bugs are in SWI or in OpenBSD, but they have been failing
for some time (<= 5.10.4).

My prolog know-how is limited, but I am interested in helping fix these.
Comment 1 Jan Wielemaker 2011-09-18 13:40:14 CEST
Hi Edd,

The unicode file-name tests require that the system is configured properly,
notably runs a UTF-8 based locale.

The other issue is more worrying.  This would be interesting to know when
this has started.  Considering that so many of these tests fail indicates
that that one of the thread-synchronization primitives in use to make AGC
work doesn't work as expected on OpenBSD.  I'd suspect the implementation
of forThreadLocalData() in pl-thread.c.

If you are an experienced C system hacker you should be able to figure
out what happens there.  You can run swipl -d 1 <args> to get output
from the DEBUG statement in this file.

If this fails, I guess that the best solution would be for me to install
OpenBSD in a VirtualBox.  Please provide version info in that case.

    Cheers --- Jan
Comment 2 Edd Barrett 2011-09-18 14:26:58 CEST
OK, well, here is the debug output of the agc test:


?- agc.
[2] Executing thread signal
[2]: Prolog backtrace:
[2]: end Prolog backtrace:
Calling forThreadLocalData() from 2
Signalling 1
Signalled 1 threads.  Waiting ... 
        Done work on 1 (ok) All done!
Calling forThreadLocalData() from 2
Signalling 1
Signalled 1 threads.  Waiting ... 
        Done work on 1 (ok) All done!
Calling forThreadLocalData() from 2
Signalling 1
Signalled 1 threads.  Waiting ... 
        Done work on 1 (ok) All done!
Calling forThreadLocalData() from 2
Signalling 1
Signalled 1 threads.  Waiting ... 
        Done work on 1 (ok) All done!
Calling forThreadLocalData() from 2
Signalling 1
Signalled 1 threads.  Waiting ... 
        Done work on 1 (ok) All done!
Calling forThreadLocalData() from 2
Signalling 1
Signalled 1 threads.  Waiting ... 


I am building a swipl with debug symbols, so that I can interrupt the program
at this point. Will keep you updated.

I am running OpenBSD-current (aka. developer snapshots). If you want to
experiment, I can show you how to get set up. I would reccommend you use the
ports tree, as there are a few patches we apply to make swi build/package on
OpenBSD. See here:

http://www.openbsd.org/cgi-bin/cvsweb/ports/lang/swi-prolog/patches/

Cheers
Comment 3 Jan Wielemaker 2011-09-18 15:06:52 CEST
(In reply to comment #2)
> OK, well, here is the debug output of the agc test:
> 
> 
> ?- agc.
> [2] Executing thread signal
> [2]: Prolog backtrace:
> [2]: end Prolog backtrace:
> Calling forThreadLocalData() from 2
> Signalling 1
> Signalled 1 threads.  Waiting ... 
>         Done work on 1 (ok) All done!
> Calling forThreadLocalData() from 2
> Signalling 1
> Signalled 1 threads.  Waiting ... 
>         Done work on 1 (ok) All done!
> Calling forThreadLocalData() from 2
> Signalling 1
> Signalled 1 threads.  Waiting ... 
>         Done work on 1 (ok) All done!
> Calling forThreadLocalData() from 2
> Signalling 1
> Signalled 1 threads.  Waiting ... 
>         Done work on 1 (ok) All done!
> Calling forThreadLocalData() from 2
> Signalling 1
> Signalled 1 threads.  Waiting ... 
>         Done work on 1 (ok) All done!
> Calling forThreadLocalData() from 2
> Signalling 1
> Signalled 1 threads.  Waiting ... 

Does it end here or does it block here?  I have the impression that
the thread_signal/2 doesn't arrive at the signalled thread.  This would
be a bit odd, as that code is fairly portable.

> I am building a swipl with debug symbols, so that I can interrupt the program
> at this point. Will keep you updated.
> 
> I am running OpenBSD-current (aka. developer snapshots). If you want to
> experiment, I can show you how to get set up. I would reccommend you use the
> ports tree, as there are a few patches we apply to make swi build/package on
> OpenBSD. See here:

If you can give me a quick list of steps to setup the stuff on a virtual
machine, this might be the quickest way out.  Host is AMD64, so I can run
either 64-bit or 32-bit VMs.

> 
> http://www.openbsd.org/cgi-bin/cvsweb/ports/lang/swi-prolog/patches/

I'll walk through these and include stuff in my version if it appears to
be generally useful.

     Cheers --- Jan
Comment 4 Edd Barrett 2011-09-18 15:21:22 CEST
> > Calling forThreadLocalData() from 2
> > Signalling 1
> > Signalled 1 threads.  Waiting ... 
> 
> Does it end here or does it block here?  I have the impression that
> the thread_signal/2 doesn't arrive at the signalled thread.  This would
> be a bit odd, as that code is fairly portable.

It blocks. And a ctrl+c can not kill it.



> > I am running OpenBSD-current (aka. developer snapshots). If you want to
> > experiment, I can show you how to get set up. I would reccommend you use the
> > ports tree, as there are a few patches we apply to make swi build/package on
> > OpenBSD. See here:
> 
> If you can give me a quick list of steps to setup the stuff on a virtual
> machine, this might be the quickest way out.  Host is AMD64, so I can run
> either 64-bit or 32-bit VMs.

Easiest way is like so:

Get this (amd64):
ftp://ftp.fr.openbsd.org/pub/OpenBSD/snapshots/amd64/install50.iso

Boot and follow the install procedure.

Login and do the following (we can do this all as root if we are just testing):

% cd /usr
% cvs -danoncvs@anoncvs.fr.openbsd.org:/cvs co ports
 ... takes a while...
% cd lang/swi-prolog
% export DEBUG=-g  # <- makes with debug symbols
% make build

Once that is done, you will find the sources and binaries in:
/usr/ports/pobj/swi-prolog-5.10.5

Hope this helps.

Is there an irc channel where we can discuss swi?
Comment 5 Edd Barrett 2011-09-18 15:36:51 CEST
Hrm, looks like teh debug symbols got stripped, annoying, but anyway, here is
teh backtrace of a hang:


(gdb) run -d 1
Starting program: /usr/local/bin/swipl -d 1
Starting Prolog Part of initialisation
% library(swi_hooks) compiled into pce_swi_hooks 0.00 sec, 2,204 bytes
Welcome to SWI-Prolog (Multi-threaded, 32 bits, Version 5.10.5)
Copyright (c) 1990-2011 University of Amsterdam, VU Amsterdam
SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software,
and you are welcome to redistribute it under certain conditions.
Please visit http://www.swi-prolog.org for details.

For help, use ?- help(Topic). or ?- apropos(Word).

?- [agc].
% agc compiled into agc 0.00 sec, 2,284 bytes
true.

?- agc.
[2] Executing thread signal
[2]: Prolog backtrace:
[2]: end Prolog backtrace:
Calling forThreadLocalData() from 2
Signalling 1
Signalled 1 threads.  Waiting ... [New process 26219]
^C
Program received signal SIGINT, Interrupt.
0x05fb2339 in poll () from /usr/lib/libc.so.60.1
(gdb) bt
#0  0x05fb2339 in poll () from /usr/lib/libc.so.60.1
#1  0x019df365 in _thread_kern_poll (wait_reqd=1) at
/usr/src/lib/libpthread/uthread/uthread_kern.c:782
#2  0x019e0233 in _thread_kern_sched (scp=0x0) at
/usr/src/lib/libpthread/uthread/uthread_kern.c:383
#3  0x019e075f in _thread_kern_sched_state_unlock (state=PS_COND_WAIT,
lock=0x80362258, 
    fname=0x219cde14 "/usr/src/lib/libpthread/uthread/uthread_cond.c",
lineno=432) at /usr/src/lib/libpthread/uthread/uthread_kern.c:602
#4  0x019dcae7 in pthread_cond_timedwait (cond=0x219cf120, mutex=0x219cf11c,
abstime=0x899b0fc0)
    at /usr/src/lib/libpthread/uthread/uthread_cond.c:431
#5  0x019d7bae in _thread_gc (arg=0x0) at
/usr/src/lib/libpthread/uthread/uthread_gc.c:181
#6  0x019d75ce in _thread_start () at
/usr/src/lib/libpthread/uthread/uthread_create.c:242
#7  0x0000002b in ?? ()
#8  0x00000000 in ?? ()


Perhaps not as useful as we would like...

Notice how only one thread is signalled this time, is this a race condition?

The next run looked like this:

?- sgc.
Correct to: "agc"? yes
[2] Executing thread signal
[2]: Prolog backtrace:
[2]: end Prolog backtrace:
Calling forThreadLocalData() from 2
Signalling 1
Signalled 1 threads.  Waiting ... 
        Done work on 1 (ok) All done!
Calling forThreadLocalData() from 2
Signalling 1
Signalled 1 threads.  Waiting ... 

Again, a different number of signals...

Could be a pthread bug in OpenBSD ofcourse... Not sure.
Comment 6 Jan Wielemaker 2011-09-18 16:04:26 CEST
(In reply to comment #4)

> Get this (amd64):
> ftp://ftp.fr.openbsd.org/pub/OpenBSD/snapshots/amd64/install50.iso

I think I'll try that

> Is there an irc channel where we can discuss swi?

Not as far as I know.  At least I'm not connected.

  --- Jan
Comment 7 Jan Wielemaker 2011-09-19 11:11:20 CEST
> ?- sgc.
> Correct to: "agc"? yes
> [2] Executing thread signal
> [2]: Prolog backtrace:
> [2]: end Prolog backtrace:
> Calling forThreadLocalData() from 2
> Signalling 1
> Signalled 1 threads.  Waiting ... 
>         Done work on 1 (ok) All done!
> Calling forThreadLocalData() from 2
> Signalling 1
> Signalled 1 threads.  Waiting ... 
> 
> Again, a different number of signals...

I do not see the different number.  There are two threads.  One running
agc_loop/0, which will be thread-id 2 and the main one.  The main one 
creates atoms while the agc one collects them.  This goes on for a while.

The way this is supposed to work is that thread 2 initiates an atom-GC.
As atoms are shared, it will signal all other threads (i.e. thread main or 1).
These threads will mark the atoms they reference and signal a semaphore.  If
the AGC thread got a reply from all other threads it will collect the
unreferenced atoms and continue execution.  This doesn't seem to happen.

Its been a long time ago that anything changed to this code and it has
proven work on many platforms.  One possibility is that the used signal
(SIGUSR1 by default) is reversed for something else on OpenBSD.  This was
a problem on old Linux versions.

What would be interesting to know is whether this bug is present on older
OpenBSD versions.  If this is new for 5.0, a bug in 5.0 becomes likely.

I've downloaded OpenBSD and it is now installing the ports ...

     Cheers --- Jan
Comment 8 Jan Wielemaker 2011-09-19 13:46:24 CEST
Edd,

Got it working and reproduced.  Things look bad.  It turns out that OpenBSD
doesn't handle signals while waiting for a mutex.  This is essential for the
operation of the atom and clause garbage collectors.

There is one possible way-out: Windows also lacks signals and fixes the
problem by suspending the other threads.  OpenBSD seems to have 
pthread_suspend_np(), but it looks like this is built on top of signals.
This might not be good enough: I must have the guarantee that the suspended
thread is really suspended before pthread_suspend() returns, no matter what
this thread is doing.

Attached you find the test program.  This should print "Called handler".

What do you think?  Is this a bug?  Is there some setting that changes this
behavior?  Is thread-suspension usable?

Note that disabling threading is another option.  As we have recently
seen on the list though, the single-threaded version no longer compiles
out of the box.  I have no plans to fix this because simply too much
of the system and applications out there assume threads these days.

    Cheers --- Jan
Comment 9 Jan Wielemaker 2011-09-19 15:06:24 CEST
Created attachment 7 [details]
Test signal handling while waiting for a mutex

Should print that it handled the signal.  Nothing is printed on OpenBSD (5.0)
Comment 10 Edd Barrett 2011-09-19 16:03:24 CEST
Ah,

I just found a regression test (which succeeds) that is supposed to show that
signals are delivered to threads correctly:
http://www.openbsd.org/cgi-bin/cvsweb/src/regress/lib/libpthread/pthread_kill/pthread_kill.c?rev=1.4

Here, however, not mutex is involved. I notice that the manual page for
pthread_mutex_unlock() does not suggest that this function can return EINTR
(interrupted system call), so perhaps it is the case that signals are not
handled during pthread_mutex_unlock(). Maybe this is OpenBSD specific.

I don't know if that helps. I am going to try and find a thread hacker to ask.
Comment 11 Edd Barrett 2011-09-19 16:16:06 CEST
Actually, it would seem anything like EINTR is irrelevant. On a linux machine,
pthread_mutex_unlock() returns success (0).
Comment 12 Jan Wielemaker 2011-09-19 16:25:28 CEST
(In reply to comment #10)
> Ah,
> 
> I just found a regression test (which succeeds) that is supposed to show that
> signals are delivered to threads correctly:
> http://www.openbsd.org/cgi-bin/cvsweb/src/regress/lib/libpthread/pthread_kill/pthread_kill.c?rev=1.4

Well, it shows some aspects of pthread_kill(), but not all we need.  We need
to make sure that the signal delivery is not blocked indefinitely, no matter
what the thread is doing.  The only exception is that the thread explicitly
blocks the signal.  Most notably, we need signal delivery under all blocking
system calls, *including* pthread_mutex_lock() (and all the other blocking
pthread functions).

If that cannot be provided, we need a way to suspend a thread that will doesn't
return before the thread is suspended and (again) doesn't block indefinitely.
Having to go that way would make the code a lot dirtier :-(

> Here, however, not mutex is involved. I notice that the manual page for
> pthread_mutex_unlock() does not suggest that this function can return EINTR
> (interrupted system call), so perhaps it is the case that signals are not
> handled during pthread_mutex_unlock(). Maybe this is OpenBSD specific.

I assume you mean *lock*.  The unlock is non-blocking anyway :-).  I don't
know whether the standard has anything to say here (searched a bit, but the
only thing I can find is that signal handlers are not allowed to call
pthread_*; that is a different issue though).  Surely, signals are handled
while waiting for a lock on on most other POSIX systems, including freeBSD. 
Note that pthread_mutex_lock() doesn't return with EINTR, but restarts after
the signal is handled.

> I don't know if that helps. I am going to try and find a thread hacker to ask.

I hope there is a solution, otherwise it means end of SWI-Prolog for OpenBSD
:-(

     Cheers --- Jan
Comment 13 Jan Wielemaker 2011-09-19 16:37:31 CEST
(In reply to comment #11)
> Actually, it would seem anything like EINTR is irrelevant. On a linux machine,
> pthread_mutex_unlock() returns success (0).

I found this on
http://pubs.opengroup.org/onlinepubs/007908799/xsh/pthread_mutex_lock.html (The
Single UNIX ® Specification, Version 2):

"If a signal is delivered to a thread waiting for a mutex, upon return from the
signal handler the thread resumes waiting for the mutex as if it was not
interrupted."

Although this isn't really explicit, it suggests that the signal must be
delivered (if not blocked).

    Cheers --- Jan
Comment 14 Edd Barrett 2012-02-19 23:40:18 CET
Hi Jan,

Sorry to drag up the old conversation, but OpenBSD's new thread model has been
enabled in developer snapshots. I decided to re-try your regression tests.

This was the state of things when I started this thread:

thread/agc - non-terminating
thread/agc2 - non-terminating
thread/queue_resource - a thread segfaults, non-terminating
thread/test_agc_callback - non-terminating
thread/test_signal - non-terminating
thread/thread_agc_findall - non-terminating
thread/thread_agc_queue - non-terminating

And now:
thread/agc - pass
thread/agc2 - pass
thread/queue_resource - pass
thread/test_agc_callback - pass
thread/test_signal - bus error
thread/thread_agc_findall - pass
thread/thread_agc_queue - seg / non-termination

---8<---
?- test_signal.

SWI-Prolog [thread 1]: received fatal signal 11 (segv)
Stack trace labeled "crash":
  [0] 0xf3b9dec <backtrace+44> at /usr/local/lib/libexecinfo.so.1.0
  [1] 0xa0713d4 <PL_strtod+9796> at
/usr/local/lib/swipl-5.10.5/lib/i386-openbsd5.1/libswipl.so
  [2] 0xa071494 <PL_strtod+9988> at
/usr/local/lib/swipl-5.10.5/lib/i386-openbsd5.1/libswipl.so
  [3] 0xa022085 <PL_install_readline+4789> at
/usr/local/lib/swipl-5.10.5/lib/i386-openbsd5.1/libswipl.so
  [4] 0x6495004
  [5] 0xa038a70 <resumeThreads+2736> at
/usr/local/lib/swipl-5.10.5/lib/i386-openbsd5.1/libswipl.so
  [6] 0xa038d95 <resumeThreads+3541> at
/usr/local/lib/swipl-5.10.5/lib/i386-openbsd5.1/libswipl.so
  [7] 0xa03a841 <PL_thread_raise+913> at
/usr/local/lib/swipl-5.10.5/lib/i386-openbsd5.1/libswipl.so
  [8] 0x9fd38f7 <PL_next_solution+15831> at
/usr/local/lib/swipl-5.10.5/lib/i386-openbsd5.1/libswipl.so
  [9] 0xa009d48 <PL_is_ground+3320> at
/usr/local/lib/swipl-5.10.5/lib/i386-openbsd5.1/libswipl.so
  [10] 0x9fc69b2 <PL_toplevel+50> at
/usr/local/lib/swipl-5.10.5/lib/i386-openbsd5.1/libswipl.so
  [11] 0x1c0009b9 <main+105> at swipl
  [12] 0x1c000787 <___start+119> at swipl
---8<---


---8<---
thread_agc_queue:
?- thread_agc_findall.
sem_destroy on semaphore with waiters!
[FATAL ERROR:
        Received signal 11 (segv) while in -538976289-th garbage collection]
---8<---

All other tests pass.

This is good news, but do you have any comments on the remaining errors?
Ofcourse there may be bugs in our thread implementation still.

Cheers
Comment 15 Edd Barrett 2012-02-20 00:09:37 CET
By the way, don't read into this in too much detail, as there is a high chance
that it's our fault and not yours. The new thread system was enabled literally
a few days ago.
Comment 16 Jan Wielemaker 2012-02-20 09:06:53 CET
Hi Edd,

(In reply to comment #15)
> By the way, don't read into this in too much detail, as there is a high chance
> that it's our fault and not yours. The new thread system was enabled literally
> a few days ago.

Progress!  The first one is obviously jumping into the dark after the
resumeThreads().  The second gives no context at all.  Negative GC count
suggests a memory corruption.

What might be interesting is that both involve semaphores.  Semaphores
are used in some places because they are the only signal-safe synchronisation
primitive.  Could it be that the OpenBSD ones do not satisfy that?

I'd recompile using -DO_DEBUG (using recent source) and run

swipl -d msg_thread <options>

That might gives some hints on what is going on.  See comments and DEBUG
in pl-thread.c

    Cheers --- Jan