Random crashing

Osei Poku

2008-07-02 21:44:31 UTC

Hello,

About 5 times a day on a particular machine, ccl drops into the kernel
debugger in an unrecoverable way, i.e. pressing X does not return
control back to lisp. The following is a copy of the output on the
terminal. I have not included the other half of the session, which is
on the other side of swank, because it might not be necessary to debug
this problem. If it is needed to completely understand the problem, I
can provide that directly. As is shown in the output, the lisp
backtrace is not available, so there might be something other than
the lisp code going on here. As I said, this problem has only occurred
on this particular machine.

The output of uname -a is

Linux fatterbox 2.6.22.5-31-default #1 SMP 2007/09/21 22:29:00 UTC x86_64 unknown unknown GNU/Linux

Any help/insight into what this is about is appreciated.

Osei

bash-2.05a$ lisp
; loading system definition from ccl:tools;asdf-install;asdf-install.asd.newest into #<Package "ASDF0">
; registering #<SYSTEM ASDF-INSTALL #x300040E5BD6D> as ASDF-INSTALL
;;; ASDF-Install version 0.6.10
; loading system definition from home:slime;swank.asd.newest into #<Package "ASDF0">
; registering #<SYSTEM :SWANK #x300040F39FAD> as SWANK
;Loading #P"/home/wtam/.slime/fasl/2008-04-24/openmcl-version_1.2-r9226-rc1__(linuxx8664)-linux-x86-64/swank-backend.lx64fsl"...
;Loading #P"/home/wtam/.slime/fasl/2008-04-24/openmcl-version_1.2-r9226-rc1__(linuxx8664)-linux-x86-64/metering.lx64fsl"...
;Loading #P"/home/wtam/.slime/fasl/2008-04-24/openmcl-version_1.2-r9226-rc1__(linuxx8664)-linux-x86-64/swank-openmcl.lx64fsl"...
;Loading #P"/home/wtam/.slime/fasl/2008-04-24/openmcl-version_1.2-r9226-rc1__(linuxx8664)-linux-x86-64/swank-gray.lx64fsl"...
;Loading #P"/home/wtam/.slime/fasl/2008-04-24/openmcl-version_1.2-r9226-rc1__(linuxx8664)-linux-x86-64/swank.lx64fsl"...
; Warning: These Swank interfaces are unimplemented:
; (ACTIVATE-STEPPING ADD-FD-HANDLER ADD-SIGIO-HANDLER CALLS-WHO
  FIND-SOURCE-LOCATION MACROEXPAND-ALL REMOVE-FD-HANDLERS
  REMOVE-SIGIO-HANDLERS RESTART-FRAME RETURN-FROM-FRAME SLDB-BREAK-AT-START
  SLDB-BREAK-ON-RETURN SLDB-STEP-INTO SLDB-STEP-NEXT SLDB-STEP-OUT)
; While executing: SWANK-BACKEND::WARN-UNIMPLEMENTED-INTERFACES, in process listener(1).
Welcome to Clozure Common Lisp Version 1.2-r9226-RC1 (LinuxX8664)!
? (swank:create-server :port 4007 :dont-close t)
;; Swank started at port: 4007.
4007
? exception in foreign context
Exception occurred while executing foreign code
? for help
[20166] OpenMCL kernel debugger: ?
(G) Set specified GPR to new value
(R) Show raw GPR/SPR register values
(L) Show Lisp values of tagged registers
(F) Show FPU registers
(S) Find and describe symbol matching specified name
(B) Show backtrace
(T) Show info about current thread
(X) Exit from this debugger, asserting that any exception was handled
(K) Kill OpenMCL process
(?) Show this help
[20166] OpenMCL kernel debugger: R
%rax = 0x0000000000000000 %r8 = 0x000000004072B7E8
%rcx = 0xFFFFFFFFFFFFFFFF %r9 = 0x000000004072B7E8
%rdx = 0x0000000000000000 %r10 = 0x0000000000000000
%rbx = 0x0000000040E577E8 %r11 = 0x0000000000000246
%rsp = 0x000000004072A278 %r12 = 0x000000004072B7E8
%rbp = 0x000000004072A730 %r13 = 0x000000004072A758
%rsi = 0x0000000000000028 %r14 = 0x0000000000000004
%rdi = 0x0000000000000000 %r15 = 0x000000004072AAE0
%rip = 0x00002B56B5BE22A0 %rflags = 0x0000000000010246
[20166] OpenMCL kernel debugger: F
f00: 0x00000000 (0.000000e+00), 0x0000000000000000 (0.000000e+00)
f01: 0x00000000 (0.000000e+00), 0x0000000000000000 (0.000000e+00)
f02: 0x00000000 (0.000000e+00), 0x0000000000000000 (0.000000e+00)
f03: 0x00000000 (0.000000e+00), 0x0000000000000000 (0.000000e+00)
f04: 0x00000000 (0.000000e+00), 0x0000000000000000 (0.000000e+00)
f05: 0x00000000 (0.000000e+00), 0x0000000000000000 (0.000000e+00)
f06: 0x00000000 (0.000000e+00), 0x0000000000000000 (0.000000e+00)
f07: 0x00000000 (0.000000e+00), 0x0000000000000000 (0.000000e+00)
f08: 0x00000000 (0.000000e+00), 0x0000000000000000 (0.000000e+00)
f09: 0x00000000 (0.000000e+00), 0x0000000000000000 (0.000000e+00)
f10: 0x00000000 (0.000000e+00), 0x0000000000000000 (0.000000e+00)
f11: 0x00000000 (0.000000e+00), 0x0000000000000000 (0.000000e+00)
f12: 0x00000000 (0.000000e+00), 0x0000000000000000 (0.000000e+00)
f13: 0x00000000 (0.000000e+00), 0x0000000000000000 (0.000000e+00)
f14: 0x00000000 (0.000000e+00), 0x0000000000000000 (0.000000e+00)
f15: 0x00000000 (0.000000e+00), 0x0000000000000000 (0.000000e+00)
mxcsr = 0x00001f80
[20166] OpenMCL kernel debugger: B
Framepointer [#x4072A730] in unknown area.
[20166] OpenMCL kernel debugger: T
Current Thread Context Record (tcr) = 0x4072b7e8
Control (C) stack area: low = 0x404d8000, high = 0x4072c000
Value (lisp) stack area: low = 0x2aaaab2f1000, high = 0x2aaaab502000
Exception stack pointer = 0x4072a278
[20166] OpenMCL kernel debugger: L
%rsi (arg_z) = 5
%rdi (arg_y) = 0
%r8 (arg_x) = 135157501
------
%r13 (fn) = 135156971
------
%r15 (save0) = 135157084
Segmentation fault

Gary Byers

2008-07-02 23:01:21 UTC

About the only thing that I can tell you is that you called
SWANK:CREATE-SERVER and crashed in foreign (C) code at the
address 0x00002B56B5BE22A0. I don't know what foreign code
is at that address, but the lisp kernel is generally down
around 0x410000 on x86-64 Linux, so the address is most
likely in some shared library (if it's anywhere at all.)

On Linux, you can get a coarse idea of what memory regions are mapped
(and, when applicable, of what files they're mapped to) by looking
at /proc/<pid>/maps, where <pid> is the process id of the lisp
process. It might be good to know if the address is mapped and
what it's mapped to, but if the problem is something like "a bad
parameter is being passed to some foreign function", we'd really
need to know what foreign function.
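
As a rough illustration of the /proc/<pid>/maps lookup described above,
here is a minimal C sketch that scans a maps-format file for the region
containing a given address. The names (struct mapping, parse_maps_line,
find_mapping) are made up for this example and are not part of CCL. The
sketch assumes the usual Linux line layout (start-end perms offset dev
inode [path]) and a 64-bit unsigned long.

```c
#include <assert.h>
#include <stdio.h>

/* One parsed line of a maps file (hypothetical helper type). */
struct mapping {
    unsigned long start;  /* first address of the region (inclusive) */
    unsigned long end;    /* end of the region (exclusive) */
    char perms[8];        /* e.g. "r-xp" */
    char path[256];       /* backing file; empty for anonymous regions */
};

/* Parse one line of the form
 *   start-end perms offset dev inode [path]
 * Returns 1 on success, 0 if the line does not look like a maps entry. */
int parse_maps_line(const char *line, struct mapping *m)
{
    assert(line != NULL && m != NULL);
    m->path[0] = '\0';
    /* The path is optional, so 3 or 4 assigned fields are both fine. */
    if (sscanf(line, "%lx-%lx %7s %*s %*s %*s %255[^\n]",
               &m->start, &m->end, m->perms, m->path) < 3)
        return 0;
    return 1;
}

/* Look for the region containing addr in the given maps file.
 * Returns 1 and fills *out if found, 0 if the address is not mapped,
 * and -1 if the file cannot be opened. */
int find_mapping(const char *maps_path, unsigned long addr,
                 struct mapping *out)
{
    char line[512];
    FILE *f;
    int found = 0;

    assert(maps_path != NULL && out != NULL);
    f = fopen(maps_path, "r");
    if (f == NULL)
        return -1;
    while (fgets(line, sizeof line, f) != NULL) {
        struct mapping m;
        if (parse_maps_line(line, &m) && addr >= m.start && addr < m.end) {
            *out = m;
            found = 1;
            break;
        }
    }
    fclose(f);
    return found;
}
```

For the crash above, one would call something like
find_mapping("/proc/20166/maps", 0x00002B56B5BE22A0UL, &m) while the
process is still alive (assuming 20166, the number in the debugger
prompt, is the process id). A return value of 0 would mean the address
is not mapped at all.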

There's been a problem in 1.2, whereby foreign pointers (MACPTRs)
don't get invalidated when an image is saved. (It's generally
the case that a foreign address is "per session"; invalidating
the pointer is supposed to make it harder to use a stale foreign
address.) I only got around to fixing that in 1.2 a few days ago;
it was part of the problem that kept someone from loading shared
libraries on FreeBSD. I never figured out exactly -why- that
was part of the problem, but it certainly seemed to be.

If you can do an "svn update" and a (rebuild-ccl t) and the
problem goes away, great ... if not, I can try to explain how
to debug this with GDB, but it may take a while to track it
down this way. (I -would- like to track this down.)

Osei Poku

2008-07-02 23:15:26 UTC

I'll look into what you suggested, but just FYI: the crash happens
a couple of hours after swank:create-server was called, during
development in emacs/slime. That's the other half of what's going on
that is hidden from view.

Thanks for the suggestions though. I'll report anything I find.

Osei Poku

2008-07-09 18:26:56 UTC

Hi,

It crashed again for me. This time I managed to grab the contents of
/proc/pid/maps before I killed it. Logs of the tty session and memory
maps are attached. I had also managed to update from the repository
to r9890-RC1.

Osei

Gary Byers

2008-07-09 19:05:23 UTC

It seems to have crashed in the threads library (libpthread.so).

There's a race condition in the code which suspends threads
on entry to the GC: the thread that's running the GC looks
at each thread that it wants to suspend to see if it's
still alive (the data structure that represents a thread
might still be around, even if the OS-level thread has
exited.) The suspending thread looks at the tcr->osid
field of the target, notes that it's non-zero, then
calls a function to send the os-level thread a signal.
That function accesses the tcr->osid field again (which,
when non-zero, represents a POSIX thread ID) and calls
pthread_kill().

When a thread dies, it clears its tcr->osid field, so
if the target thread dies between the point when the
suspending thread looks and the point where it leaps,
we wind up calling pthread_kill() with a first argument
of 0, and it crashes. That's consistent with the
register information: we're somewhere in the threads
library (possibly in pthread_kill()), and the register
in which C functions receive their first argument (%rdi)
is 0.
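
The race described above can be sketched in C roughly as follows. This
is only an illustration, not the actual code in thread_manager.c: the
struct and function names are hypothetical, and it assumes (as on Linux
with glibc) that a POSIX thread ID can be stored as an unsigned long.
The first function reads the field twice, so the target can clear it
between the check and the use; the second takes a single snapshot and
both tests and uses that value.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <stdatomic.h>

/* Hypothetical, much simplified stand-in for a thread context record. */
struct tcr_sketch {
    /* POSIX thread ID as an integer; the owning thread stores 0 here
     * when it exits. Assumes pthread_t is an integer type. */
    _Atomic unsigned long osid;
};

/* Racy pattern (shown for contrast only): osid is read once for the
 * check and again for the call. */
int signal_thread_racy(struct tcr_sketch *tcr, int signo)
{
    assert(tcr != NULL);
    if (atomic_load(&tcr->osid) == 0)
        return ESRCH;
    /* The target thread may exit right here and store 0 into osid,
     * in which case the call below passes 0 to pthread_kill(). */
    return pthread_kill((pthread_t)atomic_load(&tcr->osid), signo);
}

/* "Look before leaping": read osid once, then check and use that
 * same snapshot, so pthread_kill() never sees 0. */
int signal_thread_checked(struct tcr_sketch *tcr, int signo)
{
    unsigned long osid;

    assert(tcr != NULL);
    osid = atomic_load(&tcr->osid);
    if (osid == 0)
        return ESRCH;
    return pthread_kill((pthread_t)osid, signo);
}
```

Note that the snapshot only rules out the zero-argument case. The thread
can still exit after the snapshot is taken, and POSIX leaves the use of
a thread ID whose lifetime has ended undefined, so a complete fix would
also need some guarantee that the target is still alive when it is
signalled.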

I'll try to check in a fix for that (look before leaping)
soon. As I understand it, SLIME will sometimes (depending
on the setting of a "communication style" variable)
spawn a thread in which to run each form being evaluated
(via C-M-x or whatever); whether that's a good idea or
not, consing short-lived threads all the time is probably
a good way to trigger this bug. I don't use SLIME, and
don't know what the consequences of changing the communication
style variable would be.

Osei Poku

2008-07-17 19:28:34 UTC

Hello,

I updated today from svn but this happened again. Again the PC
was in the pthread memory region and %rdi was 0. I verified that the
fix (r9997, I think) was in my ccl working directory (somewhere in
thread_manager.c, right?).

My current version is:
Clozure Common Lisp Version 1.2-r10073M-RC1 (LinuxX8664)!

Is there anything other than (rebuild-ccl :force t) that I need to do
to recompile the c source for the lisp kernel?

Thanks,
Osei

Osei Poku

2008-07-17 19:44:42 UTC

Post by Gail Zacharias
To rebuild the kernel, you need to do (rebuild-ccl :FULL t).

Ah. Cool, thanks.

Gail Zacharias

2008-07-17 19:43:06 UTC

To rebuild the kernel, you need to do (rebuild-ccl :FULL t).

Gary Byers

2008-07-17 19:54:02 UTC

Yes; there are 3 calls to pthread_kill() in thread_manager.c. One of
them (in resume_tcr()) is conditionalized out; the other two
(in raise_thread_interrupt() and suspend_tcr()) should check
to make sure that the thread that they'd pass as the first
argument to pthread_kill() is non-zero before doing the call.

Post by Osei Poku
Clozure Common Lisp Version 1.2-r10073M-RC1 (LinuxX8664)!
Is there anything other than (rebuild-ccl :force t) that I need to do to
recompile the c source for the lisp kernel?

As Gail just pointed out, :full t (or :kernel t) is necessary
in order to get the kernel updated. (:force t will recompile
FASLs even if they're newer than the corresponding source;
that's occasionally useful, but not really what you want here.)

If the kernel that you're running had its modified date change
by the rebuild process, it likely incorporates those changes. If
those changes didn't fix the problem, then I don't have a good
guess as to what the problem is: there aren't too many places
where the lisp calls into the threads library: it creates threads
and sends them signals via pthread_kill(). (There's another
place where a thread will send itself a signal via pthread_kill(),
but that is pretty much guaranteed to be a valid thread ...)

Osei Poku

2008-07-18 16:25:50 UTC

Ok... It happened again after recompiling the kernel. I managed to
attach a gdb session to the process and it is still running, so I can
possibly provide more feedback if you need. My current gdb session
log is inserted below.

/usr/bin/gdb

GNU gdb 6.6.50.20070726-cvs
Copyright (C) 2007 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-suse-linux".
(gdb) attach 3268
Attaching to process 3268
Reading symbols from /home/opoku/local/share/ccl/lx86cl64...done.
Using host libthread_db library "/lib64/libthread_db.so.1".
Reading symbols from /lib64/libdl.so.2...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libm.so.6...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libpthread.so.0...done.
[Thread debugging using libthread_db enabled]
[New Thread 0x2adafe820880 (LWP 3268)]
[New Thread 0x410bb950 (LWP 6095)]
[New Thread 0x4131f950 (LWP 6094)]
[New Thread 0x40e57950 (LWP 6093)]
[New Thread 0x40bf3950 (LWP 3307)]
[New Thread 0x4098f950 (LWP 3306)]
[New Thread 0x4072b950 (LWP 3305)]
[New Thread 0x404c7950 (LWP 3272)]
[New Thread 0x40263950 (LWP 3271)]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /usr/lib64/libssl.so...done.
Loaded symbols for /usr/lib64/libssl.so
Reading symbols from /usr/lib64/libcrypto.so.0.9.8...done.
Loaded symbols for /usr/lib64/libcrypto.so.0.9.8
Reading symbols from /lib64/libz.so.1...done.
Loaded symbols for /lib64/libz.so.1
Reading symbols from /home/opoku/work/code/diagnosis/library/clsql-4.0.3/uffi/clsql_uffi.so...done.
Loaded symbols for /home/opoku/work/code/diagnosis/library/clsql-4.0.3/uffi/clsql_uffi.so
Reading symbols from /usr/lib64/libmysqlclient.so...done.
Loaded symbols for /usr/lib64/libmysqlclient.so
Reading symbols from /lib64/libcrypt.so.1...done.
Loaded symbols for /lib64/libcrypt.so.1
Reading symbols from /lib64/libnsl.so.1...done.
Loaded symbols for /lib64/libnsl.so.1
Reading symbols from /home/opoku/work/code/diagnosis/library/clsql-4.0.3/db-mysql/clsql_mysql.so...done.
Loaded symbols for /home/opoku/work/code/diagnosis/library/clsql-4.0.3/db-mysql/clsql_mysql.so
0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
(gdb) bt
#0 0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
#1 0x000000000041b89c in sem_wait_forever (s=0x6476b0) at ../thread_manager.c:338
#2 0x000000000041bfef in suspend_resume_handler (signo=40, info=<value optimized out>, context=0x7ffface5c800) at ../thread_manager.c:455
#3 <signal handler called>
#4 0x00002adafe2cb5c1 in nanosleep () from /lib64/libpthread.so.0
#5 0x00000000004105da in _SPffcall () at ../x86-spentry64.s:3983
#6 0x00007ffface5cf10 in ?? ()
#7 0x00002adafea64f88 in ?? ()
#8 0x000000000000031a in ?? ()
#9 0x00007ffface5cf00 in ?? ()
#10 0x0000000000000000 in ?? ()
(gdb) info threads
9 Thread 0x40263950 (LWP 3271) 0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
8 Thread 0x404c7950 (LWP 3272) 0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
7 Thread 0x4072b950 (LWP 3305) 0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
6 Thread 0x4098f950 (LWP 3306) 0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
5 Thread 0x40bf3950 (LWP 3307) 0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
4 Thread 0x40e57950 (LWP 6093) 0x00002adafe591bfb in read () from /lib64/libc.so.6
3 Thread 0x4131f950 (LWP 6094) 0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
2 Thread 0x410bb950 (LWP 6095) 0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
1 Thread 0x2adafe820880 (LWP 3268) 0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0

Yes; there are 3 calls to pthread_kill() in that file. One of them
(in resume_tcr()) is conditionalized out; the other two
(in raise_thread_interrupt() and suspend_tcr()) should check
to make sure that the thread that they'd pass as the first
argument to pthread_kill() is non-zero before doing the call.

Post by Osei Poku
Clozure Common Lisp Version 1.2-r10073M-RC1 (LinuxX8664)!
Is there anything other than (rebuild-ccl :force t) that I need to
do to recompile the c source for the lisp kernel?

As Gail just pointed out, :full t (or :kernel t) is necessary
in order to get the kernel updated. (:force t will recompile
FASLs even if they're newer than the corresponding source;
that's occasionally useful, but not really what you want here.)
If the kernel that you're running had its modified date change
by the rebuild process, it likely incorporates those changes. If
those changes didn't fix the problem, then I don't have a good
guess as to what the problem is: there aren't too many places
where the lisp calls into the threads library: it creates threads
and sends them signals via pthread_kill(). (There's another place
where a thread will send itself a signal via pthread_kill(),
but that is pretty much guaranteed to be a valid thread ...)

Post by Osei Poku
Thanks,
Osei

Post by Gary Byers

Post by Osei Poku
Hi,
It crashed again for me. This time I managed to grab the
contents of
/proc/pid/maps before I killed it. Logs of the tty session and memory
maps are attached. I had also managed to update from the
repository to
r9890-RC1.
Osei

It seems to be crashed in the threads library (libpthread.so).
There's a race condition in the code which suspends threads
on entry to the GC: the thread that's running the GC looks
at each thread that it wants to suspend to see if it's
still alive (the data structure that represents a thread
might still be around, even if the OS-level thread has
exited.) The suspending thread looks at the tcr->osid
field of the target, notes that it's non-zero, then
calls a function to send the os-level thread a signal.
That function accesses the tcr->osid field again (which,
when non-zero, represents a POSIX thread ID) and calls
pthread_kill().
When a thread dies, it clears its tcr->osid field, so
if the target thread dies between the point when the
suspending thread looks and the point where it leaps,
we wind up calling pthread_kill() with a first argument
of 0, and it crashes. That's consistent with the
register information: we're somewhere in the threads
library (possibly in pthread_kill()), and the register
in which C functions receive their first argument (%rdi)
is 0.
I'll try to check in a fix for that (look before leaping)
soon. As I understand it, SLIME will sometimes (depending
on the setting of a "communication style" variable)
spawn a thread in which to run each form being evaluated
(via C-M-x or whatever); whether that's a good idea or
not, consing short-lived threads all the time is probably
a good way to trigger this bug. I don't use SLIME, and
don't know what the consequences of changing the communication
style variable would be.
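
The "look before leaping" fix described above can be sketched in C as follows. This is only an illustration, not the code that was checked in: the struct `tcr_sketch` and the function `signal_if_alive` are hypothetical names, and treating an osid of 0 as "thread has exited" follows the description in this thread rather than anything POSIX guarantees. The point is that the field is read once, and the same copy is both checked and passed to pthread_kill().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's TCR; only the osid field
   discussed above is modelled. */
typedef struct {
    volatile pthread_t osid;   /* cleared to 0 when the thread dies */
} tcr_sketch;

/* Read osid exactly once, check the copy, and signal using that copy.
   Returns ESRCH if the thread has already cleared its osid, otherwise
   whatever pthread_kill() returns. */
int signal_if_alive(tcr_sketch *tcr, int signo)
{
    assert(tcr != NULL);
    pthread_t target = tcr->osid;        /* the single "look" */
    if (target == 0) {
        return ESRCH;                    /* never call pthread_kill(0, ...) */
    }
    return pthread_kill(target, signo);  /* the "leap", on the checked value */
}
```

This narrows the window rather than closing it: a thread can still exit between the check and the call, so a real fix would probably also need the locking discussed later in the thread.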

Gary Byers

2008-07-18 16:45:59 UTC

Ok... It happened again after recompiling the kernel. I managed to attach a
gdb session to the process and it is still running, so I can possibly provide
more feedback if you need it. My current gdb session log is inserted below.

It basically shows that one thread is reading (from standard input)
and that all other threads are waiting for a semaphore that'll
allow them to wake from a suspended state.

In other words, you're in the kernel debugger.
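
For readers unfamiliar with the pattern: the backtrace earlier in the thread shows a suspended thread inside sem_timedwait(), called from sem_wait_forever() in thread_manager.c. The kernel's actual implementation is not shown here, so the following is only a guess at the general shape: a loop that retries sem_timedwait() until it succeeds. The function name `wait_forever_sketch` and the one-second timeout are assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <semaphore.h>
#include <stddef.h>
#include <time.h>

/* Wait on a semaphore by retrying sem_timedwait() with a short timeout.
   Returns 0 once the semaphore has been acquired, -1 on an unexpected
   error.  Timeout length and name are illustrative assumptions. */
int wait_forever_sketch(sem_t *s)
{
    assert(s != NULL);
    for (;;) {
        struct timespec deadline;
        if (clock_gettime(CLOCK_REALTIME, &deadline) != 0) {
            return -1;
        }
        deadline.tv_sec += 1;              /* wake up once a second */
        if (sem_timedwait(s, &deadline) == 0) {
            return 0;                      /* posted: the thread resumes */
        }
        if (errno != ETIMEDOUT && errno != EINTR) {
            return -1;                     /* e.g. EINVAL for a bad semaphore */
        }
        /* timed out or interrupted by a signal: wait again */
    }
}
```

A thread blocked this way shows up in 'info threads' exactly as the sem_timedwait() entries above do.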

(gdb) info threads
9 Thread 0x40263950 (LWP 3271) 0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
8 Thread 0x404c7950 (LWP 3272) 0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
7 Thread 0x4072b950 (LWP 3305) 0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
6 Thread 0x4098f950 (LWP 3306) 0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
5 Thread 0x40bf3950 (LWP 3307) 0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
4 Thread 0x40e57950 (LWP 6093) 0x00002adafe591bfb in read () from /lib64/libc.so.6
3 Thread 0x4131f950 (LWP 6094) 0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
2 Thread 0x410bb950 (LWP 6095) 0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
1 Thread 0x2adafe820880 (LWP 3268) 0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0

Thread 4 above is the one which got the exception, suspended other threads,
and is now trying to read a character in the kernel debugger.

To see how you got there, set a breakpoint at the (%rip) address where the
exception occurred.

Before doing much of anything, tell GDB to ignore signals that the lisp
handles:

(gdb) source lisp-kernel/linuxx8664/.gdbinit

Then set the breakpoint, and "continue" (so that the kernel debugger
can run):

(gdb) br *0x00002ADAFE2CA325

(gdb) continue

In the kernel debugger, type X. Back in GDB, you'll have hit the
breakpoint (in some thread).

(gdb) info thread

If it's thread 4 (the one that entered the kernel debugger and was
in read() in the 'info threads' output above), good. If it's some other
thread ... well, that's -probably- not interesting (unless the other
thread gets an exception at the same place.)

Where are you (where is address *0x00002ADAFE2CA325) and how did
you get there ('bt' in GDB) ?

Osei Poku

2008-07-21 20:14:34 UTC

Got it to crash again....

Post by Gary Byers
Where are you (where is address *0x00002ADAFE2CA325) and how did
you get there ('bt' in GDB) ?

This time %rip = 0x00002ABAFDCCD325. After I set the break point,
continued and typed X into the kernel debugger, I arrive here in gdb.
I will try not to screw up the debugging session like last time so
that I can provide additional information.

(gdb) bt
#0 0x00002abafdccd325 in sem_post () from /lib64/libpthread.so.0
#1 0x000000000041b3e2 in resume_tcr (tcr=0x40e577d0) at ../thread_manager.c:1376
#2 0x000000000041c0ba in resume_other_threads (for_gc=<value optimized out>) at ../thread_manager.c:1544
#3 0x000000000041d62e in lisp_Debugger (xp=0x4131dd60, info=0x4131e110, why=11, in_foreign_code=1, message=0x4131db10 "Unhandled exception 11 at 0x2abafdccd325, context->regs at #x4131dd88") at ../lisp-debug.c:919
#4 0x000000000041a2c6 in signal_handler (signum=11, info=0x4131e110, context=0x4131dd60, tcr=0x4131f7d0, old_valence=1) at ../x86-exceptions.c:1070
#5 <signal handler called>
#6 0x00002abafdccd325 in sem_post () from /lib64/libpthread.so.0
#7 0x000000000041b3e2 in resume_tcr (tcr=0x417e77d0) at ../thread_manager.c:1376
#8 0x000000000041c146 in lisp_resume_tcr (tcr=0x417e77d0) at ../thread_manager.c:1418
#9 0x000000000041a0c8 in handle_exception (signum=<value optimized out>, info=0x4131eaa0, context=0x4131e6f0, tcr=0x4131f7d0, old_valence=0) at ../x86-exceptions.c:910
#10 0x000000000041a218 in signal_handler (signum=4, info=0x4131eaa0, context=0x4131e6f0, tcr=0x4131f7d0, old_valence=0) at ../x86-exceptions.c:1064
#11 <signal handler called>
#12 0x00003000400110ab in ?? ()
#13 0x00003000404265fc in ?? ()
#14 0x000000000040e0ac in _SPnthrowvalues () at ../x86-spentry64.s:1404
#15 0x00002aaaad3e0110 in ?? ()
#16 0x0000000000000008 in ?? ()
#17 0x0000000000000000 in ?? ()
(gdb) info threads
10 Thread 0x40263950 (LWP 6218) 0x00002abafdccd2cb in sem_timedwait () from /lib64/libpthread.so.0
9 Thread 0x404c7950 (LWP 6219) 0x00002abafdccd2cb in sem_timedwait () from /lib64/libpthread.so.0
8 Thread 0x4072b950 (LWP 6223) 0x00002abafdccd2cb in sem_timedwait () from /lib64/libpthread.so.0
7 Thread 0x4098f950 (LWP 6224) 0x00002abafdccd2cb in sem_timedwait () from /lib64/libpthread.so.0
6 Thread 0x40bf3950 (LWP 6225) 0x00002abafdccd2cb in sem_timedwait () from /lib64/libpthread.so.0
5 Thread 0x410bb950 (LWP 8021) 0x00002abafdccd2cb in sem_timedwait () from /lib64/libpthread.so.0
* 4 Thread 0x4131f950 (LWP 8307) 0x00002abafdccd325 in sem_post () from /lib64/libpthread.so.0
3 Thread 0x40e57950 (LWP 8308) 0x00002abafdccd2cb in sem_timedwait () from /lib64/libpthread.so.0
2 Thread 0x41583950 (LWP 8309) 0x00002abafdccd2cb in sem_timedwait () from /lib64/libpthread.so.0
1 Thread 0x2abafe223880 (LWP 6215) 0x00002abafdccd2cb in sem_timedwait () from /lib64/libpthread.so.0
(gdb)

Gary Byers

2008-07-21 21:53:53 UTC

If you still have the debugging session running, could you do:

(gdb) p/x *(TCR *)0x417e77d0

That address is the value of the "tcr" argument to "resume_tcr()" in
frame #7 in the backtrace below. If you no longer have the debugging
session and have to reproduce the problem, what we want to see is
what the "tcr" argument to resume_tcr() pointed to at the point
where resume_tcr() called sem_post() and crashed.

The gdb command above means "print, in hex, the contents of
what this address points to, interpreting that address as
being of type 'pointer to TCR'" (where a TCR is a "Thread Context
Record" that contains several interesting fields).

'resume_tcr()' basically does 'sem_post(tcr->resume)', and a crash
would make sense if tcr->resume was NULL. If it was, then one of
the threads that's doing sem_timedwait() on its 'resume' semaphore
would presumably be waiting on a NULL semaphore, and that doesn't
make sense.
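
To make the failure mode concrete, here is a minimal C sketch of a resume operation that refuses to post to a missing semaphore. It is not the kernel's resume_tcr(): the reduced struct `tcr_sketch` and the function `resume_sketch` are hypothetical, and only the 'resume' field mentioned above is modelled.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <semaphore.h>
#include <stddef.h>

/* Hypothetical, reduced TCR: only the 'resume' semaphore is modelled. */
typedef struct {
    sem_t *resume;
} tcr_sketch;

/* Post to the thread's resume semaphore if it has one.
   Returns 0 after a successful post, -1 if there is nothing to post to
   or sem_post() itself fails. */
int resume_sketch(tcr_sketch *tcr)
{
    assert(tcr != NULL);
    sem_t *s = tcr->resume;    /* read the field once */
    if (s == NULL) {
        return -1;             /* sem_post(NULL) would dereference address 0 */
    }
    return sem_post(s);
}
```

A guard like this would only hide the symptom; why the semaphore pointer would be NULL in the first place is the open question.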

Post by Osei Poku
Got it to crash again....

Post by Gary Byers
Where are you (where is address *0x00002ADAFE2CA325) and how did
you get there ('bt' in GDB) ?

This time %rip = 0x00002ABAFDCCD325. After I set the break point, continued
and typed X into the kernel debugger, I arrive here in gdb. I will try not
to screw up the debugging session like last time so that I can provide
additional information.
(gdb) bt
#0 0x00002abafdccd325 in sem_post () from /lib64/libpthread.so.0
#1 0x000000000041b3e2 in resume_tcr (tcr=0x40e577d0) at
../thread_manager.c:1376
#2 0x000000000041c0ba in resume_other_threads (for_gc=<value optimized out>)
at ../thread_manager.c:1544
#3 0x000000000041d62e in lisp_Debugger (xp=0x4131dd60, info=0x4131e110,
why=11, in_foreign_code=1, message=0x4131db10 "Unhandled exception 11 at
0x2abafdccd325, context->regs at #x4131dd88") at ../lisp-debug.c:919
#4 0x000000000041a2c6 in signal_handler (signum=11, info=0x4131e110,
context=0x4131dd60, tcr=0x4131f7d0, old_valence=1) at
../x86-exceptions.c:1070
#5 <signal handler called>
#6 0x00002abafdccd325 in sem_post () from /lib64/libpthread.so.0
#7 0x000000000041b3e2 in resume_tcr (tcr=0x417e77d0) at
../thread_manager.c:1376
#8 0x000000000041c146 in lisp_resume_tcr (tcr=0x417e77d0) at
../thread_manager.c:1418
#9 0x000000000041a0c8 in handle_exception (signum=<value optimized out>,
info=0x4131eaa0, context=0x4131e6f0, tcr=0x4131f7d0, old_valence=0) at
../x86-exceptions.c:910
#10 0x000000000041a218 in signal_handler (signum=4, info=0x4131eaa0,
context=0x4131e6f0, tcr=0x4131f7d0, old_valence=0) at
../x86-exceptions.c:1064
#11 <signal handler called>
#12 0x00003000400110ab in ?? ()
#13 0x00003000404265fc in ?? ()
#14 0x000000000040e0ac in _SPnthrowvalues () at ../x86-spentry64.s:1404
#15 0x00002aaaad3e0110 in ?? ()
#16 0x0000000000000008 in ?? ()
#17 0x0000000000000000 in ?? ()
(gdb) info threads
10 Thread 0x40263950 (LWP 6218) 0x00002abafdccd2cb in sem_timedwait () from
/lib64/libpthread.so.0
9 Thread 0x404c7950 (LWP 6219) 0x00002abafdccd2cb in sem_timedwait () from
/lib64/libpthread.so.0
8 Thread 0x4072b950 (LWP 6223) 0x00002abafdccd2cb in sem_timedwait () from
/lib64/libpthread.so.0
7 Thread 0x4098f950 (LWP 6224) 0x00002abafdccd2cb in sem_timedwait () from
/lib64/libpthread.so.0
6 Thread 0x40bf3950 (LWP 6225) 0x00002abafdccd2cb in sem_timedwait () from
/lib64/libpthread.so.0
5 Thread 0x410bb950 (LWP 8021) 0x00002abafdccd2cb in sem_timedwait () from
/lib64/libpthread.so.0
* 4 Thread 0x4131f950 (LWP 8307) 0x00002abafdccd325 in sem_post () from
/lib64/libpthread.so.0
3 Thread 0x40e57950 (LWP 8308) 0x00002abafdccd2cb in sem_timedwait () from
/lib64/libpthread.so.0
2 Thread 0x41583950 (LWP 8309) 0x00002abafdccd2cb in sem_timedwait () from
/lib64/libpthread.so.0
1 Thread 0x2abafe223880 (LWP 6215) 0x00002abafdccd2cb in sem_timedwait ()
from /lib64/libpthread.so.0
(gdb)

Osei Poku

2008-07-21 22:02:08 UTC

Post by Gary Byers
(gdb) p/x *(TCR *)0x417e77d0

(gdb) p/x *(TCR *)0x417e77d0
$1 = {next = 0x0, prev = 0x0, single_float_convert = {tag = 0x1, f =
0x0}, linear = 0x0, save_rbp = 0x2aaaadd49ab0, lisp_mxcsr = 0x1920,
foreign_mxcsr = 0x1f80, db_link = 0x0, catch_top = 0x0, save_vsp =
0x2aaaadd49a58, save_tsp = 0x2aaaade5b000, foreign_sp = 0x417e6da0,
cs_area = 0x0, vs_area = 0x0, ts_area = 0x0, cs_limit = 0x415b6000,
bytes_allocated = 0x0,
log2_allocation_quantum = 0x11, interrupt_pending = 0x0, xframe =
0x0, errno_loc = 0x417e7770, ffi_exception = 0x1f80, osid = 0x0,
valence = 0x1, foreign_exception_status = 0x0, native_thread_info =
0x0, native_thread_id = 0x1847, last_allocptr = 0x3000455e0000,
save_allocptr = 0x3000455db200, save_allocbase = 0x3000455c0000,
reset_completion = 0x0, activate = 0x0,
suspend_count = 0x0, suspend_context = 0x0,
pending_exception_context = 0x0, suspend = 0x0, resume = 0x0, flags =
0x0, gc_context = 0x0, termination_semaphore = 0x0, unwinding = 0x0,
tlb_limit = 0x0, tlb_pointer = 0x0, shutdown_count = 0x0, next_tsp =
0x2aaaade5b000, safe_ref_address = 0x0}

To save your eyes scanning,

resume = 0x0

Gary Byers

2008-07-21 22:42:00 UTC

Thanks. Curiouser and curiouser, not only is the "resume" field 0,
but many other fields are as well, including 'next' and 'prev'. (TCR
structures are maintained in a circular, doubly-linked list; this guy
seems to have died and spliced himself out of that list.) Enough
fields are set that this looks like a dead thread rather than a
newly-created one.

The backtrace indicates that this was coming from
'lisp_resume_other_threads()', which is called as part of the expansion
of WITH-OTHER-THREADS-SUSPENDED. And lisp_resume_other_threads()
and lisp_suspend_other_threads() don't bother to grab and release
the lock which allows modification of the tcr list.

I'm not quite sure why what happened happened, but the code that
walks this doubly-linked list suspending and resuming threads should
be confident that other threads aren't splicing themselves on and off
that list while it's being walked.
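
The locking discipline described above can be sketched in C as follows. This is an illustration under stated assumptions, not the kernel's code: `node`, `list_lock`, and the `list_*` functions are hypothetical names standing in for the TCR list and its lock. Every splice and every walk takes the same mutex, so a walker never sees a node that is halfway through unlinking itself. Note that `list_remove` clears the node's links, which matches the 'next = 0x0, prev = 0x0' seen in the dead TCR.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical node standing in for a TCR on a circular, doubly-linked list. */
typedef struct node {
    struct node *next, *prev;
} node;

/* One lock guards every modification and every traversal of the list. */
static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

/* Make 'head' a one-element circular list. */
void list_init(node *head) { head->next = head->prev = head; }

/* Splice 'n' in just before 'head'. */
void list_add(node *head, node *n)
{
    pthread_mutex_lock(&list_lock);
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
    pthread_mutex_unlock(&list_lock);
}

/* Splice 'n' out and clear its links, as a dying thread would. */
void list_remove(node *n)
{
    pthread_mutex_lock(&list_lock);
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->next = n->prev = NULL;
    pthread_mutex_unlock(&list_lock);
}

/* Visit every node except 'head' with the lock held for the whole walk.
   'visit' may be NULL.  Returns the number of nodes visited. */
int list_walk_others(node *head, void (*visit)(node *))
{
    assert(head != NULL);
    int count = 0;
    pthread_mutex_lock(&list_lock);
    for (node *p = head->next; p != head; p = p->next) {
        if (visit != NULL) {
            visit(p);
        }
        count++;
    }
    pthread_mutex_unlock(&list_lock);
    return count;
}
```

In this sketch the visitor must not call list_add() or list_remove() itself, since the mutex is not recursive.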

Osei Poku

2008-08-06 16:59:49 UTC

This thing is not going away....
lisp debugger and gdb session below...

==== lisp debugger session ====

? exception in foreign context
Exception occurred while executing foreign code
? for help
[17455] OpenMCL kernel debugger: ?
(G) Set specified GPR to new value
(R) Show raw GPR/SPR register values
(L) Show Lisp values of tagged registers
(F) Show FPU registers
(S) Find and describe symbol matching specified name
(B) Show backtrace
(T) Show info about current thread
(X) Exit from this debugger, asserting that any exception was handled
(K) Kill OpenMCL process
(?) Show this help
[17455] OpenMCL kernel debugger: R
%rax = 0x0000000000000000 %r8 = 0x0000000000000000
%rcx = 0x0000000000000000 %r9 = 0x000000004072B7D0
%rdx = 0x0000000000000001 %r10 = 0x0000000000000008
%rbx = 0x00000000410BB7D0 %r11 = 0x0000000000000246
%rsp = 0x000000004072A218 %r12 = 0x000000004072B7D0
%rbp = 0x000000004072A6F0 %r13 = 0x000000004072A718
%rsi = 0x0000000000000001 %r14 = 0x0000000000000004
%rdi = 0x0000000000000000 %r15 = 0x000000004072AAA0
%rip = 0x00002B37EEFB3325 %rflags = 0x0000000000010246
[17455] OpenMCL kernel debugger: B

Framepointer [#x4072A6F0] in unknown area.
[17455] OpenMCL kernel debugger: T
Current Thread Context Record (tcr) = 0x4072b7d0
Control (C) stack area: low = 0x404d8000, high = 0x4072c000
Value (lisp) stack area: low = 0x2aaaab0f1000, high = 0x2aaaab302000
Exception stack pointer = 0x4072a218
[17455] OpenMCL kernel debugger: X

==== gdb session ====

(gdb) source local/share/ccl/lisp-kernel/linuxx8664/.gdbinit
No symbol table is loaded. Use the "file" command.
(gdb) attach 17455
Attaching to process 17455
Reading symbols from /home/opoku/local/share/ccl/lx86cl64...done.
Using host libthread_db library "/lib64/libthread_db.so.1".
Reading symbols from /lib64/libdl.so.2...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libm.so.6...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libpthread.so.0...done.
[Thread debugging using libthread_db enabled]
[New Thread 0x2b37ef509880 (LWP 17455)]
[New Thread 0x40e57950 (LWP 17605)]
[New Thread 0x40bf3950 (LWP 17462)]
[New Thread 0x4098f950 (LWP 17461)]
[New Thread 0x4072b950 (LWP 17460)]
[New Thread 0x404c7950 (LWP 17459)]
[New Thread 0x40263950 (LWP 17458)]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /usr/lib64/libssl.so...done.
Loaded symbols for /usr/lib64/libssl.so
Reading symbols from /usr/lib64/libcrypto.so.0.9.8...done.
Loaded symbols for /usr/lib64/libcrypto.so.0.9.8
Reading symbols from /lib64/libz.so.1...done.
Loaded symbols for /lib64/libz.so.1
0x00002b37eefb32cb in sem_timedwait () from /lib64/libpthread.so.0
(gdb) source local/share/ccl/lisp-kernel/linuxx8664/.gdbinit
Redefine command "x86_lisp_string"? (y or n) [answered Y; input not
from terminal]
Redefine command "gtra"? (y or n) [answered Y; input not from terminal]
Redefine command "x86pname"? (y or n) [answered Y; input not from
terminal]
Redefine command "pname"? (y or n) [answered Y; input not from terminal]
Redefine command "l"? (y or n) [answered Y; input not from terminal]
Redefine command "lw"? (y or n) [answered Y; input not from terminal]
Redefine command "clobber_breakpoint"? (y or n) [answered Y; input not
from terminal]
Redefine command "arg_z"? (y or n) [answered Y; input not from terminal]
Redefine command "arg_y"? (y or n) [answered Y; input not from terminal]
Redefine command "arg_x"? (y or n) [answered Y; input not from terminal]
Redefine command "bx"? (y or n) [answered Y; input not from terminal]
Redefine command "showlist"? (y or n) [answered Y; input not from
terminal]
Redefine command "lbt"? (y or n) [answered Y; input not from terminal]
Redefine command "ada"? (y or n) [answered Y; input not from terminal]
Redefine command "lregs"? (y or n) [answered Y; input not from terminal]
Breakpoint 1 at 0x41d780: file ../lisp-debug.c, line 934.
(gdb) br *0x00002B37EEFB3325
Breakpoint 2 at 0x2b37eefb3325
(gdb) continue
Continuing.
[Switching to Thread 0x4072b950 (LWP 17460)]

Breakpoint 2, 0x00002b37eefb3325 in sem_post () from /lib64/libpthread.so.0
2: x/i $pc
0x2b37eefb3325 <sem_post+5>: lock xadd %edx,(%rdi)
(gdb) info thread
7 Thread 0x40263950 (LWP 17458) 0x00002b37eefb32cb in sem_timedwait () from /lib64/libpthread.so.0
6 Thread 0x404c7950 (LWP 17459) 0x00002b37eefb32cb in sem_timedwait () from /lib64/libpthread.so.0
* 5 Thread 0x4072b950 (LWP 17460) 0x00002b37eefb3325 in sem_post () from /lib64/libpthread.so.0
4 Thread 0x4098f950 (LWP 17461) 0x00002b37eefb32cb in sem_timedwait () from /lib64/libpthread.so.0
3 Thread 0x40bf3950 (LWP 17462) 0x00002b37eefb32cb in sem_timedwait () from /lib64/libpthread.so.0
2 Thread 0x40e57950 (LWP 17605) 0x00002b37eefb32cb in sem_timedwait () from /lib64/libpthread.so.0
1 Thread 0x2b37ef509880 (LWP 17455) 0x00002b37eefb32cb in sem_timedwait () from /lib64/libpthread.so.0
(gdb) bt
#0 0x00002b37eefb3325 in sem_post () from /lib64/libpthread.so.0
#1 0x000000000041b3e2 in resume_tcr (tcr=0x4098f7d0) at ../thread_manager.c:1376
#2 0x000000000041bbea in resume_other_threads (for_gc=<value optimized out>) at ../thread_manager.c:1546
#3 0x000000000041d66e in lisp_Debugger (xp=0x404d7e40, info=0x0, why=-1, in_foreign_code=1, message=0x404d7b30 "exception in foreign context") at ../lisp-debug.c:919
#4 0x000000000041d757 in FBug (xp=0x404d7e40, format=<value optimized out>) at ../lisp-debug.c:954
#5 0x0000000000418a07 in altstack_signal_handler (signum=11, info=0x404d7f70, context=0x404d7e40) at ../x86-exceptions.c:1287
#6 <signal handler called>
#7 0x00002b37eefb3325 in sem_post () from /lib64/libpthread.so.0
#8 0x000000000041b3e2 in resume_tcr (tcr=0x410bb7d0) at ../thread_manager.c:1376
#9 0x000000000041bf56 in lisp_resume_tcr (tcr=0x410bb7d0) at ../thread_manager.c:1418
#10 0x000000000041a0c8 in handle_exception (signum=<value optimized out>, info=0x4072aaa0, context=0x4072a6f0, tcr=0x4072b7d0, old_valence=0) at ../x86-exceptions.c:910
#11 0x000000000041a218 in signal_handler (signum=4, info=0x4072aaa0, context=0x4072a6f0, tcr=0x4072b7d0, old_valence=0) at ../x86-exceptions.c:1064
#12 <signal handler called>
#13 0x00003000400110ab in ?? ()
#14 0x0000300040431a5c in ?? ()
#15 0x000000000040e2b4 in _SPnthrow1value () at ../x86-spentry64.s:1516
#16 0x00002aaaab300ff0 in ?? ()
#17 0x0000000000000008 in ?? ()
#18 0xfffffffffffffff8 in ?? ()
#19 0x00002aaaab301110 in ?? ()
#20 0x0000000000000008 in ?? ()
#21 0x0000000000000000 in ?? ()
(gdb) p/x *(TCR *) 0x4098f7d0
$1 = {next = 0x40bf37d0, prev = 0x4072b7d0, single_float_convert =
{tag = 0x1, f = 0x0}, linear = 0x4098f7d0, save_rbp = 0x2aaaab624c40,
lisp_mxcsr = 0x1920, foreign_mxcsr = 0x1f80, db_link = 0x2aaaab624dc0,
catch_top = 0x2aaaab73599d, save_vsp = 0x2aaaab624be8, save_tsp =
0x2aaaab7358e0, foreign_sp = 0x4098ed60, cs_area = 0x6647e0, vs_area =
0x660610, ts_area = 0x6606f0,
cs_limit = 0x4075e000, bytes_allocated = 0x58ef60,
log2_allocation_quantum = 0x11, interrupt_pending = 0x0, xframe = 0x0,
errno_loc = 0x4098f770, ffi_exception = 0x1f80, osid = 0x4098f950,
valence = 0x1, foreign_exception_status = 0x0, native_thread_info =
0x0, native_thread_id = 0x442f, last_allocptr = 0x3000447c0000,
save_allocptr = 0x3000447bd1a0,
save_allocbase = 0x3000447a0000, reset_completion = 0x660550,
activate = 0x660580, suspend_count = 0x0, suspend_context =
0x4098e6f0, pending_exception_context = 0x0, suspend = 0x6604f0,
resume = 0x660520, flags = 0x0, gc_context = 0x0,
termination_semaphore = 0x664990, unwinding = 0x0, tlb_limit = 0x4000,
tlb_pointer = 0x6607a0, shutdown_count = 0x4,
next_tsp = 0x2aaaab7358e0, safe_ref_address = 0x0}
(gdb) p/x *(TCR *) 0x410bb7d0
$2 = {next = 0x0, prev = 0x0, single_float_convert = {tag = 0x1, f =
0x0}, linear = 0x0, save_rbp = 0x2aaaac7d0ab0, lisp_mxcsr = 0x1920,
foreign_mxcsr = 0x1f80, db_link = 0x0, catch_top = 0x0, save_vsp =
0x2aaaac7d0a58, save_tsp = 0x2aaaac8e2000, foreign_sp = 0x410bada0,
cs_area = 0x0, vs_area = 0x0, ts_area = 0x0, cs_limit = 0x40e8a000,
bytes_allocated = 0x0,
log2_allocation_quantum = 0x11, interrupt_pending = 0x0, xframe =
0x0, errno_loc = 0x410bb770, ffi_exception = 0x1f80, osid = 0x0,
valence = 0x1, foreign_exception_status = 0x0, native_thread_info =
0x0, native_thread_id = 0x442f, last_allocptr = 0x300044840000,
save_allocptr = 0x30004483c730, save_allocbase = 0x300044820000,
reset_completion = 0x0, activate = 0x0,
suspend_count = 0x0, suspend_context = 0x0,
pending_exception_context = 0x0, suspend = 0x0, resume = 0x0, flags =
0x0, gc_context = 0x0, termination_semaphore = 0x0, unwinding = 0x0,
tlb_limit = 0x0, tlb_pointer = 0x0, shutdown_count = 0x0, next_tsp =
0x2aaaac8e2000, safe_ref_address = 0x0}
(gdb) down
Bottom (innermost) frame selected; you cannot go down.
(gdb) up
#1 0x000000000041b3e2 in resume_tcr (tcr=0x4098f7d0) at ../thread_manager.c:1376
1376 ../thread_manager.c: No such file or directory.
in ../thread_manager.c
(gdb) up
#2 0x000000000041bbea in resume_other_threads (for_gc=<value optimized out>) at ../thread_manager.c:1546
1546 in ../thread_manager.c
(gdb) up
#3 0x000000000041d66e in lisp_Debugger (xp=0x404d7e40, info=0x0, why=-1, in_foreign_code=1, message=0x404d7b30 "exception in foreign context") at ../lisp-debug.c:919
919 ../lisp-debug.c: No such file or directory.
in ../lisp-debug.c
(gdb) up
#4 0x000000000041d757 in FBug (xp=0x404d7e40, format=<value optimized out>) at ../lisp-debug.c:954
954 in ../lisp-debug.c
(gdb) up
#5 0x0000000000418a07 in altstack_signal_handler (signum=11, info=0x404d7f70, context=0x404d7e40) at ../x86-exceptions.c:1287
1287 ../x86-exceptions.c: No such file or directory.
in ../x86-exceptions.c
(gdb) up
#6 <signal handler called>
(gdb) up
#7 0x00002b37eefb3325 in sem_post () from /lib64/libpthread.so.0
(gdb) up
#8 0x000000000041b3e2 in resume_tcr (tcr=0x410bb7d0) at ../thread_manager.c:1376
1376 ../thread_manager.c: No such file or directory.
in ../thread_manager.c
(gdb) p/x *(TCR *) 0x410bb7d0
$3 = {next = 0x0, prev = 0x0, single_float_convert = {tag = 0x1, f =
0x0}, linear = 0x0, save_rbp = 0x2aaaac7d0ab0, lisp_mxcsr = 0x1920,
foreign_mxcsr = 0x1f80, db_link = 0x0, catch_top = 0x0, save_vsp =
0x2aaaac7d0a58, save_tsp = 0x2aaaac8e2000, foreign_sp = 0x410bada0,
cs_area = 0x0, vs_area = 0x0, ts_area = 0x0, cs_limit = 0x40e8a000,
bytes_allocated = 0x0,
log2_allocation_quantum = 0x11, interrupt_pending = 0x0, xframe =
0x0, errno_loc = 0x410bb770, ffi_exception = 0x1f80, osid = 0x0,
valence = 0x1, foreign_exception_status = 0x0, native_thread_info =
0x0, native_thread_id = 0x442f, last_allocptr = 0x300044840000,
save_allocptr = 0x30004483c730, save_allocbase = 0x300044820000,
reset_completion = 0x0, activate = 0x0,
suspend_count = 0x0, suspend_context = 0x0,
pending_exception_context = 0x0, suspend = 0x0, resume = 0x0, flags =
0x0, gc_context = 0x0, termination_semaphore = 0x0, unwinding = 0x0,
tlb_limit = 0x0, tlb_pointer = 0x0, shutdown_count = 0x0, next_tsp =
0x2aaaac8e2000, safe_ref_address = 0x0}
(gdb)

Wade Humeniuk

2008-08-10 17:19:59 UTC

Permalink

Maybe a hardware problem with your computer? Could be faulty
RAM/Processor/Motherboard..... You said this problem is happening on a
particular machine. Perhaps running some diagnostics might show up something
(though I have no suggestions what that diagnostic program might be.)

Wade

Post by Osei Poku
This thing is not going away....
lisp debugger and gdb session below...
==== lisp debugger session ====
? exception in foreign context
Exception occurred while executing foreign code
? for help
[17455] OpenMCL kernel debugger: ?
(G) Set specified GPR to new value
(R) Show raw GPR/SPR register values
(L) Show Lisp values of tagged registers
(F) Show FPU registers
(S) Find and describe symbol matching specified name
(B) Show backtrace
(T) Show info about current thread
(X) Exit from this debugger, asserting that any exception was handled
(K) Kill OpenMCL process
(?) Show this help
[17455] OpenMCL kernel debugger: R
%rax = 0x0000000000000000 %r8 = 0x0000000000000000
%rcx = 0x0000000000000000 %r9 = 0x000000004072B7D0
%rdx = 0x0000000000000001 %r10 = 0x0000000000000008
%rbx = 0x00000000410BB7D0 %r11 = 0x0000000000000246
%rsp = 0x000000004072A218 %r12 = 0x000000004072B7D0
%rbp = 0x000000004072A6F0 %r13 = 0x000000004072A718
%rsi = 0x0000000000000001 %r14 = 0x0000000000000004
%rdi = 0x0000000000000000 %r15 = 0x000000004072AAA0
%rip = 0x00002B37EEFB3325 %rflags = 0x0000000000010246
[17455] OpenMCL kernel debugger: B
Framepointer [#x4072A6F0] in unknown area.
[17455] OpenMCL kernel debugger: T
Current Thread Context Record (tcr) = 0x4072b7d0
Control (C) stack area: low = 0x404d8000, high = 0x4072c000
Value (lisp) stack area: low = 0x2aaaab0f1000, high = 0x2aaaab302000
Exception stack pointer = 0x4072a218
[17455] OpenMCL kernel debugger: X

Gary Byers

2008-08-10 20:14:43 UTC

Permalink

I've said this (and been wrong) a few times already, but I think that
I (partly) fixed this in svn a few days ago. (Or at least fixed the
part that led to the crash.)

Some things that try to examine the status of a process (PROCESS-WHOSTATE)
do so by briefly suspending and resuming the process. Unfortunately,
the code that does this doesn't reliably ensure that the thread
hasn't exited before we try to suspend it, and trying to (unconditionally)
resume a thread that exited before it was suspended can wind up trying
to signal a NULL semaphore (which is the symptom that Osei is seeing.)

That's sort of a perfect storm of everything that could go wrong
going wrong at the same time. I'm not 100% sure that PROCESS-WHOSTATE
is the culprit; there's at least one other thing (SYMBOL-VALUE-IN-PROCESS)
that does similar things and has similar race conditions that it doesn't
handle.

Whatever the culprit(s) is or are, there are ways to reach the C
function 'resume_tcr()' in the lisp kernel, and that function can
afford to check to see if the semaphore that it's going to signal
is NULL before blindly signaling it. (Not checking - on Linux,
at least - leads to the crash that Osei's seeing.)
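
A minimal C sketch of that NULL check, assuming a cut-down record that holds only a `resume` semaphore pointer (the field that shows as 0x0 in the TCR dump). The names `thread_record` and `resume_if_alive` are invented for illustration and this is not the kernel's resume_tcr(); sem_post() is the only real call used.

```c
#include <assert.h>
#include <semaphore.h>
#include <stddef.h>

/* Hypothetical cut-down thread record; only the field this sketch needs. */
typedef struct {
    sem_t *resume;   /* assumed to be NULL once the thread has exited */
} thread_record;

/* Signal the thread's resume semaphore only if there is one.
   Returns 1 if the semaphore was signaled, 0 otherwise. */
int resume_if_alive(thread_record *t) {
    sem_t *s = t->resume;   /* read the field once, then test the local copy */
    if (s == NULL) {
        return 0;           /* thread already gone: nothing to signal */
    }
    return sem_post(s) == 0;
}
```

Reading the pointer into a local before testing it means the value that was checked is the value that gets passed to sem_post(), even if another thread clears the field in between. It does not by itself stop the semaphore from being destroyed after the check; that still needs a lock or a similar guarantee.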

If you do:

? (process-run-function "do nothing" (lambda ()))

in the listener, you'll probably see the result print as something
like:

#<PROCESS do nothing(9) [Exhausted] #x1058B2ACC>

which basically means that there's no underlying OS-level thread
associated with the process anymore (the process's initial function
had exited by the time the PRINT-OBJECT method was called to print the
result in the REPL).

Depending on the whims of the scheduler, there's a small chance
that the process could print with a WHOSTATE of "Active" (if the
function was a little less trivial or if the thread didn't get
scheduled before the listener thread tried to determine its state.)

I think that there's an even smaller chance that between the time
that PROCESS-WHOSTATE checks for the "exhausted" case and the
time that it does the suspend/resume the process could basically
become "exhausted" (the underlying thread could exit), and resuming
a thread that's exited can wind up signaling a NULL semaphore.
Code that creates and prints a lot of short-lived threads could
run into that timing screw, as could other things that suspend/
resume threads sloppily (SYMBOL-VALUE-IN-PROCESS, :PROC, etc.)

The NULL semaphore problem should be fixed in SVN; there are a few
other bits of sloppiness there that need some more work. I've never
seen this happen (and the PROCESS-WHOSTATE/SYMBOL-VALUE-IN-PROCESS
idea is partly a guess), but someone else reported the same crash
(the NULL semaphore) a few days ago. It might be a little sensitive
to CPU speed/number of cores/scheduler details, but I believe that
this could happen without a hardware problem being involved.

Osei Poku

2008-08-18 18:42:05 UTC

Permalink

Just a quick report... I updated to r10465M-RC1 and have had no
crashes yet. So I'm keeping my fingers crossed :)

Something else strange happened (the same day I updated), where it was
not in the debugger but I could not evaluate any forms both in emacs/
slime and in the plain tty repl. It hasn't happened again since then
so I think I probably screwed something up.

Anyhow, thanks for all the help tracking down this issue and improving
the situation. I was this ( || ) close to ponying up a few thousand
bucks for LW64 :)

Osei

Alexander Repenning

2008-11-10 19:04:23 UTC

Permalink

This may have been discussed in some other context but I cannot find
any trace. Anyway, CCL 1.2 (mac) is usually pretty stable: it works
well with Cocoa in general and even reports some memory management
issues without crashing. But once in a while CCL really does crash,
unfortunately without creating a crashlog file. What is missing? I
have

COREDUMPS=-YES-

in etc/hostconfig

but when I get a log entry like

Nov 10 11:54:15 Ristretto-to-Go-7
com.apple.launchd[67] ([0x0-0x15015].com.clozure.Clozure CL[119]):
Exited: Killed

there is no crash.log

Am I missing something?

all the best, Alex

Prof. Alexander Repenning

University of Colorado
Computer Science Department
Boulder, CO 80309-430

vCard: http://www.cs.colorado.edu/~ralex/AlexanderRepenning.vcf

Hans Hübner

2008-11-10 19:12:16 UTC

Permalink

On Mon, Nov 10, 2008 at 20:04, Alexander Repenning

this may have been discussed in some other context but I cannot find any
trace. Anyway, while usually pretty stable CCL 1.2 (mac) works well with
Cocoa in general and even reports, without crashing on some memory
management issues. But once in a while CCL really does crash but
unfortunately without creating a crashlog file. What is missing? I have
COREDUMPS=-YES-
in etc/hostconfig
but when getting a Nov 10 11:54:15 Ristretto-to-Go-7 com.apple.launchd[67]
([0x0-0x15015].com.clozure.Clozure CL[119]): Exited: Killed
there is no crash.log

To me, this looks as if the operating system has killed the CCL
process, presumably because of a swap space shortage. Check your
'messages' file (presumably /var/log/messages, but could be somewhere
else on Mac OS X) for "out of swap space" messages?

-Hans

Gary Byers

2008-11-10 20:57:52 UTC

Permalink

If you're asking "what should have been logged somewhere but wasn't?",
I don't know. (That's kind of like a Zen koan, only instead of
achieving enlightenment by contemplating it you wind up with a bad
headache.)

If lisp code does something that results in an illegal memory reference,
the lisp kernel catches the resulting exception and signals a lisp error.

? (%get-byte (%null-ptr))

Error: Fault during read of memory address #x0
While executing: %GET-BYTE, in process Listener(5).
Type :POP to abort, :R for a list of available restarts.
Type :? for other options.

1 >

For a simple case like this, we can slap ourselves in the forehead
(figuratively ...) and remind ourselves not to dereference obviously
null pointers. Even in more realistic cases, it may be easy to figure
out what caused the memory fault and convince ourselves that the
damage was localized. (In general, it's possible to scribble randomly
over memory for a while before we try to write to an address that'll
cause a fault, so if we don't understand what caused a memory fault
like this we should view the lisp session with suspicion: if something's
doing incorrect memory accesses, it might have overwritten something
important before writing to an address that caused a fault.) From
the lisp kernel's point of view, trying to report this as a lisp error
is "worth a try", and it often works well in practice.
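
The general idea of catching a fault and turning it into an ordinary, reportable error can be sketched in C with a signal handler that jumps back to a recovery point. This is not how the lisp kernel implements it; `call_guarded`, `faulting_code` and `harmless_code` are hypothetical names, and raise(SIGSEGV) stands in for a real bad memory access so that the example behaves the same everywhere.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>

static sigjmp_buf recover_point;
static volatile sig_atomic_t last_signal = 0;

/* Handler: remember which signal arrived and jump back to the caller. */
static void on_fault(int signum) {
    last_signal = signum;
    siglongjmp(recover_point, 1);
}

/* Run risky() with a SIGSEGV handler installed.
   Returns 0 if risky() completed, or the signal number if it faulted. */
int call_guarded(void (*risky)(void)) {
    struct sigaction sa, old;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_fault;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, &old);

    int result = 0;
    if (sigsetjmp(recover_point, 1) == 0) {
        risky();                 /* normal path */
    } else {
        result = last_signal;    /* we got here via siglongjmp */
    }
    sigaction(SIGSEGV, &old, NULL);   /* restore the previous handler */
    return result;
}

/* Stand-ins for real code: one faults, one does not. */
void faulting_code(void) { raise(SIGSEGV); }
void harmless_code(void) { }
```

Calling `call_guarded(faulting_code)` returns SIGSEGV instead of killing the process, and later calls still work. As the paragraph above warns, recovering this way says nothing about what was overwritten before the fault.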

If foreign (C) code does an invalid memory access, it's much harder to
know how to recover from that: we don't know what state that foreign
code may have changed and we don't know what the consequences of
signaling a lisp error in the middle of some unknown foreign code
might be. (E.g., if we get a fault in the middle of #_malloc or
something similar, trying to signal a lisp error at that point might
just lead to a lot of secondary problems and not get very far.)

When any kind of unhandled exception (memory fault or other) happens
in foreign code, the lisp enters its kernel debugger. It's not much
of a debugger, and what there is of it is oriented towards printing
lisp objects (with varying degrees of success ...) and lisp
backtraces. There's a little information in the Wiki about debugging
under GDB:

<http://trac.clozure.com/openmcl/wiki/CclUnderGdb>

but it's probably fair to say that trying to figure out how/why
some foreign code crashed can be a hard problem. (Many great
minds have spent countless hours on this problem ...)

If we're running the lisp as a non-OSX-GUI application and we
do something like:

? (ff-call (%null-ptr) :void)

we get:

Unhandled exception 10 at 0x0, context->regs at #xb029b8f0
Exception occurred while executing foreign code
? for help
[50778] OpenMCL kernel debugger:

Well, yes: we did a foreign function call to an invalid address,
and now we're pretty much stuck. In a more realistic example -
where we were in some real foreign code and that code caused a
fault - the kernel debugger will try to print the name of a
known foreign function whose address is near the PC at the time
of the exception.

We can ask the kernel debugger to show us the values of the machine
registers (x8664 in this case):

[50778] OpenMCL kernel debugger: r
%rax = 0x0000000000000000 %r8 = 0x000000000000031a
%rcx = 0x00000000006a5a30 %r9 = 0x00000000001047f0
%rdx = 0x00000000b029bde0 %r10 = 0x00003000400090f4
%rbx = 0x0000000000104be0 %r11 = 0x0000000000000000
%rsp = 0x00000000b029bdc8 %r12 = 0x0000000000000000
%rbp = 0x00000000b029bdd0 %r13 = 0x0000000000000000
%rsi = 0x0000000000000200 %r14 = 0x000000000001300b
%rdi = 0x00000000001047c0 %r15 = 0x0000000000000200
%rip = 0x0000000000000000 %rflags = 0x00010206

which shows us that %rip (the instruction pointer/program counter) is
at address 0, and if we try to get a lisp backtrace at this point
we can see how we got here (this may or may not work in 1.2):

(#x00000000006A5A58) #x000030004000821C : #<Function %DO-FF-CALL #x00003000400081CF> + 77
(#x00000000006A5A68) #x00003000400090F4 : #<Function %FF-CALL #x00003000400082CF> + 3621
(#x00000000006A5AE0) #x00003000404C5A84 : #<Function CALL-CHECK-REGS #x00003000404C599F> + 229
(#x00000000006A5B18) #x00003000404BCA9C : #<Function TOPLEVEL-EVAL #x00003000404BC7BF> + 733
(#x00000000006A5BB8) #x00003000404BEB0C : #<Function READ-LOOP #x00003000404BE3EF> + 1821
(#x00000000006A5DD8) #x00003000404C556C : #<Function TOPLEVEL-LOOP #x00003000404C54EF> + 125

from which we -might- be able to conclude that FF-CALLing a null pointer
is a bad idea. (This example may not convince anyone who's skeptical
of my assertion that it's hard to reliably recover from an exception
in foreign code; I honestly do think that that's a hard problem.)

The kernel debugger just writes to the (Unix) process-level standard error
descriptor and reads from the process's standard input.

An OS X GUI application's standard I/O descriptors are ordinarily
redirected: input usually comes from /dev/null (the null device, which
always returns EOF on input) and output and error (supposedly) go to a
logfile somewhere. (On Leopard, "somewhere" seems to be
/private/tmp.) It's probably the case that we get the EOF (reading
from /dev/null) before anything's actually flushed to that logfile
when the kernel debugger's entered from the IDE.
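
The effect of reading commands from /dev/null is easy to demonstrate in C. The sketch below is a toy command loop, not the kernel debugger's code; `command_loop` is a hypothetical name. It reads one-character commands until it sees 'x' or runs out of input, and reports which of the two happened.

```c
#include <assert.h>
#include <stdio.h>

/* Read one-character commands from in until 'x' or end-of-file.
   Newlines are skipped. Returns how many commands were read and sets
   *hit_eof to 1 if the loop ended because input ran out. */
int command_loop(FILE *in, int *hit_eof) {
    int commands = 0;
    int c;
    *hit_eof = 0;
    while ((c = getc(in)) != EOF) {
        if (c == '\n') {
            continue;
        }
        commands++;
        if (c == 'x') {
            return commands;
        }
    }
    *hit_eof = 1;
    return commands;
}
```

With a stream opened on /dev/null the loop returns immediately with zero commands and `*hit_eof` set, which matches the behaviour described for a GUI application whose standard input has been redirected.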

While waiting for someone to figure out what to do about that ...
you can run a GUI application in Terminal (or equivalent); when
it's run this way, its standard I/O file descriptors remain unchanged
(and therefore the kernel debugger works.) The general idea is
to invoke the executable program inside the .app bundle:

shell> /path/to/Clozure\ CL.app/Contents/MacOS/dx86cl64

The good news is that that'll leave standard I/O attached to the
"terminal" (or Emacs shell buffer, or ...) and it's possible to
interact with the kernel debugger (and entering the kernel debugger
won't cause the lisp to exit unless/until it gets an EOF when
reading from standard input). The bad news is that the standard
error of a GUI application often gets filled with diagnostic
messages that are probably more meaningful to whoever wrote them
than to anyone else, and the fact that the kernel debugger
is better than nothing doesn't mean that it's a whole lot better
than nothing ...

There are a variety of reasons why Apple's Crash Reporter doesn't
get invoked in this case (they're related to the reasons why it
sometimes gets invoked whenever some lisps get exceptions that
they routinely handle.) If it were invoked, it wouldn't be
able to make a whole lot of sense out of the lisp-specific side
of things. (If lisp crashes generated Crash Reporter logs, I
wouldn't often find them very useful and I doubt if other people
would, either.) Generating something somewhat like a crash
reporter log would be useful (even if that's equivalent to
having the kernel debugger invoke as many of its options as
might be useful and save the output somewhere.) Just exiting
on EOF because the EOF comes from /dev/null is probably less
useful.

In the short term, running the IDE from the terminal might be enough
to let the kernel debugger point you in the general direction of the
problem.

Alexander Repenning

2008-11-13 23:57:33 UTC

Permalink

It is simple to write a little code that produces a LOT of documentation
with Lisp. The trivial little hack below produces documentation for
the entire CL class tree. Especially when classes
include :documentation, even the code below is somewhat useful.
However, my real question is this: is anybody aware of some JavaDoc-like
tool for Lisp? That is, something like the thing below but with
formatted output (e.g., HTML with style sheet)? It would seem to be
such an obvious and simple thing to do in Lisp that I would assume it
already exists.

Alex

;; ---- lisp-doc.lisp -------

(in-package :ccl)

(defparameter *Classes-Documented* (make-hash-table))

(defun RENDER-CLASS-DOC (Classes &optional (Level 0))
  (when (not (listp Classes))
    (render-class-doc (list Classes) Level)
    (return-from render-class-doc))
  (when (= Level 0) (setf *Classes-Documented* (make-hash-table)))
  (dolist (Class Classes)
    (when (symbolp Class) (setf Class (find-class Class)))
    (unless (gethash (slot-value Class 'name) *Classes-Documented*)
      (setf (gethash (slot-value Class 'name) *Classes-Documented*) Class)
      (dotimes (I (* 2 Level)) (princ #\space))
      (princ (slot-value Class 'name))
      (let ((Documentation (documentation (slot-value Class 'name) 'type)))
        (when Documentation
          (format t ": ~A" Documentation)))
      (terpri)
      (let ((Slot-Names (mapcar #'slot-definition-name
                                (class-direct-slots Class))))
        (when Slot-Names
          (dotimes (I (* 2 (+ Level 2))) (princ #\space))
          (format t "slots: ")
          (dolist (Slot-Name (butlast Slot-Names))
            (format t "~:(~A~), " Slot-Name))
          (format t "~:(~A~)" (first (last Slot-Names)))
          (terpri)))
      (render-class-doc (slot-value Class 'direct-subclasses) (1+ Level)))))

#| Examples:

(render-class-doc 'number) ;; not much :documentation here in CCL

(render-class-doc 't) ;; same here

|#

R. Matthew Emerson

2008-11-14 00:42:29 UTC

Permalink

You might look at the links at the following page:

http://www.cliki.net/Documentation%20tool

(I don't use any of them, so I can't really offer any opinions on
which ones are good/bad.)

Robert Goldman

2008-11-14 10:35:08 UTC

Permalink

Edi Weitz has developed an asdf package, documentation-template
(http://www.weitz.de/documentation-template/), that grovels over the
symbols of a package and assembles them into an HTML manual. We have
used it at my company, because it's simple, but it's not very general
--- it's very tailored to Edi's own uses, and he's not able to support
it. We have modified it to be more general (e.g., allow for different
licenses, different download instructions, apply an arbitrary function
to filter the symbols whose documentation is to be incorporated, etc.),
but haven't released our changes. We could probably be persuaded to do
so, if anyone was interested.

Gary King has a much more ambitious doc tool, but it relies on a very
large tree of software libraries. We have been too cautious to use it
for that reason.

One thing that would be nice would be if we had a markup language (e.g.,
Markdown, texinfo) to use in documentation strings that would be
readable by just invoking DOCUMENTATION, but that could be postprocessed
to support hyperlinks and rudimentary text attributes.

Best,
Robert

Joshua TAYLOR

2008-11-14 14:14:34 UTC

Permalink

For smaller projects, Edi's documentation template is fairly nice, but
it usually requires a fair amount of modification afterward, e.g., if
you want the table of contents to be something other than alphabetic.
If you want something a bit more like JavaDoc, you might try CLDOC
(http://common-lisp.net/project/cldoc/):

"Unlike Albert it does not allow programmers to insert comments at
the source code level which are incorporated into the generated
documentation. Its goal was not to produce a LispDoc ala JavaDoc but
to create a simple and easy way to take advantage of the Lisp
documentation strings. So instead of copying and pasting it in some
commentary section with extra special documentation tool markup stuff,
the idea was to find an elegant way of parsing the doc string. "

I do recognize that I compared it to JavaDoc, and that they point out
that it's /not/ "ala JavaDoc", but between the style it encourages in
docstrings and the HTML output, I think there are some significant
similarities.

The CLDOC documentation (generated by CLDOC, so it's an example) is
available at http://common-lisp.net/project/cldoc/HTMLdoc/ .

//JT
(I have no affiliation with CLDOC, but I've used it in the past and
have been rather happy with the results.)


--
=====================
Joshua Taylor
***@cs.rpi.edu, ***@alum.rpi.edu

"In the Mountains of New Hampshire,
God Almighty has hung out a sign
to show that there He makes men."
Daniel Webster

"A lot of good things went down one time,
back in the goodle days."
John Hartford

Daniel Dickison

2008-11-14 15:26:08 UTC

Permalink

There is one called Tinaa by Gary King (http://metabang.com), which
works at the ASDF level to document a system and its ASDF
dependencies. I've used it before and it's quite nice. If you have
CL-Markdown loaded, it'll apply Markdown formatting to all of your
docstrings.

http://common-lisp.net/project/tinaa/
http://common-lisp.net/project/cl-markdown/

Osei Poku

2008-07-18 16:29:37 UTC

Permalink

The following info might also be useful..

[3268] OpenMCL kernel debugger: R
%rax = 0x0000000000000000 %r8 = 0x0000000000000000
%rcx = 0x0000000000000000 %r9 = 0x0000000040E577D0
%rdx = 0x0000000000000001 %r10 = 0x0000000000000008
%rbx = 0x00000000415837D0 %r11 = 0x0000000000000246
%rsp = 0x0000000040E56218 %r12 = 0x0000000040E577D0
%rbp = 0x0000000040E566F0 %r13 = 0x0000000040E56718
%rsi = 0x0000000000000001 %r14 = 0x0000000000000004
%rdi = 0x0000000000000000 %r15 = 0x0000000040E56AA0
%rip = 0x00002ADAFE2CA325 %rflags = 0x0000000000010246
[3268] OpenMCL kernel debugger: x
Unhandled exception 11 at 0x2adafe2ca325, context->regs at #x40e55d88
Exception occurred while executing foreign code
? for help
[3268] OpenMCL kernel debugger: x
exception in foreign context
Exception occurred while executing foreign code
? for help
[3268] OpenMCL kernel debugger: x
Unhandled exception 11 at 0x2adafe2ca325, context->regs at #x40e55d88
Exception occurred while executing foreign code
? for help
[3268] OpenMCL kernel debugger: t
Current Thread Context Record (tcr) = 0x40e577d0
Control (C) stack area: low = 0x40c04000, high = 0x40e58000
Value (lisp) stack area: low = 0x2aaaacfa1000, high = 0x2aaaad1b2000
Exception stack pointer = 0x40e56218

Yes; there are 3 calls to pthread_kill() in that file. One of them
(in resume_tcr()) is conditionalized out; the other two
(in raise_thread_interrupt() and suspend_tcr()) should check
to make sure that the thread that they'd pass as the first
argument to pthread_kill() is non-zero before doing the call.

Post by Osei Poku
Clozure Common Lisp Version 1.2-r10073M-RC1 (LinuxX8664)!
Is there anything other than (rebuild-ccl :force t) that I need to
do to recompile the c source for the lisp kernel?

As Gail just pointed out, :full t (or :kernel t) is necessary
in order to get the kernel updated. (:force t will recompile
FASLs even if they're newer than the corresponding source;
that's occasionally useful, but not really what you want here.)
If the kernel that you're running had its modified date change
by the rebuild process, it likely incorporates those changes. If
those changes didn't fix the problem, then I don't have a good
guess as to what the problem is: there aren't too many places
where the lisp calls into the threads library: it creates threads
and sends them signals via pthread_kill(). (There's another place
where a thread will send itself a signal via pthread_kill(),
but that is pretty much guaranteed to be a valid thread ...)

Post by Osei Poku
Thanks,
Osei

Post by Gary Byers

Post by Osei Poku
Hi,
It crashed again for me. This time I managed to grab the contents of
/proc/pid/maps before I killed it. Logs of the tty session and memory
maps are attached. I had also managed to update from the repository to
r9890-RC1.
Osei

It seems to have crashed in the threads library (libpthread.so).
There's a race condition in the code which suspends threads
on entry to the GC: the thread that's running the GC looks
at each thread that it wants to suspend to see if it's
still alive (the data structure that represents a thread
might still be around, even if the OS-level thread has
exited.) The suspending thread looks at the tcr->osid
field of the target, notes that it's non-zero, then
calls a function to send the os-level thread a signal.
That function accesses the tcr->osid field again (which,
when non-zero, represents a POSIX thread ID) and calls
pthread_kill().
When a thread dies, it clears its tcr->osid field, so
if the target thread dies between the point when the
suspending thread looks and the point where it leaps,
we wind up calling pthread_kill() with a first argument
of 0, and it crashes. That's consistent with the
register information: we're somewhere in the threads
library (possibly in pthread_kill()), and the register
in which C functions receive their first argument (%rdi)
is 0.
I'll try to check in a fix for that (look before leaping)
soon. As I understand it, SLIME will sometimes (depending
on the setting of a "communication style" variable)
spawn a thread in which to run each form being evaluated
(via C-M-x or whatever); whether that's a good idea or
not, consing short-lived threads all the time is probably
a good way to trigger this bug. I don't use SLIME, and
don't know what the consequences of changing the communication
style variable would be.
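
The "look before leaping" fix for the tcr->osid race described above might look roughly like this in C. It is a sketch under assumptions, not the checked-in fix: `thread_record` and `signal_if_alive` are hypothetical names, and comparing a pthread_t with 0 assumes it is an integer type, as it is on Linux.

```c
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <stddef.h>

/* Hypothetical cut-down thread record. */
typedef struct {
    pthread_t osid;   /* assumed to be zeroed by the thread when it exits */
} thread_record;

/* Send sig to the thread recorded in t unless its osid is already 0.
   Returns -1 if the call was skipped, otherwise pthread_kill()'s result
   (0 on success, an error number on failure). */
int signal_if_alive(thread_record *t, int sig) {
    pthread_t id = t->osid;   /* a single read; the check and the call share it */
    if (id == (pthread_t)0) {
        return -1;            /* thread has exited: do not call pthread_kill */
    }
    return pthread_kill(id, sig);
}
```

Passing 0 as `sig` makes pthread_kill() do its error checking without delivering a signal, which is a convenient way to try the function against the calling thread. Copying osid into a local removes the second read of the field, so a 0 is never passed to pthread_kill(). A thread that exits after the read can still leave a stale ID behind, so this narrows the window rather than closing it.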

Osei Poku

2008-07-18 16:32:11 UTC

Permalink

More debug info... Sorry about the multiple emails, I'm figuring
things out as I go.

(gdb) info threads
9 Thread 0x40263950 (LWP 3271) 0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
8 Thread 0x404c7950 (LWP 3272) 0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
7 Thread 0x4072b950 (LWP 3305) 0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
6 Thread 0x4098f950 (LWP 3306) 0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
5 Thread 0x40bf3950 (LWP 3307) 0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
4 Thread 0x40e57950 (LWP 6093) 0x00002adafe591bfb in read () from /lib64/libc.so.6
3 Thread 0x4131f950 (LWP 6094) 0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
2 Thread 0x410bb950 (LWP 6095) 0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
1 Thread 0x2adafe820880 (LWP 3268) 0x00002adafe2ca2cb in sem_timedwait () from /lib64/libpthread.so.0
(gdb) thread 4
[Switching to thread 4 (Thread 0x40e57950 (LWP 6093))]#0 0x00002adafe591bfb in read () from /lib64/libc.so.6
(gdb) bt
#0 0x00002adafe591bfb in read () from /lib64/libc.so.6
#1 0x00002adafe545553 in _IO_file_underflow () from /lib64/libc.so.6
#2 0x00002adafe545d0e in _IO_default_uflow () from /lib64/libc.so.6
#3 0x00002adafe541404 in getc () from /lib64/libc.so.6
#4 0x000000000041d43d in readc () at /usr/include/bits/stdio.h:43
#5 0x000000000041d590 in lisp_Debugger (xp=0x40e55d60, info=0x40e56110, why=11, in_foreign_code=1, message=0x40e55b10 "Unhandled exception 11 at 0x2adafe2ca325, context->regs at #x40e55d88") at ../lisp-debug.c:914
#6 0x000000000041a2c6 in signal_handler (signum=11, info=0x40e56110, context=0x40e55d60, tcr=0x40e577d0, old_valence=1) at ../x86-exceptions.c:1070
#7 <signal handler called>
#8 0x00002adafe2ca325 in sem_post () from /lib64/libpthread.so.0
#9 0x000000000041b3e2 in resume_tcr (tcr=0x415837d0) at ../thread_manager.c:1376
#10 0x000000000041c146 in lisp_resume_tcr (tcr=0x415837d0) at ../thread_manager.c:1418
#11 0x000000000041a0c8 in handle_exception (signum=<value optimized out>, info=0x40e56aa0, context=0x40e566f0, tcr=0x40e577d0, old_valence=0) at ../x86-exceptions.c:910
#12 0x000000000041a218 in signal_handler (signum=4, info=0x40e56aa0, context=0x40e566f0, tcr=0x40e577d0, old_valence=0) at ../x86-exceptions.c:1064
#13 <signal handler called>
#14 0x00003000400110ab in ?? ()
#15 0x000030004042660c in ?? ()
#16 0x000000000040e0ac in _SPnthrowvalues () at ../x86-spentry64.s:1404
#17 0x00002aaaad1b1110 in ?? ()
#18 0x0000000000000008 in ?? ()
#19 0x0000000000000000 in ?? ()
(gdb)

Yes; there are 3 calls to pthread_kill() in that file. One of them
(in resume_tcr()) is conditionalized out; the other two
(in raise_thread_interrupt() and suspend_tcr()) should check
that the thread ID they'd pass as the first argument to
pthread_kill() is non-zero before making the call.

Post by Osei Poku
Clozure Common Lisp Version 1.2-r10073M-RC1 (LinuxX8664)!
Is there anything other than (rebuild-ccl :force t) that I need to
do to recompile the c source for the lisp kernel?

As Gail just pointed out, :full t (or :kernel t) is necessary
in order to get the kernel updated. (:force t will recompile
FASLs even if they're newer than the corresponding source;
that's occasionally useful, but not really what you want here.)
If the kernel that you're running had its modified date change
by the rebuild process, it likely incorporates those changes. If
those changes didn't fix the problem, then I don't have a good
guess as to what the problem is: there aren't too many places
where the lisp calls into the threads library: it creates threads
and sends them signals via pthread_kill(). (There's another place
where a thread will send itself a signal via pthread_kill(),
but that is pretty much guaranteed to be a valid thread ...)

Post by Osei Poku
Thanks,
Osei

Post by Gary Byers

Post by Osei Poku
Hi,
It crashed again for me. This time I managed to grab the contents of
/proc/pid/maps before I killed it. Logs of the tty session and memory
maps are attached. I had also managed to update from the repository to
r9890-RC1.
Osei

It seems to have crashed in the threads library (libpthread.so).
There's a race condition in the code which suspends threads
on entry to the GC: the thread that's running the GC looks
at each thread that it wants to suspend to see if it's
still alive (the data structure that represents a thread
might still be around, even if the OS-level thread has
exited.) The suspending thread looks at the tcr->osid
field of the target, notes that it's non-zero, then
calls a function to send the os-level thread a signal.
That function accesses the tcr->osid field again (which,
when non-zero, represents a POSIX thread ID) and calls
pthread_kill().
When a thread dies, it clears its tcr->osid field, so
if the target thread dies between the point when the
suspending thread looks and the point where it leaps,
we wind up calling pthread_kill() with a first argument
of 0, and it crashes. That's consistent with the
register information: we're somewhere in the threads
library (possibly in pthread_kill()), and the register
in which C functions receive their first argument (%rdi)
is 0.
I'll try to check in a fix for that (look before leaping)
soon. As I understand it, SLIME will sometimes (depending
on the setting of a "communication style" variable)
spawn a thread in which to run each form being evaluated
(via C-M-x or whatever); whether that's a good idea or
not, consing short-lived threads all the time is probably
a good way to trigger this bug. I don't use SLIME, and
don't know what the consequences of changing the communication
style variable would be.