Use of 'primitive-fork' in (guix inferior) leads to hangs in 'cuirass evaluate'

  • Done
  • quality assurance status badge
Details
6 participants
  • Josselin Poiret
  • Ludovic Courtès
  • Christopher Baines
  • Maxim Cournoyer
  • Maxime Devos
  • Mathieu Othacehe
Owner
unassigned
Submitted by
Maxim Cournoyer
Severity
important
M
M
Maxim Cournoyer wrote on 16 May 2022 05:49
[cuirass] hang in "In progress..."; runs out of pgsql connections
(name . bug-guix)(address . bug-guix@gnu.org)
878rr2kug6.fsf_-_@gmail.com
Hello Guix,

Since a couple days, Craps exhibits a new problem, where the
evaluations on master appear stalled, showing as "In progress..."
forever.

Eventually, it seems the maximum number of PostgreSQL connections
allowed (300, if what Ricardo wrote on #guix is accurate :-)) is
reached, and we get the following error:

Toggle snippet (30 lines)
Backtrace:
In ice-9/boot-9.scm:
1752:10 11 (with-exception-handler _ _ #:unwind? _ # _)
In unknown file:
10 (apply-smob/0 #<thunk 7f4455f8b340>)
In ice-9/boot-9.scm:
724:2 9 (call-with-prompt _ _ #<procedure default-prompt-handle?>)
In ice-9/eval.scm:
619:8 8 (_ #(#(#<directory (guile-user) 7f4455f8ec80>)))
In cuirass/ui.scm:
104:10 7 (run-cuirass-command _ . _)
In cuirass/scripts/evaluate.scm:
101:7 6 (cuirass-evaluate _)
In cuirass/utils.scm:
123:6 5 (make-worker-thread-channel _ #:parallelism _)
In srfi/srfi-1.scm:
634:9 4 (for-each #<procedure 7f4451dd9e40 at cuirass/utils.sc?> ?)
In cuirass/utils.scm:
125:21 3 (_ . _)
In cuirass/scripts/evaluate.scm:
101:7 2 (_)
In cuirass/database.scm:
375:13 1 (db-open #:database _)
In squee.scm:
257:10 0 (connect-to-postgres-paramstring _)

squee.scm:257:10: In procedure connect-to-postgres-paramstring:
Throw to key `psql-connect-error' with args `(connection-bad "FATAL: remaining connection slots are reserved for non-replication superuser connections\n")'.


Restarting postgresql workarounds the above error, but it doesn't fix
the hang, and the problem connection problem recurs.

I've logged at the Cuirass and messages logs, but I haven't spot
anything abnormal.

Thanks,

Maxim
M
M
Mathieu Othacehe wrote on 16 May 2022 09:32
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)
87pmke6ig4.fsf@gnu.org
Hello Maxim,

Thanks for the report, the root cause here is the evaluation
hanging. The Postgres slots starvation is just a symptom caused by
having many concurrent (stuck) evaluation processes using slots.


The following lines can be caused by having a multi-threaded program:
(cuirass scripts evaluate) because of its par-for-each procedure, call
"primitive-fork" in open-bidirectional-pipe of (guix inferior).

Toggle snippet (5 lines)
warning: call to primitive-fork while multiple threads are running;
further behavior unspecified. See "Processes" in the
manual, for more information.

Not sure that it could cause the observed hang, Ludo what do you think?

Thanks,

Mathieu
L
L
Ludovic Courtès wrote on 16 May 2022 10:40
control message for bug #55441
(address . control@debbugs.gnu.org)
874k1pc1ld.fsf@gnu.org
severity 55441 important
quit
L
L
Ludovic Courtès wrote on 16 May 2022 11:13
Re: bug#55441: [cuirass] hang in "In progress..."; runs out of pgsql connections
(name . Mathieu Othacehe)(address . othacehe@gnu.org)
87o7zxalhu.fsf@gnu.org
Hi!

Mathieu Othacehe <othacehe@gnu.org> skribis:

Toggle quote (12 lines)
>
> The following lines can be caused by having a multi-threaded program:
> (cuirass scripts evaluate) because of its par-for-each procedure, call
> "primitive-fork" in open-bidirectional-pipe of (guix inferior).
>
> warning: call to primitive-fork while multiple threads are running;
> further behavior unspecified. See "Processes" in the
> manual, for more information.
>
> Not sure that it could cause the observed hang, Ludo what do you think?

As you write, the warning comes from ‘open-bidirectional-pipe’, added in
commit bd86bbd300474204878e927f6cd3f0defa1662a5. I noticed it back then
but as far as I could see, it was harmless (but definitely suboptimal):


Indeed, “make cuirass-jobs” works for me.

I decided to go ahead and run ‘cuirass evaluate’ by hand to gather more
info:

Toggle snippet (18 lines)
ludo@berlin ~$ sudo su - cuirass -s /bin/sh -c "/gnu/store/qamrzl942khb7w0cb4csyflnxfkwvvxq-cuirass-1.1.0-11.9f08035/bin/cuirass evaluate dbname=cuirass 323183"
Computing Guix derivation for 'x86_64-linux'... |
warning: call to primitive-fork while multiple threads are running;
further behavior unspecified. See "Processes" in the
manual, for more information.
warning: call to primitive-fork while multiple threads are running;
further behavior unspecified. See "Processes" in the
manual, for more information.
warning: call to primitive-fork while multiple threads are running;
further behavior unspecified. See "Processes" in the
manual, for more information.
warning: call to primitive-fork while multiple threads are running;
further behavior unspecified. See "Processes" in the
manual, for more information.
2022-05-16T09:56:43 Registering builds for evaluation 323183.
2022-05-16T10:08:47 Registering builds for evaluation 323183.

I straced it in another terminal. It did its work for a long while, and
eventually activity stopped:

Toggle snippet (38 lines)
read(228, "nu/store/7gi4qsakz775z5wcmddg8xan7q806sj1-findutils-4.8.0\")) (#:nix-name . \"findutils-4.8.0\") (#:sys"..., 4096) = 4096
read(228, "nrd-gcc-cross-x86_64-w64-mingw32-10.3.0.drv\" \"/gnu/store/ykrksw421rd22zh4w3pc751r90wc881n-patch-2.7."..., 4096) = 4096
read(228, "wm-binutils-cross-x86_64-w64-mingw32-2.37.drv\" \"/gnu/store/xfh4965w2amrsl7x0h3dslrhndzn5bbz-linux-li"..., 4096) = 4096
read(228, "1ir-gawk-5.1.0.drv\" \"/gnu/store/p1d2w4fg1ljbyfs0ayv854snahcimamk-libunistring-0.9.10.drv\" \"/gnu/stor"..., 4096) = 4096
read(228, "3ysc-guile-2.2.7.drv\" \"/gnu/store/b0ma6l61ad4dfb3030sl1ngvdlicc523-glibc-utf8-locales-2.33.drv\" \"/gn"..., 4096) = 4096
read(228, "j6a5mc7w13xiw8136y4brv7w-xz-5.2.5.drv\" \"/gnu/store/w1bxsf02c4y0hj0nd39awv7i55ych641-libffi-3.3.drv\" "..., 4096) = 4096
read(228, "tore/qbqgamcpwgs2sw8lcr6pxl5y38fpyz0d-sed-4.8.drv\" \"/gnu/store/qmsbxbgb758h6f7018snxlwn69w9w9i4-diff"..., 4096) = 4096
read(228, "oconf-2.69.drv\" \"/gnu/store/6n5mn2vkl4x4fab37sx1rmqyqxqcwd4y-libidn2-2.3.1.drv\" \"/gnu/store/6q5cp0m2"..., 4096) = 4096
mmap(0x7fddd26af000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fddd268f000
read(228, "-binutils-cross-x86_64-w64-mingw32-2.37.drv\" \"/gnu/store/wdmk9llhdzx5y10mziss1i5z6q037a7z-guile-3.0."..., 4096) = 954
futex(0x7fdf015381f0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7fdf01546b00, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x7fdf015381f4, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7fdf01546b00, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7fdf01538f74, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGPWR {si_signo=SIGPWR, si_code=SI_TKILL, si_pid=64599, si_uid=997} ---
rt_sigsuspend(~[INT QUIT ABRT BUS SEGV TERM XCPU RTMIN RT_1], 8) = ? ERESTARTNOHAND (To be restarted if no handler)
--- SIGXCPU {si_signo=SIGXCPU, si_code=SI_TKILL, si_pid=64599, si_uid=997} ---
rt_sigreturn({mask=~[INT QUIT ABRT BUS KILL SEGV TERM STOP RTMIN RT_1]}) = -1 EINTR (Interrupted system call)
rt_sigreturn({mask=[]}) = 202
futex(0x7fdf01538f74, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGPWR {si_signo=SIGPWR, si_code=SI_TKILL, si_pid=64599, si_uid=997} ---
rt_sigsuspend(~[INT QUIT ABRT BUS SEGV TERM XCPU RTMIN RT_1], 8) = ? ERESTARTNOHAND (To be restarted if no handler)
--- SIGXCPU {si_signo=SIGXCPU, si_code=SI_TKILL, si_pid=64599, si_uid=997} ---
rt_sigreturn({mask=~[INT QUIT ABRT BUS KILL SEGV TERM STOP RTMIN RT_1]}) = -1 EINTR (Interrupted system call)
rt_sigreturn({mask=[]}) = 202
futex(0x7fdf01538f74, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
[…]
--- SIGXCPU {si_signo=SIGXCPU, si_code=SI_TKILL, si_pid=64599, si_uid=997} ---
rt_sigreturn({mask=~[INT QUIT ABRT BUS KILL SEGV TERM STOP RTMIN RT_1]}) = -1 EINTR (Interrupted system call)
rt_sigreturn({mask=[]}) = 202
futex(0x7fdf01538f74, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x7fdf01546c00, FUTEX_WAKE_PRIVATE, 1) = 0
close(224) = 0
futex(0x7fdf01538f70, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY
[stops here]

The ‘read’ calls are about getting the jobset alists; FD 228 is a socket
(to a ‘guix repl’ subprocess I presume). SIGPWR/SIGXCPU suggests
attempts to make a stop-the-world GC.

GDB shows that the process has 120 threads (it’s a 96-core machine, but
still…) and most threads are stuck in pthread_cond_wait:

Toggle snippet (58 lines)
(gdb) bt
#0 0x00007fdf01d62afa in __futex_abstimed_wait_common64 (futex_word=futex_word@entry=0x7fded20cd8b4,
expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0,
cancel=cancel@entry=true) at ../sysdeps/nptl/futex-internal.c:74
#1 0x00007fdf01d62b5b in __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7fded20cd8b4,
expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0)
at ../sysdeps/nptl/futex-internal.c:123
#2 0x00007fdf01d5cab0 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7fdef9c15fc0,
cond=cond@entry=0x7fded20cd888) at pthread_cond_wait.c:504
#3 __pthread_cond_wait (cond=cond@entry=0x7fded20cd888, mutex=mutex@entry=0x7fdef9c15fc0) at pthread_cond_wait.c:619
#4 0x00007fdf01ea2a55 in scm_pthread_cond_wait (cond=cond@entry=0x7fded20cd888, mutex=mutex@entry=0x7fdef9c15fc0)
at threads.c:1605
#5 0x00007fdf01ea7b6b in block_self (
queue=((#<smob thread 7fdedfd05eb0>) #<smob thread 7fdedfd05af0> #<smob thread 7fdedfd05950> #<smob thread 7fdedfd05870> #<smob thread 7fdedfd05790> #<smob thread 7fdedfd056b0> #<smob thread 7fdedfd05570> #<smob thread 7fdedfd05490> #<smob thread 7fdedfd05350> #<smob thread 7fdedfd05150> #<smob thread 7fdedfd05070> #<smob thread 7fdedfd18f50> #<smob thread 7fdedfd18d70> #<smob thread 7fdedfd189a0> #<smob thread 7fdedfd18730> #<smob thread 7fdedfd185e0> #<smob thread 7fdedfd182b0> #<smob thread 7fdedfd18160> #<smob thread 7fdedfd21f50> #<smob thread 7fdedfd21e00> #<smob thread 7fdedfd21cb0> #<smob thread 7fdedfd217a0> #<smob thread 7fdedfd21650> #<smob thread 7fdedfd21500> #<smob thread 7fdedfd213b0> #<smob thread 7fdedfd21260> #<smob thread 7fdedfd2ff90> #<smob thread 7fdedfd2fe60> #<smob thread 7fdedfd2fcf0> #<smob thread 7fdedfd2fba0> #<smob thread 7fdedfd2fa50> #<smob thread 7fdedfd2f900> #<smob thread 7fdedfd2f650> #<smob thread 7fdedfd2f570> #<smob thread 7fdedfd2f430> #<smob thread 7fdedfd2f310> #<smob thread 7fdedfd2f1a0> #<smob thread 7fdedfd2f0b0> #<smob thread 7fdedfd4efd0> #<smob thread 7fdedfd4ee90> #<smob thread 7fdedfd4ebd0> #<smob thread 7fdedfd4e940> #<smob thread 7fdedfd4e760> #<smob thread 7fdedfd4e660> #<smob thread 7fdedfd4e270> #<smob thread 7fdedfdbefc0> #<smob thread 7fdedfdbeec0> #<smob thread 7fdedfdbede0> #<smob thread 7fdedfdbed00> #<smob thread 7fdedfdbeb00> #<smob thread 7fdedfdbe8f0> #<smob thread 7fdedfdbe810> #<smob thread 7fdedfdbe730> #<smob thread 7fdedfdbe650> #<smob thread 7fdedfdbe3f0> #<smob thread 7fdedfdbe220> #<smob thread 7fdedfdbe120> #<smob thread 7fdedfdbe020> #<smob thread 7fdedfa34610> #<smob thread 7fdedfa344d0> #<smob thread 7fdedfa34110> #<smob thread 7fdedfa52e90> #<smob thread 7fdedfa52d30> #<smob thread 7fdedfa52c40> #<smob thread 7fdedfa737e0> #<smob thread 7fdedfa73400> #<smob thread 7fdedfa73000> #<smob thread 7fdedfa88ef0> #<smob thread 7fdedfa88cc0> #<smob thread 7fdedfa888e0> #<smob thread 7fdedfa88800> #<smob thread 7fdedfa88720> #<smob thread 7fdedfa88620> #<smob thread 7fdedfa884b0> #<smob thread 7fdedfa88370> #<smob thread 7fdedfa880b0> #<smob thread 7fdedfb06fa0> #<smob thread 7fdedfb06b90> #<smob thread 7fdedfb066a0> #<smob thread 7fdedfb06470> #<smob thread 7fdedfb062e0> #<smob thread 7fdedfb06140> #<smob thread 7fdedfb58f80> #<smob thread 7fdedfb58af0> #<smob thread 7fdedfb58830> #<smob thread 7fdedfb93cd0> #<smob thread 7fdedfb93ae0> #<smob thread 7fdedfb936c0> #<smob thread 7fdedfb935c0> #<smob thread 7fdedfb93400> #<smob thread 7fdedfb932c0> #<smob thread 7fdedfd05ff0> #<smob thread 7fdedfd05c20> #<smob thread 7fdedfd05eb0>), mutex=mutex@entry=0x7fdef9c15fc0, waittime=waittime@entry=0x0) at threads.c:312
#6 0x00007fdf01eaa1cd in timed_wait (waittime=0x0, current_thread=0x7fded20cd6c0, c=0x7fdef9c10d20,
m=0x7fdef9c15fc0, kind=SCM_MUTEX_STANDARD) at threads.c:1331
#7 scm_timed_wait_condition_variable (cond=#<smob condition-variable 7fdef9c122e0>,
mutex=#<smob mutex 7fdef9c12320>, timeout=#<undefined 904>) at threads.c:1424
#8 0x00007fded169a7f3 in ?? ()
#9 0x0000000000b83688 in ?? ()
#10 0x00007fded169a750 in ?? ()
#11 0x00007fdeec009e98 in ?? ()
#12 0x00007fdf01e57ccc in scm_jit_enter_mcode (thread=0x7fded20cd6c0, mcode=0xb8369c "\034\330\003") at jit.c:6038
#13 0x00007fdf01eacf3c in vm_regular_engine (thread=0x7fded20cd6c0) at vm-engine.c:360
#14 0x00007fdf01eba5e9 in scm_call_n (proc=<optimized out>, argv=<optimized out>, nargs=0) at vm.c:1608
#15 0x00007fdf01e1ea0e in scm_call_with_unblocked_asyncs (proc=#<program 7fdef732e880>) at async.c:406
#16 0x00007fdf01ead336 in vm_regular_engine (thread=0x7fded20cd6c0) at vm-engine.c:972
#17 0x00007fdf01eba5e9 in scm_call_n (proc=<optimized out>, argv=<optimized out>, nargs=0) at vm.c:1608
#18 0x00007fdf01ea9be6 in really_launch (d=0x7fdef730c860) at threads.c:778
#19 0x00007fdf01e2085a in c_body (d=0x7fdec8c44d80) at continuations.c:430
#20 0x00007fded2874532 in ?? ()
#21 0x00007fdef9a227e0 in ?? ()
#22 0x00007fdef7d8c370 in ?? ()
#23 0x0000000000000048 in ?? ()
#24 0x00007fdf01e57ccc in scm_jit_enter_mcode (thread=0x7fded20cd6c0, mcode=0xb81874 "\034<\003") at jit.c:6038
#25 0x00007fdf01eacf3c in vm_regular_engine (thread=0x7fded20cd6c0) at vm-engine.c:360
#26 0x00007fdf01eba5e9 in scm_call_n (proc=<optimized out>, argv=<optimized out>, nargs=2) at vm.c:1608
#27 0x00007fdf01e2209a in scm_call_2 (proc=<optimized out>, arg1=<optimized out>, arg2=<optimized out>) at eval.c:503
#28 0x00007fdf01ed8752 in scm_c_with_exception_handler.constprop.0 (type=#t,
handler_data=handler_data@entry=0x7fdec8c44d10, thunk_data=thunk_data@entry=0x7fdec8c44d10,
thunk=<optimized out>, handler=<optimized out>) at exceptions.c:170
#29 0x00007fdf01eaa88f in scm_c_catch (tag=<optimized out>, body=<optimized out>, body_data=<optimized out>,
handler=<optimized out>, handler_data=<optimized out>, pre_unwind_handler=<optimized out>,
pre_unwind_handler_data=0x7fdef9b97040) at throw.c:168
#30 0x00007fdf01e22e66 in scm_i_with_continuation_barrier (pre_unwind_handler=0x7fdf01e22b80 <pre_unwind_handler>,
pre_unwind_handler_data=0x7fdef9b97040, handler_data=0x7fdec8c44d80, handler=0x7fdf01e298b0 <c_handler>,
body_data=0x7fdec8c44d80, body=0x7fdf01e20850 <c_body>) at continuations.c:368
#31 scm_c_with_continuation_barrier (func=<optimized out>, data=<optimized out>) at continuations.c:464
#32 0x00007fdf01ea9b39 in with_guile (base=0x7fdec8c44e08, data=0x7fdec8c44e30) at threads.c:645
#33 0x00007fdf01d800ba in GC_call_with_stack_base ()
from /gnu/store/2lczkxbdbzh4gk7wh91bzrqrk7h5g1dl-libgc-8.0.4/lib/libgc.so.1
#34 0x00007fdf01ea216d in scm_i_with_guile (dynamic_state=<optimized out>, data=0x7fdef730c860,
func=0x7fdf01ea9b70 <really_launch>) at threads.c:688
#35 launch_thread (d=0x7fdef730c860) at threads.c:787
#36 0x00007fdf01d56d7e in start_thread (arg=0x7fdec8c45640) at pthread_create.c:473
#37 0x00007fdf01954eff in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

There’s also one thread stuck in read(2):

Toggle snippet (38 lines)
(gdb) bt
#0 __libc_read (nbytes=1, buf=0x7fdedf97bdd0, fd=7) at ../sysdeps/unix/sysv/linux/read.c:26
#1 __libc_read (fd=7, buf=buf@entry=0x7fdedf97bdd0, nbytes=nbytes@entry=1) at ../sysdeps/unix/sysv/linux/read.c:24
#2 0x00007fdf01e382e8 in fport_read (port=<optimized out>, dst=<optimized out>, start=<optimized out>, count=1)
at fports.c:597
#3 0x00007fdf01e78d22 in scm_i_read_bytes (port=port@entry=#<port #<port-type file 7fdef9b5fb40> 7fdef59b7180>,
dst="#<vu8vector>" = {...}, start=start@entry=0, count=1) at ports.c:1566
#4 0x00007fdf01e7b1c7 in scm_fill_input (port=port@entry=#<port #<port-type file 7fdef9b5fb40> 7fdef59b7180>,
minimum_size=1, minimum_size@entry=0, cur_out=cur_out@entry=0x7fde86fcc3e8,
avail_out=avail_out@entry=0x7fde86fcc3e0) at ports.c:2693
#5 0x00007fdf01e7b434 in peek_byte_or_eof (port=port@entry=#<port #<port-type file 7fdef9b5fb40> 7fdef59b7180>,
buf_out=buf_out@entry=0x7fde86fcc448, cur_out=cur_out@entry=0x7fde86fcc440) at ports.c:1526
#6 0x00007fdf01e7c1b5 in maybe_consume_bom (port=port@entry=#<port #<port-type file 7fdef9b5fb40> 7fdef59b7180>,
bom=bom@entry=0x7fdf01ee56ad <scm_utf8_bom> "\357\273\277scm_i_read_bytes", bom_len=bom_len@entry=3)
at ports.c:2468
#7 0x00007fdf01e7b324 in port_clear_stream_start_for_bom_read (io_mode=BOM_IO_TEXT,
port=#<port #<port-type file 7fdef9b5fb40> 7fdef59b7180>) at ports.c:2504
#8 scm_fill_input (port=port@entry=#<port #<port-type file 7fdef9b5fb40> 7fdef59b7180>, minimum_size=1,
minimum_size@entry=0, cur_out=cur_out@entry=0x7fde86fcc4e8, avail_out=avail_out@entry=0x7fde86fcc4e0)
at ports.c:2648
#9 0x00007fdf01e7b434 in peek_byte_or_eof (port=port@entry=#<port #<port-type file 7fdef9b5fb40> 7fdef59b7180>,
buf_out=buf_out@entry=0x7fde86fcc548, cur_out=cur_out@entry=0x7fde86fcc540) at ports.c:1526
#10 0x00007fdf01e7b97c in peek_utf8_codepoint (port=#<port #<port-type file 7fdef9b5fb40> 7fdef59b7180>,
buf_out=buf_out@entry=0x7fde86fcc5a8, cur_out=cur_out@entry=0x7fde86fcc5a0, len_out=len_out@entry=0x7fde86fcc598)
at ports.c:1790
#11 0x00007fdf01e7bee4 in peek_codepoint (len=0x7fde86fcc598, cur=0x7fde86fcc5a0, buf=0x7fde86fcc5a8,
port=<optimized out>) at ports.c:1984
#12 scm_peek_char (port=<optimized out>) at ports.c:2202
#13 0x00007fdef997b1ab in ?? ()
#14 0x0000000000b82ae0 in ?? ()
#15 0x00007fdef997b120 in ?? ()
#16 0x00007fdec4292c48 in ?? ()
#17 0x00007fdf01e57ccc in scm_jit_enter_mcode (thread=0x7fded1366000, mcode=0xb82af0 "\034\234\003") at jit.c:6038
#18 0x00007fdf01eacf3c in vm_regular_engine (thread=0x7fded1366000) at vm-engine.c:360
#19 0x00007fdf01eba5e9 in scm_call_n (proc=<optimized out>, argv=<optimized out>, nargs=0) at vm.c:1608
#20 0x00007fdf01e1ea0e in scm_call_with_unblocked_asyncs (proc=#<program 7fdef59a0f60>) at async.c:406

A plausible scenario is a synchronization issue: ‘cuirass evaluate’ got
all the data it asked for from ‘guix repl’, but it’s still waiting on
read(2) even though there’s nothing left. (There are two child ‘guix
repl’ processes and both are waiting on read(0, …).)

Thoughts?

Ludo’.
L
L
Ludovic Courtès wrote on 16 May 2022 12:32
(name . Mathieu Othacehe)(address . othacehe@gnu.org)
87k0alahug.fsf@gnu.org
Ludovic Courtès <ludo@gnu.org> skribis:

Toggle quote (5 lines)
> A plausible scenario is a synchronization issue: ‘cuirass evaluate’ got
> all the data it asked for from ‘guix repl’, but it’s still waiting on
> read(2) even though there’s nothing left. (There are two child ‘guix
> repl’ processes and both are waiting on read(0, …).)

This reminds me of:


I’ve added a missing call to ‘close-inferior’. It’s a good idea, though
I’m not entirely convinced yet it’ll solve the problem:


Ludo’.
L
L
Ludovic Courtès wrote on 16 May 2022 14:19
(name . Mathieu Othacehe)(address . othacehe@gnu.org)
87fsl9acuw.fsf@gnu.org
Ludovic Courtès <ludo@gnu.org> skribis:

Toggle quote (5 lines)
> I’ve added a missing call to ‘close-inferior’. It’s a good idea, though
> I’m not entirely convinced yet it’ll solve the problem:
>
> https://git.savannah.gnu.org/cgit/guix/guix-cuirass.git/commit/?id=f087aaf685dbc7cc18f0254895f4a4b0dfaba631

I tested it like this and it ran to completion without hanging:

Toggle snippet (14 lines)
ludo@berlin ~$ guix build cuirass --with-branch=cuirass=master
[…]
/gnu/store/sqsf2q3qf9d485mcw6lm14abwr54na01-cuirass-git.master
ludo@berlin ~$ sudo su - cuirass -s /bin/sh -c "/gnu/store/sqsf2q3qf9d485mcw6lm14abwr54na01-cuirass-git.master/bin/cuirass evaluate dbname=cuirass 323183"
Password:
Computing Guix derivation for 'x86_64-linux'... \
2022-05-16T12:39:18 Registering builds for evaluation 323183.
2022-05-16T12:40:34 Registering builds for evaluation 323183.
GC Warning: Repeated allocation of very large block (appr. size 14385152):
May lead to memory leak and poor performance
2022-05-16T12:41:25 Registering builds for evaluation 323183.
2022-05-16T12:44:35 Registering builds for evaluation 323183.

Should we try and deploy this commit on berlin?

Maxim, how can we proceed?

Ludo’.
M
M
Maxim Cournoyer wrote on 16 May 2022 14:49
(name . Ludovic Courtès)(address . ludo@gnu.org)
87h75pk5gn.fsf@gmail.com
Hi Ludo and Mathieu!

Ludovic Courtès <ludo@gnu.org> writes:

Toggle quote (22 lines)
> Hi!
>
> Mathieu Othacehe <othacehe@gnu.org> skribis:
>
>> I noticed this error: https://ci.guix.gnu.org/eval/320682/log/raw.
>>
>> The following lines can be caused by having a multi-threaded program:
>> (cuirass scripts evaluate) because of its par-for-each procedure, call
>> "primitive-fork" in open-bidirectional-pipe of (guix inferior).
>>
>> warning: call to primitive-fork while multiple threads are running;
>> further behavior unspecified. See "Processes" in the
>> manual, for more information.
>>
>> Not sure that it could cause the observed hang, Ludo what do you think?
>
> As you write, the warning comes from ‘open-bidirectional-pipe’, added in
> commit bd86bbd300474204878e927f6cd3f0defa1662a5. I noticed it back then
> but as far as I could see, it was harmless (but definitely suboptimal):
>
> https://issues.guix.gnu.org/48007#11

Thanks for investigating it!

I'm not yet very knowledgeable about POSIX threads and shortcomings, but
this looks like a dangerous trap to have lying around :-). Is there an
alternative implementation we could come up with that would avoid the
potential problem (and warnings) entirely?

Otherwise, if we are sure there are no practical problems in a
multi-threaded scenario, I think a comment explaining the reasoning in
the code (of why it's considered safe despite the warnings) and a test
exercising the behavior would be in order to ensure it continues working
as intended in the face of changes.

Thoughts?

Thanks again,

Maxim
M
M
Maxim Cournoyer wrote on 16 May 2022 19:32
(name . Ludovic Courtès)(address . ludo@gnu.org)
878rr1jsd1.fsf@gmail.com
Hello,

Ludovic Courtès <ludo@gnu.org> writes:

Toggle quote (26 lines)
> Ludovic Courtès <ludo@gnu.org> skribis:
>
>> I’ve added a missing call to ‘close-inferior’. It’s a good idea, though
>> I’m not entirely convinced yet it’ll solve the problem:
>>
>> https://git.savannah.gnu.org/cgit/guix/guix-cuirass.git/commit/?id=f087aaf685dbc7cc18f0254895f4a4b0dfaba631
>
> I tested it like this and it ran to completion without hanging:
>
> ludo@berlin ~$ guix build cuirass --with-branch=cuirass=master
> […]
> /gnu/store/sqsf2q3qf9d485mcw6lm14abwr54na01-cuirass-git.master
> ludo@berlin ~$ sudo su - cuirass -s /bin/sh -c "/gnu/store/sqsf2q3qf9d485mcw6lm14abwr54na01-cuirass-git.master/bin/cuirass evaluate dbname=cuirass 323183"
> Password:
> Computing Guix derivation for 'x86_64-linux'... \
> 2022-05-16T12:39:18 Registering builds for evaluation 323183.
> 2022-05-16T12:40:34 Registering builds for evaluation 323183.
> GC Warning: Repeated allocation of very large block (appr. size 14385152):
> May lead to memory leak and poor performance
> 2022-05-16T12:41:25 Registering builds for evaluation 323183.
> 2022-05-16T12:44:35 Registering builds for evaluation 323183.
>
> Should we try and deploy this commit on berlin?
>
> Maxim, how can we proceed?

Berlin was reconfigured with this commit of Cuirass, and is now running
the derivations with it, but so far still "In progress..." after more
than 100 minutes [0]


Looking in /var/log/cuirass.log, I can see:

Toggle snippet (16 lines)
2022-05-16 15:57:41 2022-05-16T15:57:41 next evaluation in 300 seconds
2022-05-16 15:59:02 Uncaught exception in fiber ##f:
2022-05-16 15:59:02 In cuirass/base.scm:
2022-05-16 15:59:02 726:13 3 (_)
2022-05-16 15:59:02 In ice-9/boot-9.scm:
2022-05-16 15:59:02 1752:10 2 (with-exception-handler _ _ #:unwind? _ # _)
2022-05-16 15:59:02 1685:16 1 (raise-exception _ #:continuable? _)
2022-05-16 15:59:02 1683:16 0 (raise-exception _ #:continuable? _)
2022-05-16 15:59:02 ice-9/boot-9.scm:1683:16: In procedure raise-exception:
2022-05-16 15:59:02 ERROR:
2022-05-16 15:59:02 1. &evaluation-error:
2022-05-16 15:59:02 name: "core"
2022-05-16 15:59:02 id: 325375
2022-05-16 16:02:41 2022-05-16T16:02:41 Fetching channels for spec 'core'.

But it seems it should only affect another job specification ('core') ?

Before, there was also:

Toggle snippet (14 lines)
2022-05-16 15:41:34 Uncaught exception in fiber ##f:
2022-05-16 15:41:34 In cuirass/base.scm:
2022-05-16 15:41:34 726:13 3 (_)
2022-05-16 15:41:34 In ice-9/boot-9.scm:
2022-05-16 15:41:34 1752:10 2 (with-exception-handler _ _ #:unwind? _ # _)
2022-05-16 15:41:34 1685:16 1 (raise-exception _ #:continuable? _)
2022-05-16 15:41:34 1683:16 0 (raise-exception _ #:continuable? _)
2022-05-16 15:41:34 ice-9/boot-9.scm:1683:16: In procedure raise-exception:
2022-05-16 15:41:34 ERROR:
2022-05-16 15:41:34 1. &evaluation-error:
2022-05-16 15:41:34 name: "purge-python2-packages"
2022-05-16 15:41:34 id: 325332

and

Toggle snippet (14 lines)
2022-05-16 15:05:27 Uncaught exception in fiber ##f:
2022-05-16 15:05:27 In cuirass/base.scm:
2022-05-16 15:05:27 726:13 3 (_)
2022-05-16 15:05:27 In ice-9/boot-9.scm:
2022-05-16 15:05:27 1752:10 2 (with-exception-handler _ _ #:unwind? _ # _)
2022-05-16 15:05:27 1685:16 1 (raise-exception _ #:continuable? _)
2022-05-16 15:05:27 1683:16 0 (raise-exception _ #:continuable? _)
2022-05-16 15:05:27 ice-9/boot-9.scm:1683:16: In procedure raise-exception:
2022-05-16 15:05:27 ERROR:
2022-05-16 15:05:27 1. &evaluation-error:
2022-05-16 15:05:27 name: "master"
2022-05-16 15:05:27 id: 324938

and

Toggle snippet (14 lines)
2022-05-16 13:02:07 Uncaught exception in fiber ##f:
2022-05-16 13:02:07 In cuirass/base.scm:
2022-05-16 13:02:07 726:13 3 (_)
2022-05-16 13:02:07 In ice-9/boot-9.scm:
2022-05-16 13:02:07 1752:10 2 (with-exception-handler _ _ #:unwind? _ # _)
2022-05-16 13:02:07 1685:16 1 (raise-exception _ #:continuable? _)
2022-05-16 13:02:07 1683:16 0 (raise-exception _ #:continuable? _)
2022-05-16 13:02:07 ice-9/boot-9.scm:1683:16: In procedure raise-exception:
2022-05-16 13:02:07 ERROR:
2022-05-16 13:02:07 1. &evaluation-error:
2022-05-16 13:02:07 name: "guix"
2022-05-16 13:02:07 id: 324937

I don't know if these are related or not; probably not, as their
timestamps are older by more than 3 hours, while the last derivations
were started less than 1h30 ago.

Thanks,

Maxim
L
L
Ludovic Courtès wrote on 17 May 2022 09:45
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)
87fsl87gb5.fsf@gnu.org
Hi,

Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

Toggle quote (4 lines)
> Berlin was reconfigured with this commit of Cuirass, and is now running
> the derivations with it, but so far still "In progress..." after more
> than 100 minutes [0]

Yeah, I’m not sure about the backtrace you report, however there’s again
a bunch of ‘cuirass evaluate’ processes hanging, this time with the main
thread stuck on ‘waitpid’ (presumably from the ‘close-inferior’ call):

Toggle snippet (14 lines)
#0 0x00007f0886310f27 in __GI___wait4 (pid=86099, stat_loc=stat_loc@entry=0x7ffea0cb849c, options=options@entry=0, usage=usage@entry=0x0)
at ../sysdeps/unix/sysv/linux/wait4.c:30
#1 0x00007f0886310ea7 in __GI___waitpid (pid=<optimized out>, stat_loc=stat_loc@entry=0x7ffea0cb849c, options=options@entry=0)
at waitpid.c:38
#2 0x00007f08868ae25e in scm_waitpid (pid=86099, options=<optimized out>) at posix.c:727
#3 0x00007f088689a336 in vm_regular_engine (thread=0x7f08861fad80) at vm-engine.c:972
#4 0x00007f08868a75e9 in scm_call_n (proc=<optimized out>, argv=<optimized out>, nargs=1) at vm.c:1608
#5 0x00007f088680f457 in scm_primitive_eval (exp=<optimized out>,
exp@entry=((@ (ice-9 control) %) (begin ((@@ (ice-9 command-line) load/lang) "/gnu/store/z8haznhwck4bjm4gxqy25wvwv4041wvx-cuirass-1.1.0-12.f087aaf/bin/.cuirass-real") (main (command-line)) (quit)))) at eval.c:671
#6 0x00007f08868154b6 in scm_eval (
exp=((@ (ice-9 control) %) (begin ((@@ (ice-9 command-line) load/lang) "/gnu/store/z8haznhwck4bjm4gxqy25wvwv4041wvx-cuirass-1.1.0-12.f087aaf/bin/.cuirass-real") (main (command-line)) (quit))), module_or_state="#<struct module>" = {...}) at eval.c:705
#7 0x00007f08868793b6 in scm_shell (argc=9, argv=0x7ffea0cb8c98) at script.c:357

Process 86099 (the one it’s waiting for) is indeed ‘guix repl’ and it’s
waiting for input in read(0, …).

There’s a second thread stuck in ‘waitpid’:

Toggle snippet (18 lines)
(gdb) thread 27
[Switching to thread 27 (LWP 86002)]
#0 0x00007f0886310f27 in __GI___wait4 (pid=86100, stat_loc=stat_loc@entry=0x7f085393c60c, options=options@entry=0, usage=usage@entry=0x0)
at ../sysdeps/unix/sysv/linux/wait4.c:30
30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
(gdb) bt
#0 0x00007f0886310f27 in __GI___wait4 (pid=86100, stat_loc=stat_loc@entry=0x7f085393c60c, options=options@entry=0, usage=usage@entry=0x0)
at ../sysdeps/unix/sysv/linux/wait4.c:30
#1 0x00007f0886310ea7 in __GI___waitpid (pid=<optimized out>, stat_loc=stat_loc@entry=0x7f085393c60c, options=options@entry=0)
at waitpid.c:38
#2 0x00007f08868ae25e in scm_waitpid (pid=86100, options=<optimized out>) at posix.c:727
#3 0x00007f088689a336 in vm_regular_engine (thread=0x7f087d54e240) at vm-engine.c:972
#4 0x00007f08868a75e9 in scm_call_n (proc=<optimized out>, argv=<optimized out>, nargs=0) at vm.c:1608
#5 0x00007f088680ba0e in scm_call_with_unblocked_asyncs (proc=#<program 7f0854a50f40>) at async.c:406
#6 0x00007f088689a336 in vm_regular_engine (thread=0x7f087d54e240) at vm-engine.c:972
#7 0x00007f08868a75e9 in scm_call_n (proc=<optimized out>, argv=<optimized out>, nargs=0) at vm.c:1608

and then a couple of threads in ‘read’:

Toggle snippet (19 lines)
#0 __libc_read (nbytes=1, buf=0x7f085a69ab90, fd=5) at ../sysdeps/unix/sysv/linux/read.c:26
#1 __libc_read (fd=5, buf=buf@entry=0x7f085a69ab90, nbytes=nbytes@entry=1) at ../sysdeps/unix/sysv/linux/read.c:24
#2 0x00007f08868252e8 in fport_read (port=<optimized out>, dst=<optimized out>, start=<optimized out>, count=1) at fports.c:597
#3 0x00007f0886865d22 in scm_i_read_bytes (port=port@entry=#<port #<port-type file 7f087e821b40> 7f0854af6d20>, dst="#<vu8vector>" = {...},
start=start@entry=0, count=1) at ports.c:1566
#4 0x00007f08868681c7 in scm_fill_input (port=port@entry=#<port #<port-type file 7f087e821b40> 7f0854af6d20>,
minimum_size=minimum_size@entry=1, cur_out=cur_out@entry=0x7f085313b5e0, avail_out=avail_out@entry=0x7f085313b588) at ports.c:2693
#5 0x00007f0886868d5c in peek_iconv_codepoint (port=#<port #<port-type file 7f087e821b40> 7f0854af6d20>, buf=buf@entry=0x7f085313b5e8,
cur=cur@entry=0x7f085313b5e0, len=len@entry=0x7f085313b5d8) at ports.c:1944
#6 0x00007f0886868e4a in peek_codepoint (len=0x7f085313b5d8, cur=0x7f085313b5e0, buf=0x7f085313b5e8, port=<optimized out>) at ports.c:1988
#7 scm_peek_char (port=<optimized out>) at ports.c:2202
#8 0x00007f087e62582b in ?? ()
#9 0x0000000000857760 in ?? ()
#10 0x00007f087e6257a0 in ?? ()
#11 0x00007f087d850c48 in ?? ()
#12 0x00007f0886844ccc in scm_jit_enter_mcode (thread=0x7f087d54e000, mcode=0x857770 "\034\234\003") at jit.c:6038
#13 0x00007f0886899f3c in vm_regular_engine (thread=0x7f087d54e000) at vm-engine.c:360

Normally we call ‘waitpid’ once the pipe has been closed:

Toggle snippet (12 lines)
(define* (open-inferior directory
#:key (command "bin/guix")
(error-port (%make-void-port "w")))
"Open the inferior Guix in DIRECTORY, running 'DIRECTORY/COMMAND repl' or
equivalent. Return #f if the inferior could not be launched."
(let ((pipe pid (inferior-pipe directory command error-port)))
(port->inferior pipe
(lambda (port)
(close-port port)
(waitpid pid))))) ;<----- here

… and when the pipe is closed, the child ‘guix repl’ process gets EOF
and exits.

So I’m not sure why the ‘guix repl’ process would stick around.

Thoughts?

Ludo’.
L
L
Ludovic Courtès wrote on 17 May 2022 14:52
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)
87pmkc5nk9.fsf@gnu.org
Hi,

Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

Toggle quote (5 lines)
> I'm not yet very knowledgeable about POSIX threads and shortcomings, but
> this looks like a dangerous trap to have lying around :-). Is there an
> alternative implementation we could come up with that would avoid the
> potential problem (and warnings) entirely?

It *is* a dangerous trap, mitigated by the fact that our code execs soon
after fork, but that’s not great (“A fork() on the Road” explains the
many shortcomings of ‘fork’, including how it interacts with
multi-threaded code.)

As discussed on IRC yesterday, the longer-term approach is to augment
Guile is a posix_spawn-like primitive implemented in C. If it’s in C,
we can make it do the right thing safely, as is done with ‘open-pipe’.

Thanks,
Ludo’.
L
L
Ludovic Courtès wrote on 18 May 2022 09:44
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)
874k1n5loz.fsf@gnu.org
Hi,

Ludovic Courtès <ludo@gnu.org> skribis:

Toggle quote (12 lines)
> Hi,
>
> Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:
>
>> Berlin was reconfigured with this commit of Cuirass, and is now running
>> the derivations with it, but so far still "In progress..." after more
>> than 100 minutes [0]
>
> Yeah, I’m not sure about the backtrace you report, however there’s again
> a bunch of ‘cuirass evaluate’ processes hanging, this time with the main
> thread stuck on ‘waitpid’ (presumably from the ‘close-inferior’ call):

An update: I ran this by hand:

Toggle snippet (20 lines)
$ sudo su - cuirass -s /bin/sh -c "/gnu/store/z8haznhwck4bjm4gxqy25wvwv4041wvx-cuirass-1.1.0-12.f087aaf/bin/cuirass evaluate dbname=cuirass 326988"
Computing Guix derivation for 'x86_64-linux'... -^-
warning: call to primitive-fork while multiple threads are running;
further behavior unspecified. See "Processes" in the
manual, for more information.
warning: call to primitive-fork while multiple threads are running;
further behavior unspecified. See "Processes" in the
manual, for more information.
warning: call to primitive-fork while multiple threads are running;
furthwarning: call to primitive-fork while multiple threads are running;
further behavior unspecified. See "Processes" in the
manual, for more information.
warning: call to primitive-fork while multiple threads are running;
further behavior unspecified. See "Processes" in the
manual, for more information.
warning: call to primitive-fork while multiple threads are running;
further behavior unspecified. See "Processes" in the
manual, for more information.

… and attached ‘strace’, which shows things are working as intended:

Toggle snippet (12 lines)
read(228, "r6x878i3fyr12lf9v496cl9x5jh6-gzip-1.10.drv\" \"/gnu/store/6280r7vqfp5qkv6cdib02203xw1sk7h3-binutils-cr"..., 4096) = 4096
read(228, "pper-0.drv\" \"/gnu/store/45wsr6x878i3fyr12lf9v496cl9x5jh6-gzip-1.10.drv\" \"/gnu/store/6280r7vqfp5qkv6c"..., 4096) = 4096
read(228, "8qy2pnpn2030xxfmm56h71-gettext-0.21-doc\") (\"out\" . \"/gnu/store/gmrqdvy9v42hfbs6b7vaxfrwf8b5qhqy-gett"..., 4096) = 4096
read(228, "\" \"/gnu/store/k5889v7ms3f5x1rjr3php71k4743fn19-bash-minimal-5.1.8.drv\" \"/gnu/store/lbyw1h3xz33qnb08j"..., 4096) = 4096
read(228, "x2g7ysm16xqnrpbkqdc4hhni1r42dgg-patch-2.7.6.drv\" \"/gnu/store/3fy0f7gy85ddy6rpa4mmhjygns8qzk03-findut"..., 4096) = 4096
read(228, "08s1nz9bpv6k6a56idv6l7r2zjqphszl-file-5.39.drv\" \"/gnu/store/1x2g7ysm16xqnrpbkqdc4hhni1r42dgg-patch-2"..., 4096) = 4096
read(228, "tore/azdhavgwpjc527n81iqwrf91v8sv3x23-help2man-1.48.5.drv\" \"/gnu/store/b1pi5ssin56s4mqq6964hsh2bdkyd"..., 4096) = 4096
read(228, ") (#:nix-name . \"guix-1.3.0-25.c1719a0\") (#:system . \"x86_64-linux\") (#:max-silent-time . 3600) (#:t"..., 4096) = 121
close(228) = 0
wait4(83882, # hangs forever

The question to me becomes: why is ‘guix repl’ not getting EOF from its
read(0, …) call?

FWIW, here’s an estimate of the amount of data transferred from the
inferior to the main process:

Toggle snippet (4 lines)
$ grep 'read(228, .* = 4096$' <log.evaluate|wc -l
25807

I’m trying to reproduce that synthetically.

Ludo’.
L
L
Ludovic Courtès wrote on 20 May 2022 20:32
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)
87r14ovyud.fsf@gnu.org
Hi!

Ludovic Courtès <ludo@gnu.org> skribis:

Toggle quote (6 lines)
> wait4(83882, # hangs forever
>
>
> The question to me becomes: why is ‘guix repl’ not getting EOF from its
> read(0, …) call?

The answer was that another child process (an inferior) was holding a
copy of the socketpair file descriptor connected to that ‘guix repl’
standard input.

Fixed in Guix commit a4994d739306abcf3f36706012fb88b35a970e6b with a
test that reproduces the issue.

Commit d02b7abe24fac84ef1fb1880f51d56fc9fb6cfef updates the ‘guix’
package so we should be able to reconfigure berlin now and hopefully
(crossing fingers!) be done with it.

Ludo’.
L
L
Ludovic Courtès wrote on 24 May 2022 23:02
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)
87k0aaiqzu.fsf@gnu.org
Hi!

Ludovic Courtès <ludo@gnu.org> skribis:

Toggle quote (7 lines)
> Fixed in Guix commit a4994d739306abcf3f36706012fb88b35a970e6b with a
> test that reproduces the issue.
>
> Commit d02b7abe24fac84ef1fb1880f51d56fc9fb6cfef updates the ‘guix’
> package so we should be able to reconfigure berlin now and hopefully
> (crossing fingers!) be done with it.

An update: Cuirass is now up-to-date on berlin.guix, built from Guix
commit adf5ae5a412ed13302186dd4ce8e2df783d4515d.

Unfortunately, while evaluations now run to completion, child processes
of ‘cuirass evaluate’ stick around at the end:

Toggle snippet (46 lines)
(gdb) bt
#0 futex_wait (private=0, expected=2, futex_word=0x7f5b1d054f08) at ../sysdeps/nptl/futex-internal.h:146
#1 __lll_lock_wait (futex=futex@entry=0x7f5b1d054f08, private=0) at lowlevellock.c:52
#2 0x00007f5b1d873ef3 in __GI___pthread_mutex_lock (mutex=mutex@entry=0x7f5b1d054f08) at ../nptl/pthread_mutex_lock.c:80
#3 0x00007f5b1d995303 in scm_c_weak_set_remove_x (pred=<optimized out>, closure=0x7f5b13dd8d00, raw_hash=1824276156261873434, set=#<weak-set 7f5b156772f0>) at weak-set.c:794
#4 scm_weak_set_remove_x (obj=#<port #<port-type file 7f5b1567ab40> 7f5b13dd8d00>, set=#<weak-set 7f5b156772f0>) at weak-set.c:817
#5 close_port (explicit=<optimized out>, port=#<port #<port-type file 7f5b1567ab40> 7f5b13dd8d00>) at ports.c:891
#6 close_port (port=#<port #<port-type file 7f5b1567ab40> 7f5b13dd8d00>, explicit=<optimized out>) at ports.c:874
#7 0x00007f5af3a7df82 in ?? ()
#8 0x0000000000dbd860 in ?? ()
#9 0x00007f5af3a7df60 in ?? ()
#10 0x0000000000db82b8 in ?? ()
#11 0x00007f5b1d972ccc in scm_jit_enter_mcode (thread=0x7f5b157bf240, mcode=0xdbd86c "\034\217\003") at jit.c:6038
#12 0x00007f5b1d9c7f3c in vm_regular_engine (thread=0x7f5b157bf240) at vm-engine.c:360
#13 0x00007f5b1d9d55e9 in scm_call_n (proc=<optimized out>, argv=<optimized out>, nargs=0) at vm.c:1608
#14 0x00007f5b1d939a0e in scm_call_with_unblocked_asyncs (proc=#<program 7f5aebcd7f40>) at async.c:406
#15 0x00007f5b1d9c8336 in vm_regular_engine (thread=0x7f5b157bf240) at vm-engine.c:972
#16 0x00007f5b1d9d55e9 in scm_call_n (proc=<optimized out>, argv=<optimized out>, nargs=0) at vm.c:1608
#17 0x00007f5b1d9c4be6 in really_launch (d=0x7f5aebccac80) at threads.c:778
#18 0x00007f5b1d93b85a in c_body (d=0x7f5aea691d80) at continuations.c:430
#19 0x00007f5aeeb118c2 in ?? ()
#20 0x00007f5b1553d7e0 in ?? ()
#21 0x00007f5b138a7370 in ?? ()
#22 0x0000000000000048 in ?? ()
#23 0x00007f5b1d972ccc in scm_jit_enter_mcode (thread=0x7f5b157bf240, mcode=0xdbc874 "\034<\003") at jit.c:6038
#24 0x00007f5b1d9c7f3c in vm_regular_engine (thread=0x7f5b157bf240) at vm-engine.c:360
#25 0x00007f5b1d9d55e9 in scm_call_n (proc=<optimized out>, argv=<optimized out>, nargs=2) at vm.c:1608
#26 0x00007f5b1d93d09a in scm_call_2 (proc=<optimized out>, arg1=<optimized out>, arg2=<optimized out>) at eval.c:503
#27 0x00007f5b1d9f3752 in scm_c_with_exception_handler.constprop.0 (type=#t, handler_data=handler_data@entry=0x7f5aea691d10, thunk_data=thunk_data@entry=0x7f5aea691d10,
thunk=<optimized out>, handler=<optimized out>) at exceptions.c:170
#28 0x00007f5b1d9c588f in scm_c_catch (tag=<optimized out>, body=<optimized out>, body_data=<optimized out>, handler=<optimized out>, handler_data=<optimized out>,
pre_unwind_handler=<optimized out>, pre_unwind_handler_data=0x7f5b156b2040) at throw.c:168
#29 0x00007f5b1d93de66 in scm_i_with_continuation_barrier (pre_unwind_handler=0x7f5b1d93db80 <pre_unwind_handler>, pre_unwind_handler_data=0x7f5b156b2040, handler_data=0x7f5aea691d80,
handler=0x7f5b1d9448b0 <c_handler>, body_data=0x7f5aea691d80, body=0x7f5b1d93b850 <c_body>) at continuations.c:368
#30 scm_c_with_continuation_barrier (func=<optimized out>, data=<optimized out>) at continuations.c:464
#31 0x00007f5b1d9c4b39 in with_guile (base=0x7f5aea691e08, data=0x7f5aea691e30) at threads.c:645
#32 0x00007f5b1d89b0ba in GC_call_with_stack_base () from /gnu/store/2lczkxbdbzh4gk7wh91bzrqrk7h5g1dl-libgc-8.0.4/lib/libgc.so.1
#33 0x00007f5b1d9bd16d in scm_i_with_guile (dynamic_state=<optimized out>, data=0x7f5aebccac80, func=0x7f5b1d9c4b70 <really_launch>) at threads.c:688
#34 launch_thread (d=0x7f5aebccac80) at threads.c:787
#35 0x00007f5b1d871d7e in start_thread (arg=0x7f5aea692640) at pthread_create.c:473
#36 0x00007f5b1d46feff in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) info threads
Id Target Id Frame
* 1 process 53801 "guile" futex_wait (private=0, expected=2, futex_word=0x7f5b1d054f08) at ../sysdeps/nptl/futex-internal.h:146

Notice there’s a single thread: it very much looks like the random
results one gets when forking a multithreaded process (in this case,
this one thread is a finalization thread, except it’s running in a
process that doesn’t actually have the other Guile threads). The
fork+threads problem is already manifesting, after all.

I’ll try and come up with a solution to that, if nobody beats me at it.
What’s annoying is that it’s not easy to test: the problem doesn’t
manifest on my 4-core laptop, but it does on the 96-core berlin.

To be continued…

Ludo’.
L
L
Ludovic Courtès wrote on 25 May 2022 18:32
(name . Mathieu Othacehe)(address . othacehe@gnu.org)
87bkvlh8u2.fsf@gnu.org
Ludovic Courtès <ludo@gnu.org> skribis:

Toggle quote (6 lines)
> Notice there’s a single thread: it very much looks like the random
> results one gets when forking a multithreaded process (in this case,
> this one thread is a finalization thread, except it’s running in a
> process that doesn’t actually have the other Guile threads). The
> fork+threads problem is already manifesting, after all.

An update: The ‘with-database’ used by (cuirass scripts evaluate),
creates threads unconditionally and leaves them around when its body has
exited. So there’s currently no way in (cuirass scripts evaluate) to
arrange so that ‘open-inferior’ is called in a single-threaded context
(I tried that in ‘wip-bug-55441’).

Bummer. Looks like we’re left with unpleasant solutions, such as
writing bindings for ‘posix_spawn’ and using that in (guix inferior).

Mathieu, any ideas?

Ludo’.
L
L
Ludovic Courtès wrote on 25 May 2022 20:21
(name . Mathieu Othacehe)(address . othacehe@gnu.org)
875ylth3rq.fsf@gnu.org
For now, I’m going to go with the solution below, which is to use an
older revision of Guix (one where ‘open-inferior’ was using
‘open-pipe*’) as the dependency of the ‘cuirass’ package.

I’m running “cuirass evaluate” manually on berlin to make sure it
actually works. If everything goes well, I’ll push it and reconfigure
berlin later today or tomorrow.

Ludo’.
Toggle diff (55 lines)
diff --git a/gnu/packages/ci.scm b/gnu/packages/ci.scm
index ab421fe870..9cd2f6884e 100644
--- a/gnu/packages/ci.scm
+++ b/gnu/packages/ci.scm
@@ -164,7 +164,14 @@ (define-public cuirass
;; FIXME: this is propagated by "guile-git", but it needs to be among
;; the inputs to add it to GUILE_LOAD_PATH.
guile-bytestructures
- guix))
+
+ ;; FIXME: The 'cuirass evaluate' command is multithreaded, but it
+ ;; uses 'open-inferior', which calls 'primitive-fork', thus
+ ;; potentially creating child processes that deadlock. To work
+ ;; around that, use the last revision of Guix where
+ ;; 'open-inferior' was using 'open-pipe*' rather than
+ ;; 'primitive-fork'. See <https://issues.guix.gnu.org/55441>.
+ guix-for-cuirass))
(native-inputs
(list autoconf automake pkg-config texinfo ephemeralpg))
(native-search-paths
diff --git a/gnu/packages/package-management.scm b/gnu/packages/package-management.scm
index 619458a0d3..16ad4fe9db 100644
--- a/gnu/packages/package-management.scm
+++ b/gnu/packages/package-management.scm
@@ -489,6 +489,30 @@ (define code
(license license:gpl3+)
(properties '((ftp-server . "alpha.gnu.org"))))))
+(define-public guix-for-cuirass
+ ;; Known-good revision before commit
+ ;; bd86bbd300474204878e927f6cd3f0defa1662a5, which introduced
+ ;; 'primitive-fork' in 'open-inferior'.
+ (let ((version "1.3.0")
+ (commit "a27e47f9d1e22dc32bb250cfeef88cfacb930e23")
+ (revision 23))
+ (package
+ (inherit guix)
+ (version (string-append version "-"
+ (number->string revision)
+ "." (string-take commit 7)))
+ (source (origin
+ (method git-fetch)
+ (uri (git-reference
+ (url "https://git.savannah.gnu.org/git/guix.git")
+ (commit commit)))
+ (sha256
+ (base32
+ "12jmvagbw05hmmlrb82i0qazhlv7mcfnl4dmknwx3a9hd760g9y1"))
+ (file-name (string-append "guix-" version "-checkout"))))
+ (properties `((hidden? . #t)
+ ,@(package-properties guix))))))
+
(define-public guix-daemon
;; This package is for internal consumption: it allows us to quickly build
;; the 'guix-daemon' program and use that in (guix self), used by 'guix
C
C
Christopher Baines wrote on 26 May 2022 01:13
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 55441@debbugs.gnu.org)
87h75dtcaa.fsf@cbaines.net
Ludovic Courtès <ludo@gnu.org> writes:

Toggle quote (8 lines)
> For now, I’m going to go with the solution below, which is to use an
> older revision of Guix (one where ‘open-inferior’ was using
> ‘open-pipe*’) as the dependency of the ‘cuirass’ package.
>
> I’m running “cuirass evaluate” manually on berlin to make sure it
> actually works. If everything goes well, I’ll push it and reconfigure
> berlin later today or tomorrow.

To put in an email something I put on IRC earlier.

Maybe the store connection caching could be optional when calling
inferior-eval-with-store, and that could also switch between using
open-pipe* and primitive-fork for starting the inferior process.

I'm guessing the use of primitive-fork for starting the inferior process
is causing problems with Cuirass in some cases, and it's possible that
it'll affect the data service in a similar way as well.

I don't think the connection caching actually benefits Cuirass though,
since it only calls inferior-eval-with-store once per inferior.

Additionally, on the data service side, the caching functionality is
actually undesirable as it leads to the inferior process running out of
memory, so currently the cache is manually cleared in various places
[1].

-----BEGIN PGP SIGNATURE-----

iQKlBAEBCgCPFiEEPonu50WOcg2XVOCyXiijOwuE9XcFAmKOvY1fFIAAAAAALgAo
aXNzdWVyLWZwckBub3RhdGlvbnMub3BlbnBncC5maWZ0aGhvcnNlbWFuLm5ldDNF
ODlFRUU3NDU4RTcyMEQ5NzU0RTBCMjVFMjhBMzNCMEI4NEY1NzcRHG1haWxAY2Jh
aW5lcy5uZXQACgkQXiijOwuE9XciYQ//WfemT9xfTUCL0aKz8Ew73mNfhH0gmNfd
tCZZKBOZPoVNgEXPpt3W5JEsr8r+/mhF5ymFn+N++xetwCkRDlCCRepp96oXs7RX
wqpTfCIX3VG9dpUbUgb3OGkuzPvjkzx6OQ/A7o8jPKgKv1nu4M4JD78le7ZjMeQE
7S8YbSZRtSsAQaYoQfWAAflytqyRq5NeH/QopqC72sxOqDOtdb/6ZaRC1f8sIx8D
AdhEHjQGpMinUAMbu0T14Qvwle6n/AkxXRMSp0D9bs2pXhqw3X5tz8OM+d07K/dl
BGqjQ82ZY+DAnEq3y/GsvffV3yVha8YuI3qhsDAY8xEASS5l9iYgsYSmOA1+GLhK
NBdhL5q9SXX6MEyEu8C8FSyDOBN4SAXMmsPVfjeOJYwBQJbg8XWAuBpGwexz1cD0
g5Ke6BduC/Wl9c0VK6eJv3rsEXSAFr9C0qRCVmkdqKj6Ij1kLKOIUjsGsiCr0esw
mSYsx2Bi4hN3Ssku0dNJvztMG3Izq7FywJp8gqQodUEaK1vNxL71fWP5WTNR6cjD
ksEJcQ/4KXtPa4fP9pAzNIs6ay17ESj87T7JW6CHFBro8/FUkBbHXxRgo7QdWktG
aqASbbbVkcvDmuNdUuVmJbFBOfMSkUbEdAjVA0dWDLevPeYbt3N2SfA11M0MnEak
qR2SUYz1ptU=
=xr7r
-----END PGP SIGNATURE-----

L
L
Ludovic Courtès wrote on 26 May 2022 11:44
(name . Christopher Baines)(address . mail@cbaines.net)(address . 55441@debbugs.gnu.org)
871qwghbme.fsf@gnu.org
Hi,

Christopher Baines <mail@cbaines.net> skribis:

Toggle quote (4 lines)
> Maybe the store connection caching could be optional when calling
> inferior-eval-with-store, and that could also switch between using
> open-pipe* and primitive-fork for starting the inferior process.

I don’t think that’s desirable as an interface. To me, the proper fix
will be to use an interface like ‘posix_spawn’ so we can have both
correctness and speed, rather than let users choose.

Ludo’.
L
L
Ludovic Courtès wrote on 26 May 2022 14:29
(name . Mathieu Othacehe)(address . othacehe@gnu.org)
87sfowfpf2.fsf@gnu.org
Ludovic Courtès <ludo@gnu.org> skribis:

Toggle quote (8 lines)
> For now, I’m going to go with the solution below, which is to use an
> older revision of Guix (one where ‘open-inferior’ was using
> ‘open-pipe*’) as the dependency of the ‘cuirass’ package.
>
> I’m running “cuirass evaluate” manually on berlin to make sure it
> actually works. If everything goes well, I’ll push it and reconfigure
> berlin later today or tomorrow.

Pushed commit as dcb7ce1eb6911f9d503e7cd2bfe380058cee956b.

I reconfigured berlin a couple of hours ago:

Toggle snippet (22 lines)
ludo@berlin ~/src/maintenance/hydra$ guix system describe
Generation 576 May 26 2022 12:50:31 (current)
file name: /var/guix/profiles/system-576-link
canonical file name: /gnu/store/dd4p4cw67p5wbwwv7szgz0zxsvkq78zb-system
label: GNU with Linux-Libre 5.17.9
bootloader: grub
root device: label: "my-root"
kernel: /gnu/store/si7fz3pxwq5aiqpdwkw0qs4jmsqh89n5-linux-libre-5.17.9/bzImage
channels:
guix:
repository URL: https://git.savannah.gnu.org/git/guix.git
branch: master
commit: 2262ee2c63fd92a17836ff7450623db76ed39d92
configuration file: /gnu/store/sd59z8h741jilaa3cdf16441f84xghil-configuration.scm
ludo@berlin ~/src/maintenance/hydra$ git log |head -5
commit 07f587cb266f284dee6ac2e744161973d3512b04
Author: Ludovic Courtès <ludo@gnu.org>
Date: Tue May 24 14:57:36 2022 +0200

hydra: Add missing service type 'description' field.

There’s been a couple of evaluations (346596 and 346651 in the ‘master’
jobset) and so far things seem to work smoothly.

Let’s keep monitoring.

Ludo’.
J
J
Josselin Poiret wrote on 26 May 2022 22:49
(address . 55441@debbugs.gnu.org)
87sfow10lc.fsf@jpoiret.xyz
Hello everyone,

With all the recent platform changes, I was looking at `make
cuirass-jobs` for the first time and it left some deadlocked inferiors
as well, so I figured I could try my own solution, since I've already
dabbled in Guile POSIX internals! Here's a patch that rewrites
open-bidirectional-pipe with a more thread-safe alternative,
`piped-process`, but this also requires some Guile patches, that I will
post as well. With these, `make cuirass-jobs` doesn't hang at all
anymore for me.

Let me know what you think of these

Best,
--
Josselin Poiret
J
J
Josselin Poiret wrote on 26 May 2022 22:50
[PATCH] guix: inferior: Make open-bidirectional-pipe use piped-process.
20220526205016.23899-1-dev@jpoiret.xyz
* guix/inferior.scm: Load up scm_init_popen extension to be able to
use piped-process.
(open-bidirectional-pipe): Remove custom fork+exec code and use
piped-process instead.
---
guix/inferior.scm | 39 ++++++++++++++++++---------------------
1 file changed, 18 insertions(+), 21 deletions(-)

Toggle diff (59 lines)
diff --git a/guix/inferior.scm b/guix/inferior.scm
index 54200b75e4..a4a4c1d40e 100644
--- a/guix/inferior.scm
+++ b/guix/inferior.scm
@@ -134,6 +134,10 @@ (define (write-inferior inferior port)
(set-record-type-printer! <inferior> write-inferior)
+(eval-when (expand load eval)
+ (load-extension (string-append "libguile-" (effective-version))
+ "scm_init_popen"))
+
(define (open-bidirectional-pipe command . args)
"Open a bidirectional pipe to COMMAND invoked with ARGS and return it, as a
regular file port (socket).
@@ -147,27 +151,20 @@ (define (open-bidirectional-pipe command . args)
;; the REPL process wouldn't get EOF on standard input.
(match (socketpair AF_UNIX (logior SOCK_STREAM SOCK_CLOEXEC) 0)
((parent . child)
- (match (primitive-fork)
- (0
- (dynamic-wind
- (lambda ()
- #t)
- (lambda ()
- (close-port parent)
- (close-fdes 0)
- (close-fdes 1)
- (dup2 (fileno child) 0)
- (dup2 (fileno child) 1)
- ;; Mimic 'open-pipe*'.
- (unless (file-port? (current-error-port))
- (close-fdes 2)
- (dup2 (open-fdes "/dev/null" O_WRONLY) 2))
- (apply execlp command command args))
- (lambda ()
- (primitive-_exit 127))))
- (pid
- (close-port child)
- (values parent pid))))))
+ (let* ((proc
+ (lambda ()
+ (piped-process command args
+ ;; Use port->fdes on child to increase the
+ ;; revealed count, so that the fd does not get
+ ;; closed again when child gets gc'd, as
+ ;; piped-process already closes it.
+ (cons (fileno parent) (port->fdes child))
+ (cons (fileno child) (fileno parent)))))
+ (pid
+ (if (file-port? (current-error-port))
+ (proc)
+ (with-error-to-file "/dev/null" proc))))
+ (values parent pid)))))
(define* (inferior-pipe directory command error-port)
"Return two values: an input/output pipe on the Guix instance in DIRECTORY
--
2.36.0
J
J
Josselin Poiret wrote on 26 May 2022 22:50
[PATCH 1/2] Fix child spawning closing standard fds prematurely
20220526205047.23960-1-dev@jpoiret.xyz
* libguile/posix.c (renumber_file_descriptor): Refactor it as
dup_handle_error.
(dup_handle_error, dup2_handle_error): New functions that wrap around
dup and dup2 by retrying on EINTR or EBUSY, as well as erroring out on
other errors.
(start_child): Close standard file descriptors only
after all of them have been dup2'd.
---
libguile/posix.c | 82 +++++++++++++++++++++++++++++++-----------------
1 file changed, 53 insertions(+), 29 deletions(-)

Toggle diff (110 lines)
diff --git a/libguile/posix.c b/libguile/posix.c
index 3ab12b99e..dc3080b3c 100644
--- a/libguile/posix.c
+++ b/libguile/posix.c
@@ -1280,14 +1280,14 @@ SCM_DEFINE (scm_fork, "primitive-fork", 0, 0, 0,
#endif /* HAVE_FORK */
#ifdef HAVE_FORK
-/* 'renumber_file_descriptor' is a helper function for 'start_child'
- below, and is specialized for that particular environment where it
- doesn't make sense to report errors via exceptions. It uses dup(2)
- to duplicate the file descriptor FD, closes the original FD, and
- returns the new descriptor. If dup(2) fails, print an error message
- to ERR and abort. */
+/* 'dup_handle_error' is a helper function for 'start_child' below, and
+ is specialized for that particular environment where it doesn't make
+ sense to report errors via exceptions. It uses dup(2) to duplicate
+ the file descriptor FD, does *not* close the original FD, and returns
+ the new descriptor. If dup(2) fails, print an error message to ERR
+ and abort. */
static int
-renumber_file_descriptor (int fd, int err)
+dup_handle_error (int fd, int err)
{
int new_fd;
@@ -1304,7 +1304,33 @@ renumber_file_descriptor (int fd, int err)
_exit (127); /* Use exit status 127, as with other exec errors. */
}
- close (fd);
+ return new_fd;
+}
+
+/* 'dup2_handle_error' is a helper function for 'start_child' below, and
+ is specialized for that particular environment where it doesn't make
+ sense to report errors via exceptions. It uses dup2(2) to duplicate
+ the file descriptor FD, does *not* close the original FD, and returns
+ the new descriptor. If dup2(2) fails, print an error message to ERR
+ and abort. */
+static int
+dup2_handle_error (int fd, int to, int err)
+{
+ int new_fd;
+
+ do
+ new_fd = dup2 (fd, to);
+ while (new_fd == -1 && (errno == EINTR || errno == EBUSY));
+
+ if (new_fd == -1)
+ {
+ /* At this point we are in the child process before exec. We
+ cannot safely raise an exception in this environment. */
+ const char *msg = strerror (errno);
+ fprintf (fdopen (err, "a"), "start_child: dup failed: %s\n", msg);
+ _exit (127); /* Use exit status 127, as with other exec errors. */
+ }
+
return new_fd;
}
#endif /* HAVE_FORK */
@@ -1357,27 +1383,25 @@ start_child (const char *exec_file, char **exec_argv,
if (err == -1)
err = open ("/dev/null", O_WRONLY);
- if (in > 0)
- {
- if (out == 0)
- out = renumber_file_descriptor (out, err);
- if (err == 0)
- err = renumber_file_descriptor (err, err);
- do dup2 (in, 0); while (errno == EINTR);
- close (in);
- }
- if (out > 1)
- {
- if (err == 1)
- err = renumber_file_descriptor (err, err);
- do dup2 (out, 1); while (errno == EINTR);
- close (out);
- }
- if (err > 2)
- {
- do dup2 (err, 2); while (errno == EINTR);
- close (err);
- }
+ /* Dup each non-yet-dup2'd fd that's in the way to the next available fd,
+ so that we can safely dup2 to 0/1/2 without potentially overwriting
+ in/out/err. Note that dup2 doesn't do anything if its arguments are
+ equal. */
+ if (out == 0)
+ out = dup_handle_error (out, err);
+ if (err == 0)
+ err = dup_handle_error (err, err);
+ dup2_handle_error (in, 0, err);
+
+ if (err == 1)
+ err = dup_handle_error (err, err);
+ dup2_handle_error (out, 1, err);
+
+ dup2_handle_error (err, 2, err);
+
+ if (in > 2) close (in);
+ if (out > 2) close (out);
+ if (err > 2) close (err);
execvp (exec_file, exec_argv);
--
2.36.0
J
J
Josselin Poiret wrote on 26 May 2022 22:50
[PATCH 2/2] Improve thread safety of piped-process.
20220526205047.23960-2-dev@jpoiret.xyz
* libguile/posix.c (scm_piped_process): Avoid double close.
---
libguile/posix.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Toggle diff (15 lines)
diff --git a/libguile/posix.c b/libguile/posix.c
index dc3080b3c..e134408e3 100644
--- a/libguile/posix.c
+++ b/libguile/posix.c
@@ -1486,7 +1486,7 @@ scm_piped_process (SCM prog, SCM args, SCM from, SCM to)
SCM_SYSERROR;
}
- if (reading)
+ if (reading && !(writing && c2p[1] == p2c[0]))
close (c2p[1]);
if (writing)
close (p2c[0]);
--
2.36.0
M
M
Maxime Devos wrote on 26 May 2022 22:55
Re: bug#55441: [PATCH 1/2] Fix child spawning closing standard fds prematurely
(address . 55441@debbugs.gnu.org)
4e763e7f1dafa427a1cf82df3241b90282085931.camel@telenet.be
Josselin Poiret via Bug reports for GNU Guix schreef op do 26-05-2022
om 22:50 [+0200]:
Toggle quote (2 lines)
> +      fprintf (fdopen (err, "a"), "start_child: dup failed: %s\n", msg);

fdopen can fail due to out-of-memory problems, so you'll have to check
the return value, or use dprintf instead which accepts a file
descriptor.

Greetings,
Maxime.
-----BEGIN PGP SIGNATURE-----

iI0EABYKADUWIQTB8z7iDFKP233XAR9J4+4iGRcl7gUCYo/pNRccbWF4aW1lZGV2
b3NAdGVsZW5ldC5iZQAKCRBJ4+4iGRcl7talAP9oVkIikXdtWSzxvOeu2raaTF02
ReTzA5XqSaNnJhyCQwEAtzuFYJrpDBQj5uTwIE14lv97DjuYJHQl9SYhjRmK5Qk=
=wFJT
-----END PGP SIGNATURE-----


M
M
Maxime Devos wrote on 26 May 2022 22:58
Re: bug#55441: [cuirass] hang in "In progress..."; runs out of pgsql connections
(address . 55441@debbugs.gnu.org)
16f8aa929bcba3536da4701f9b1152f28ca4b893.camel@telenet.be
Josselin Poiret via Bug reports for GNU Guix schreef op do 26-05-2022
om 22:49 [+0200]:
Toggle quote (15 lines)
> Hello everyone,
>
> With all the recent platform changes, I was looking at `make
> cuirass-jobs` for the first time and it left some deadlocked inferiors
> as well, so I figured I could try my own solution, since I've already
> dabbled in Guile POSIX internals! Here's a patch that rewrites
> open-bidirectional-pipe with a more thread-safe alternative,
> `piped-process`, but this also requires some Guile patches, that I will
> post as well. With these, `make cuirass-jobs` doesn't hang at all
> anymore for me.
>
> Let me know what you think of these
>
> Best,

The process opening code looks fragile, so I think some tests are
required to avoid causing regressions later.

Greetings,
Maxime.
-----BEGIN PGP SIGNATURE-----

iI0EABYKADUWIQTB8z7iDFKP233XAR9J4+4iGRcl7gUCYo/qAhccbWF4aW1lZGV2
b3NAdGVsZW5ldC5iZQAKCRBJ4+4iGRcl7qOgAP9m6yapGzXa40adLWf3FBueIL8/
PFrPZk7zLX9PlvzUswD/Wf7op9MzB9rWvwEBRppdN55wbFD2SzQk4DgPI9Ru0wE=
=WKTD
-----END PGP SIGNATURE-----


M
M
Maxime Devos wrote on 26 May 2022 23:02
Re: bug#55441: [PATCH] guix: inferior: Make open-bidirectional-pipe use piped-process.
(address . 55441@debbugs.gnu.org)
273851b9eabb7f2fa0ae02ec6aaf074292e35580.camel@telenet.be
Josselin Poiret via Bug reports for GNU Guix schreef op do 26-05-2022
om 22:50 [+0200]:
Toggle quote (3 lines)
> * guix/inferior.scm: Load up scm_init_popen extension to be able to
> use piped-process.

Can we export the procedure instead from (ice-9 popen) + use @@ for
now, instead of doing this low-level thing?

Greetings,
Maxime.
-----BEGIN PGP SIGNATURE-----

iI0EABYKADUWIQTB8z7iDFKP233XAR9J4+4iGRcl7gUCYo/qzBccbWF4aW1lZGV2
b3NAdGVsZW5ldC5iZQAKCRBJ4+4iGRcl7rTyAP4qrXSn/6ihdtnwDVrh6Bp0dHLR
M+JUowRjeiry8rvnVwD8C34OeGVOxxEXceful4LCdf3XYJyNANUC7iNclxHu7gM=
=W4a0
-----END PGP SIGNATURE-----


J
J
Josselin Poiret wrote on 28 May 2022 16:02
Re: bug#55441: [cuirass] hang in "In progress..."; runs out of pgsql connections
(address . 55441@debbugs.gnu.org)
87ilpphi12.fsf@jpoiret.xyz
Hello Maxime,

Thanks for your review. I took your comments into account and actually
wrote some more patches to clean up start_child of potential thread
unsafe practices. Since these are not guix-specific, I sent the whole
patchset to guile-devel instead [1].

Maxime Devos <maximedevos@telenet.be> writes:
Toggle quote (3 lines)
> The process opening code looks fragile, so I think some tests are
> required to avoid causing regressions later.

How would you suggest we do that? Since these are mostly racing issues
between threads, I don't know how we could reproduce these reliably in
tests.

[1] v4 PATCH at
(20220528124634.17353-1-dev@jpoiret.xyz)

Best,
--
Josselin Poiret
L
L
Ludovic Courtès wrote on 28 May 2022 19:15
Re: [PATCH] guix: inferior: Make open-bidirectional-pipe use piped-process.
(name . Josselin Poiret)(address . dev@jpoiret.xyz)
87h759bmt3.fsf@gnu.org
Hi,

Josselin Poiret <dev@jpoiret.xyz> skribis:

Toggle quote (5 lines)
> * guix/inferior.scm: Load up scm_init_popen extension to be able to
> use piped-process.
> (open-bidirectional-pipe): Remove custom fork+exec code and use
> piped-process instead.

[...]

Toggle quote (4 lines)
> +(eval-when (expand load eval)
> + (load-extension (string-append "libguile-" (effective-version))
> + "scm_init_popen"))

To me this is not an option because it’s using an internal function and
it’s too fragile (that C function is not meant to be called more than
once).

That said, I agree that in principle we should be doing something along
these lines: calling out to a C function that does fork+dup+exec in a
safe way.

Thanks for looking into this!

Ludo’.
L
L
Ludovic Courtès wrote on 26 Jan 2023 15:39
Re: bug#55441: [cuirass] hang in "In progress..."; runs out of pgsql connections
(name . Josselin Poiret)(address . dev@jpoiret.xyz)
87pmb1nyv6.fsf@gnu.org
Hi there!

I have some good news! Time has passed but we didn’t remain idle :-)
and the scary bug at https://issues.guix.gnu.org/55441 is finding a
happy end!

First, Josselin introduced bindings to ‘posix_spawn’¹ in Guile, to use
as the foundation of a new ‘spawn’ procedure, but also for ‘system*’,
‘open-pipe*’, & co.:


That’s big deal because ‘posix_spawn’ is usable in a multi-threaded
context (POSIX doesn’t explicitly say so, but it stems from the fact
that what happens in the child process before ‘exec’ is “safe”, and in
particular Guile code doesn’t get a chance to run).

Guile 3.0.9 released yesterday includes those changes.

Then we have this commit, where I changed ‘open-bidirectional-pipe’ in
(guix inferior) to use ‘spawn’ where available:

fed3953d70 inferior: Use 'spawn' on Guile 3.0.9+.

That solves the problem that led to this bug.

Last, we have this update:

230de2e94b gnu: guix: Update to d5fece6.

It gives us a ‘guix’ package that contains the ‘spawn’ fix and removes
‘guix-for-cuirass’, which is now longer necessary.

Thanks everyone for the team work, very nice when working on touch
issues like this one!

Ludo’.

L
L
Ludovic Courtès wrote on 26 Jan 2023 15:41
control message for bug #55441
(address . control@debbugs.gnu.org)
87o7qlnys8.fsf@gnu.org
retitle 55441 Use of 'primitive-fork' in (guix inferior) leads to hangs in 'cuirass evaluate'
quit
L
L
Ludovic Courtès wrote on 26 Jan 2023 15:41
(address . control@debbugs.gnu.org)
87mt65nys4.fsf@gnu.org
close 55441
quit
?