Module system thread unsafety and .go compilation

  • Done
Details
3 participants
  • Ludovic Courtès
  • Maxim Cournoyer
  • Taylan Ulrich Bayırlı/Kammer
Owner
unassigned
Submitted by
Taylan Ulrich Bayırlı/Kammer
Severity
important
Taylan Ulrich Bayırlı/Kammer wrote on 9 Feb 2016 21:02
(address . bug-guix@gnu.org)(address . guile-devel@gnu.org)
8737t1k5yk.fsf@T420.taylan
To speed up the compilation of the many Scheme files in Guix, we use a
script that first loads all modules to be compiled into the Guile
process (by calling 'resolve-interface' on the module names), and then
the corresponding Scheme files are compiled in a par-for-each.
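For illustration, the two-phase approach can be sketched roughly as follows. This is a minimal sketch and not the actual Guix build script: the helper `file->module-name`, the procedure name `load-then-compile`, and the ".go" output naming are assumptions made for the example.

```scheme
(use-modules (ice-9 threads)          ; par-for-each
             (system base compile))   ; compile-file

;; Hypothetical helper: "gnu/build/activation.scm" -> (gnu build activation).
(define (file->module-name file)
  (map string->symbol
       (string-split (string-drop-right file (string-length ".scm"))
                     #\/)))

;; Phase 1 (serial): load every module, so that in theory all mutation of
;; the module system happens here.
;; Phase 2 (parallel): compile each file in its own thread.
(define (load-then-compile files)
  (for-each (lambda (file)
              (resolve-interface (file->module-name file)))
            files)
  (par-for-each (lambda (file)
                  ;; Assumed naming: foo.scm is compiled to foo.go.
                  (compile-file file
                                #:output-file
                                (string-append
                                 (string-drop-right file (string-length ".scm"))
                                 ".go")))
                files))
```

The sketch only makes sense if phase 2 never touches the module system, which is exactly the assumption in question here.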

While Guile's module system is known to be thread unsafe, the idea was
that all mutation should happen in the serial loading phase, and the
parallel compile-file calls should then be thread safe.

Sadly that assumption isn't met when autoloads are involved.
Minimal-ish test-case:

- Check out 0889321.

- Build it.

- Edit gnu/build/activation.scm and gnu/build/linux-boot.scm to contain
merely the following expressions, respectively:

(define-module (gnu build activation)
  #:use-module (gnu build linux-boot))

(define-module (gnu build linux-boot)
  #:autoload (system base compile) (compile-file))

- Run make again.

If you're on a multi-core system, you will probably get an error saying
something weird like "no such language scheme".

Note: when you then run make *again* it succeeds.


Solution proposals:

1. s/par-for-each/for-each/. Will make compilation slower on multi-core
machines. We would do the same for guix pull, which is a bit sad
because it's so fast right now. Very simple solution though.

2. We find some partitioning of the Scheme modules such that there
is minimal overlap in total loaded modules when the modules in one
subset are each loaded by one Guile process. Then each Guile process
loads & compiles the modules in its given subset serially, but these
Guile processes run in parallel. This could speed things up even
more than now because the module-loading phases of the processes
would be parallel too. It also has the side effect that less memory
is consumed the fewer cores you have (because fewer Scheme modules
are loaded into memory at once). If someone (Ludo?) has a good general
overview of Guix's module graph then maybe they can come up with a
sensible partitioning of the modules, say into 4 subsets (maxing out
benefits at quad-core), such that loading all modules in one subset
loads a minimal number of modules that are outside that subset. That
should be the only challenging part of this solution.

3. We do nothing for now since this bug triggers rarely, and can be
worked around by simply re-running make. (We just have to hope that
it doesn't trigger on guix pull or on clean builds after some commit;
there's no "just rerun make" in guix pull or an automated build of
Guix.) AFAIU Wingo expressed motivation to make Guile's module
system thread safe, so this problem would then truly disappear.

I think #2 is a pretty good solution. The only thing worrying me is
that we might not be able to sensibly partition the Scheme modules
according to any simple logic that can be automated (like guix/ is one
subset, gnu/packages/ is another, etc.). Maintaining the subsets
manually in the Makefile would be pretty ugly. But maybe some simple
logic, possibly combined with a few special cases in the code, would be
good enough.
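
As a rough illustration of the "simple logic" meant above, files could be assigned to subsets by directory prefix. This is only a sketch: the prefix list `%prefixes` and the procedure `partition-by-prefix` are hypothetical, and nothing here checks whether the resulting subsets actually minimize the overlap in loaded modules.

```scheme
(use-modules (srfi srfi-1))   ; find

;; Hypothetical subsets, tried in order; the empty prefix catches the rest.
(define %prefixes '("gnu/packages/" "gnu/" "guix/" ""))

;; Return an alist mapping each prefix to the files it claims.  Each file
;; goes to the first prefix in PREFIXES that it starts with.
(define (partition-by-prefix files prefixes)
  (map (lambda (prefix)
         (cons prefix
               (filter (lambda (file)
                         (equal? prefix
                                 (find (lambda (p) (string-prefix? p file))
                                       prefixes)))
                       files)))
       prefixes))
```

Each resulting subset would then be handed to its own Guile process, which loads and compiles its files serially.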

Thoughts?

Taylan
Ludovic Courtès wrote on 10 Feb 2016 14:50
(name . Taylan Ulrich "Bayırlı/Kammer")(address . taylanbayirli@gmail.com)
87fux0n07o.fsf@gnu.org
taylanbayirli@gmail.com (Taylan Ulrich "Bayırlı/Kammer") skribis:

> Sadly that assumption isn't met when autoloads are involved.
> Minimal-ish test-case:
>
> - Check out 0889321.
>
> - Build it.
>
> - Edit gnu/build/activation.scm and gnu/build/linux-boot.scm to contain
> merely the following expressions, respectively:
>
> (define-module (gnu build activation)
> #:use-module (gnu build linux-boot))
>
> (define-module (gnu build linux-boot)
> #:autoload (system base compile) (compile-file))
>
> - Run make again.
>
> If you're on a multi-core system, you will probably get an error saying
> something weird like "no such language scheme".

Do you have a clear explanation of why this happens? I would expect
(system base compile) to already be loaded for instance, so it’s not
clear to me what’s going on. Or is it just the mutation of (gnu build
linux-boot) that’s causing problems?

> Solution proposals:
>
> 1. s/par-for-each/for-each/. Will make compilation slower on multi-core
> machines. We would do the same for guix pull, which is a bit sad
> because it's so fast right now. Very simple solution though.
>
> 2. We find out some partitioning of the Scheme modules such that there
> is minimal overlap in total loaded modules when the modules in one
> subset are each loaded by one Guile process. Then each Guile process
> loads & compiles the modules in its given subset serially, but these
> Guile processes run in parallel. This could speed things up even
> more than now because the module-loading phases of the processes
> would be parallel too. It also has the side-effect that less memory
> is consumed the fewer cores you have (because less Scheme modules
> loaded into memory at once). If someone (Ludo?) has a good general
> overview of Guix's module graph then maybe they can come up with a
> sensible partitioning of the modules, say into 4 subsets (maxing out
> benefits at quad-core), such that loading all modules in one subset
> loads a minimal amount of modules that are outside that subset. That
> should be the only challenging part of this solution.
>
> 3. We do nothing for now since this bug triggers rarely, and can be
> worked around by simply re-running make. (We just have to hope that
> it doesn't trigger on guix pull or on clean builds after some commit;
> there's no "just rerun make" in guix pull or an automated build of
> Guix.) AFAIU Wingo expressed motivation to make Guile's module
> system thread safe, so this problem would then truly disappear.

Short-term, I’d do #1 or #3; probably #1 though, because random failures
are no fun, and we know they can happen.

Longer-term, I’m not convinced by #2. I think I would instead build
packages in reverse topological order, probably serially at first, which
would address http://bugs.gnu.org/15602 (with the caveat that the (gnu
packages …) modules cannot be topologically-sorted, but OTOH they
typically don’t use macros, so we’re fine.)

That would require a tool to extract the ‘define-module’ forms and
build a graph from there.
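
A minimal sketch of what the extraction step of such a tool might look like. The procedure names are hypothetical; it assumes the ‘define-module’ form is the first expression in each file, and it counts #:autoload modules as dependencies, which is a design choice rather than a given.

```scheme
;; A #:use-module spec is either a module name such as (gnu build linux-boot)
;; or a list like ((srfi srfi-1) #:select (fold)); return just the name.
(define (spec->module-name spec)
  (if (pair? (car spec)) (car spec) spec))

;; FORM is (define-module NAME ARGS ...).  Return the names of the modules
;; it imports through #:use-module and #:autoload.
(define (define-module-dependencies form)
  (let loop ((args (cddr form)) (deps '()))
    (cond ((or (null? args) (null? (cdr args)))
           (reverse deps))
          ((eq? (car args) #:use-module)
           (loop (cddr args)
                 (cons (spec->module-name (cadr args)) deps)))
          ((eq? (car args) #:autoload)
           ;; #:autoload MODULE (BINDING ...) takes two arguments.
           (loop (if (null? (cddr args)) '() (cdddr args))
                 (cons (cadr args) deps)))
          (else
           (loop (cdr args) deps)))))

;; Return (NAME DEPENDENCY ...) for FILE, or #f when FILE does not start
;; with a 'define-module' form.
(define (file-dependencies file)
  (call-with-input-file file
    (lambda (port)
      (let ((form (read port)))
        (and (pair? form)
             (eq? (car form) 'define-module)
             (cons (cadr form) (define-module-dependencies form)))))))
```

Mapping `file-dependencies` over all source files yields the adjacency lists from which a graph could be built and sorted.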

But really, we must fix http://bugs.gnu.org/15602, and in particular,
‘compile-file’ should not mutate the global module name space. I think
we could do something like:

(define (compile-file* …)
  (let ((root the-root-module)
        (compile-root (copy-module the-root-module)))
    (dynamic-wind
      (lambda ()
        (set! the-root-module compile-root)
        ;; ditto with the-scm-module
        )
      (lambda ()
        (compile-file …))
      (lambda ()
        (set! the-root-module root)
        ;; …
        ))))

It’s unclear how costly ‘copy-module’ would be, and the whole strategy
depends on it.

Eventually it seems clear that Guile proper needs to address this use
case, and needs to provide thread-safe modules.

Ludo’.
Ludovic Courtès wrote on 23 Feb 2016 14:26
control message for bug #22608
(address . control@debbugs.gnu.org)
87lh6bk169.fsf@gnu.org
severity 22608 important
Ludovic Courtès wrote on 4 Jul 2018 00:10
Re: bug#22608: Module system thread unsafety and .go compilation
(name . Taylan Ulrich "Bayırlı/Kammer")(address . taylanbayirli@gmail.com)
87d0w4ezst.fsf@gnu.org
Hello,

taylanbayirli@gmail.com (Taylan Ulrich "Bayırlı/Kammer") skribis:

> To speed up the compilation of the many Scheme files in Guix, we use a
> script that first loads all modules to be compiled into the Guile
> process (by calling 'resolve-interface' on the module names), and then
> the corresponding Scheme files are compiled in a par-for-each.
>
> While Guile's module system is known to be thread unsafe, the idea was
> that all mutation should happen in the serial loading phase, and the
> parallel compile-file calls should then be thread safe.
>
> Sadly that assumption isn't met when autoloads are involved.

For the record, these issues should be fixed in Guile 2.2.4:

533e3ff17 * Serialize accesses to submodule hash tables.
46bcbfa56 * Module import obarrays are accessed in a critical section.
761cf0fb8 * Make module autoloading thread-safe.

‘guix pull’ now defaults to 2.2.4, so we’ll see if indeed those crashes
disappear.

Ludo’.
Maxim Cournoyer wrote on 8 Oct 2022 02:21
(name . Ludovic Courtès)(address . ludo@gnu.org)
87lepr412m.fsf@gmail.com
Hi,

ludo@gnu.org (Ludovic Courtès) writes:

> Hello,
>
> taylanbayirli@gmail.com (Taylan Ulrich "Bayırlı/Kammer") skribis:
>
>> To speed up the compilation of the many Scheme files in Guix, we use a
>> script that first loads all modules to be compiled into the Guile
>> process (by calling 'resolve-interface' on the module names), and then
>> the corresponding Scheme files are compiled in a par-for-each.
>>
>> While Guile's module system is known to be thread unsafe, the idea was
>> that all mutation should happen in the serial loading phase, and the
>> parallel compile-file calls should then be thread safe.
>>
>> Sadly that assumption isn't met when autoloads are involved.
>
> For the record, these issues should be fixed in Guile 2.2.4:
>
> 533e3ff17 * Serialize accesses to submodule hash tables.
> 46bcbfa56 * Module import obarrays are accessed in a critical section.
> 761cf0fb8 * Make module autoloading thread-safe.
>
> ‘guix pull’ now defaults to 2.2.4, so we’ll see if indeed those crashes
> disappear.

I think we haven't seen these in the last 4 years! We still have
references to https://bugs.gnu.org/15602 in our code base, though,
although the upstream issue appears to have been fixed. Could we remove
the workarounds now?

--
Thanks,
Maxim
Ludovic Courtès wrote on 10 Oct 2022 10:07
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)
87v8os3xur.fsf@gnu.org
Hi!

Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

> ludo@gnu.org (Ludovic Courtès) writes:

[...]

>> For the record, these issues should be fixed in Guile 2.2.4:
>>
>> 533e3ff17 * Serialize accesses to submodule hash tables.
>> 46bcbfa56 * Module import obarrays are accessed in a critical section.
>> 761cf0fb8 * Make module autoloading thread-safe.
>>
>> ‘guix pull’ now defaults to 2.2.4, so we’ll see if indeed those crashes
>> disappear.
>
> I think we haven't seen these in the last 4 years! We still have
> references to https://bugs.gnu.org/15602 in our code base though;
> although the upstream issue appears to have been fixed. Could we remove
> the workarounds now?

The module thread-safety issue discussed here appears to be done.

However the workarounds for https://bugs.gnu.org/15602 must remain:
that specific issue is still there.

Thanks,
Ludo’.
Maxim Cournoyer wrote on 12 Oct 2022 16:24
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 22608-done@debbugs.gnu.org)
87czax15n1.fsf@gmail.com
Hi,

Ludovic Courtès <ludo@gnu.org> writes:

> Hi!
>
> Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:
>
>> ludo@gnu.org (Ludovic Courtès) writes:
>
> [...]
>
>>> For the record, these issues should be fixed in Guile 2.2.4:
>>>
>>> 533e3ff17 * Serialize accesses to submodule hash tables.
>>> 46bcbfa56 * Module import obarrays are accessed in a critical section.
>>> 761cf0fb8 * Make module autoloading thread-safe.
>>>
>>> ‘guix pull’ now defaults to 2.2.4, so we’ll see if indeed those crashes
>>> disappear.
>>
>> I think we haven't seen these in the last 4 years! We still have
>> references to https://bugs.gnu.org/15602 in our code base though;
>> although the upstream issue appears to have been fixed. Could we remove
>> the workarounds now?
>
> The module thread-safety issue discussed here appears to be done.

Alright, I'm closing this one then.

> However the workarounds for https://bugs.gnu.org/15602 must remain:
> that specific issue is still there.

Thanks for the heads-up!

--
Maxim
Closed