AArch64 Honeycomb builders are inactive

  • Open
Details
7 participants
  • Greg Hogan
  • Ludovic Courtès
  • Maxim Cournoyer
  • Maxime Devos
  • Mathieu Othacehe
  • Ricardo Wurmus
  • Tom Fitzhenry
Owner
unassigned
Submitted by
Mathieu Othacehe
Severity
important
Mathieu Othacehe wrote on 8 Jun 2022 17:31
[cuirass] workers stalled
(address . bug-guix@gnu.org)
87h74v2mu7.fsf@gnu.org
Hello,

The aarch64 workers were all idle whereas 70k builds were
available. Once restarted, they started building again.

The problem might be that when the server is unavailable for a while the
worker connections expire and cannot be resumed once the server is
available again.

Thanks,

Mathieu
Greg Hogan wrote on 8 Jun 2022 21:07
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 55848@debbugs.gnu.org)
CA+3U0ZmY4jcrZ6FPeQCgHHZKpvSh0BZ1ki5KTXfGemQRf-ZOkw@mail.gmail.com
On Wed, Jun 8, 2022 at 11:32 AM Mathieu Othacehe <othacehe@gnu.org> wrote:
>
>
> Hello,
>
> The aarch64 workers were all idle whereas 70k builds were
> available. Once restarted, they started building again.
>
> The problem might be that when the server is unavailable for a while the
> worker connections expire and cannot be resumed once the server is
> available again.
>
> Thanks,
>
> Mathieu

The recent aarch64 builds look to all be failing with the following message.

===== <cut> =====
substitute: updating substitutes from 'https://ci.guix.gnu.org'...
0.0%guix substitute: error: TLS error in procedure 'handshake': Error
in the pull function.
===== </cut> =====
Tom Fitzhenry wrote on 11 Jun 2022 12:44
(name . Greg Hogan)(address . code@greghogan.com)
878rq3scn3.fsf@tom-fitzhenry.me.uk
Greg Hogan <code@greghogan.com> writes:

> On Wed, Jun 8, 2022 at 11:32 AM Mathieu Othacehe <othacehe@gnu.org> wrote:
>> The aarch64 workers were all idle whereas 70k builds were
>> available. Once restarted, they started building again.

From following the builds on http://ci.guix.gnu.org/workers, many
(all?) builds are failing on the following workers:

* grunewald
* kreuzberg
* pankow

The builds are failing with the same error:

"substitute: updating substitutes from 'https://ci.guix.gnu.org'...
0.0%guix substitute: error: TLS error in procedure 'handshake': Error in
the pull function."

Here's some examples:


On worker overdrive1, the raw log shows the rust-async-mutex build
managing to pull substitutes, but it seems to be compiling rust-1.57
itself.
Ludovic Courtès wrote on 11 Jun 2022 22:33
control message for bug #55848
(address . control@debbugs.gnu.org)
8735gbx7mk.fsf@gnu.org
severity 55848 important
quit
Ludovic Courtès wrote on 12 Jun 2022 15:33
Re: bug#55848: [cuirass] workers stalled
(name . Tom Fitzhenry)(address . tom@tom-fitzhenry.me.uk)
87bkuyvwf0.fsf@gnu.org
Hi,

(+Cc: guix-sysadmin)

Tom Fitzhenry <tom@tom-fitzhenry.me.uk> skribis:

> From following the builds on http://ci.guix.gnu.org/workers, many
> (all?) builds are failing on the following workers:
>
> * grunewald
> * kreuzberg
> * pankow
>
> The builds are failing with the same error:
>
> "substitute: updating substitutes from 'https://ci.guix.gnu.org'...
> 0.0%guix substitute: error: TLS error in procedure 'handshake': Error in
> the pull function."

On these machines, https://ci.guix.gnu.org (among others) is unavailable
for some reason (firewall I guess):

ludo@grunewald ~$ wget --debug -O/dev/null https://ci.guix.gnu.org
Setting --output-document (outputdocument) to /dev/null
DEBUG output created by Wget 1.21.1 on linux-gnu.

Reading HSTS entries from /home/ludo/.wget-hsts
URI encoding = ‘UTF-8’
--2022-06-11 22:38:59-- https://ci.guix.gnu.org/
Certificates loaded: 444
Resolving ci.guix.gnu.org (ci.guix.gnu.org)... 141.80.181.40
Caching ci.guix.gnu.org => 141.80.181.40
Connecting to ci.guix.gnu.org (ci.guix.gnu.org)|141.80.181.40|:443... connected.
Created socket 4.
Releasing 0x000000001fd26b50 (new refcount 1).

[Sits there forever…]
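
A hang like the one above can be turned into a quick pass/fail check by putting a deadline on the probe. The following is only a sketch under assumptions: the `probe` helper name and the 15-second limit are made up for illustration and are not something used on the build machines. It relies on `timeout` from GNU coreutils, which exits with status 124 when the deadline expires.

```shell
#!/bin/sh
# probe SECONDS COMMAND...: run COMMAND, but give up after SECONDS.
# Hypothetical helper, for illustration only.
probe() {
    limit=$1
    shift
    status=0
    # `timeout` (GNU coreutils) kills COMMAND at the deadline and
    # then exits with status 124.
    timeout "$limit" "$@" || status=$?
    if [ "$status" -eq 124 ]; then
        echo "timed out after ${limit}s: $*" >&2
    fi
    return "$status"
}

# Assumed invocation on an affected machine (needs network access):
#   probe 15 wget -O /dev/null https://ci.guix.gnu.org
```

A status of 124 then distinguishes "connected but the handshake never completes" from an ordinary failure reported by wget itself.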

These machines are configured using ‘honeycomb-system’ from (sysadmin
honeycomb) in maintenance.git.

guix-daemon is configured to use the default substitute URLs,
https://ci.guix.gnu.org and https://bordeaux.guix.gnu.org, which we know
are unreachable.

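I’ve theoretically addressed this here:

https://git.savannah.gnu.org/cgit/guix/maintenance.git/commit/?id=99bd9dc9001d6bea7480a7ce0e0e10ff78adb787
https://git.savannah.gnu.org/cgit/guix/maintenance.git/commit/?id=b0661cc7d6dd74b0aeac3b052a80a8a2fef2af9c
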
I tried to reconfigure those boxes with ‘guix deploy’, but this is
currently on hold because ci.guix has run out of inodes…

To be continued!

Ludo’.
Ricardo Wurmus wrote on 12 Jun 2022 18:10
(name . Ludovic Courtès)(address . ludo@gnu.org)
871qvt2701.fsf@elephly.net
Ludovic Courtès <ludo@gnu.org> writes:

> Hi,
>
> (+Cc: guix-sysadmin)
>
> Tom Fitzhenry <tom@tom-fitzhenry.me.uk> skribis:
>
>> From following the builds on http://ci.guix.gnu.org/workers, many
>> (all?) builds are failing on the following workers:
>>
>> * grunewald
>> * kreuzberg
>> * pankow
>>
>> The builds are failing with the same error:
>>
>> "substitute: updating substitutes from 'https://ci.guix.gnu.org'...
>> 0.0%guix substitute: error: TLS error in procedure 'handshake': Error in
>> the pull function."
>
> On these machines, https://ci.guix.gnu.org (among others) is unavailable
> for some reason (firewall I guess):

They should be using the local IP instead of routing through the
internet, so /etc/hosts should contain an entry for

141.80.167.131 ci.guix.gnu.org

(We have the same entry on the other build nodes hosted at the MDC.)
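
Whether a given node actually carries that mapping can be checked with a few lines of shell. This is a rough sketch, not part of the deployed configuration: the `has_hosts_entry` helper is hypothetical, and on Guix System the hosts file is normally generated from the system configuration, so the check only verifies the deployed result.

```shell
#!/bin/sh
# has_hosts_entry FILE IP NAME: succeed if FILE maps NAME to IP.
# Hypothetical helper; trailing comments and extra aliases are tolerated.
has_hosts_entry() {
    awk -v ip="$2" -v name="$3" '
        { sub(/#.*/, "") }            # drop trailing comments
        $1 == ip {
            for (i = 2; i <= NF; i++)
                if ($i == name) found = 1
        }
        END { exit !found }           # exit 0 only if a match was seen
    ' "$1"
}

# Assumed usage on a build node:
#   has_hosts_entry /etc/hosts 141.80.167.131 ci.guix.gnu.org \
#       && echo "local mapping present"
```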

“guix deploy” did not work on these nodes due to a serious problem: they
were given *some* x86_64 binaries to execute, so deployed systems were
unbootable. Since we don’t have a serial interface through which you
could debug this remotely, please make sure not to deploy a broken
system. I’d like to avoid trips to the data centre.

--
Ricardo
Ludovic Courtès wrote on 12 Jun 2022 22:22
(name . Ricardo Wurmus)(address . rekado@elephly.net)
87k09lvdh2.fsf@gnu.org
Ricardo Wurmus <rekado@elephly.net> skribis:

> They should be using the local IP instead of routing through the
> internet, so /etc/hosts should contain an entry for
>
> 141.80.167.131 ci.guix.gnu.org

Good idea.

> “guix deploy” did not work on these nodes due to a serious problem: they
> were given *some* x86_64 binaries to execute, so deployed systems were
> unbootable. Since we don’t have a serial interface through which you
> could debug this remotely, please make sure not to deploy a broken
> system. I’d like to avoid trips to the data centre.

Ooooh right, thanks for the reminder!

Ludo’.
Tom Fitzhenry wrote on 19 Jun 2022 04:07
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 55848@debbugs.gnu.org)
875ykxcsnv.fsf@tom-fitzhenry.me.uk
Mathieu Othacehe <othacehe@gnu.org> writes:

Substitutes for aarch64 are a lot healthier now. Thanks Ludovic!

* kreuzberg is now successfully building and has been for a while.
* ci.guix.gnu.org has 41% of substitutes (a low percentage, but likely a
high percentage of toolchains). 0 jobs are queued, presumably because Cuirass
believes it's up to date. This should increase over time, as packages
are updated.
* bordeaux has 83.8% of substitutes.

A few issues remain for aarch64:

* grunewald and kreuzberg are not on https://ci.guix.gnu.org/workers.
Perhaps they were taken down while the substitute ratio was low to
avoid each worker independently recompiling expensive toolchains?
* rust@1.39.0 (and thus all of Rust) is missing from ci and bordeaux. I
had expected this would have been working. I'll take a look and raise
a separate issue.

$ ./pre-inst-env guix weather -s aarch64-linux -c2000
computing 15514 package derivations for aarch64-linux...
looking for 16265 store items on https://ci.guix.gnu.org...
https://ci.guix.gnu.org
41.0% substitutes available (6668 out of 16265)
at least 34188.1 MiB of nars (compressed)
45362.5 MiB on disk (uncompressed)
0.015 seconds per request (144.9 seconds in total)
66.2 requests per second

0.0% (0 out of 9597) of the missing items are queued
at least 1000 queued builds
aarch64-linux: 110 (11.0%)
powerpc64le-linux: 890 (89.0%)
build rate: 36.81 builds per hour
aarch64-linux: 17.23 builds per hour
x86_64-linux: 14.25 builds per hour
powerpc64le-linux: 1.01 builds per hour
i686-linux: 4.83 builds per hour
1871 packages are missing from 'https://ci.guix.gnu.org' for 'aarch64-linux', among which:
3479 rust@1.39.0 /gnu/store/xxlgndidxvhdd391k35vcmviixq5d9b0-rust-1.39.0-cargo /gnu/store/cfy1p8q4bwwy1i01cjfssfry21kpljz3-rust-1.39.0
2111 cairomm@1.14.2 /gnu/store/bxknxn3nbmmvavf537k0pggrynhrgsaf-cairomm-1.14.2-doc /gnu/store/3sn66mgr29v73zpp93c2v09a0rj87l3w-cairomm-1.14.2
2101 texlive-latex-pgf@59745 /gnu/store/l6jr7v8ygn3ybj4gxcwskf8ifsjcj6x1-texlive-latex-pgf-59745
looking for 16265 store items on https://bordeaux.guix.gnu.org...
https://bordeaux.guix.gnu.org
83.8% substitutes available (13624 out of 16265)
35138.6 MiB of nars (compressed)
109501.6 MiB on disk (uncompressed)
0.060 seconds per request (699.4 seconds in total)
16.7 requests per second
(continuous integration information unavailable)
579 packages are missing from 'https://bordeaux.guix.gnu.org' for 'aarch64-linux', among which:
3479 rust@1.39.0 /gnu/store/xxlgndidxvhdd391k35vcmviixq5d9b0-rust-1.39.0-cargo /gnu/store/cfy1p8q4bwwy1i01cjfssfry21kpljz3-rust-1.39.0
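
To compare the two servers at a glance, the coverage figures can be pulled out of a report like the one above. This is a minimal sketch that assumes the report format shown here; the `coverage` function name is made up for the example.

```shell
#!/bin/sh
# coverage: read `guix weather` output on stdin and print one
# "URL PERCENT" line per substitute server.
# Hypothetical helper that relies on the report layout shown above.
coverage() {
    awk '
        /looking for .* store items on / {
            url = $NF
            sub(/\.\.\.$/, "", url)   # strip the trailing "..."
        }
        /substitutes available/ {
            pct = $1
            sub(/%$/, "", pct)        # "41.0%" -> "41.0"
            print url, pct
        }
    '
}

# Assumed usage (the report is expected on standard output):
#   guix weather -s aarch64-linux | coverage
```

For the report above this would yield 41.0 for ci.guix.gnu.org and 83.8 for bordeaux.guix.gnu.org.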



> Hello,
>
> The aarch64 workers were all idle whereas 70k builds were
> available. Once restarted, they started building again.
>
> The problem might be that when the server is unavailable for a while the
> worker connections expire and cannot be resumed once the server is
> available again.
>
> Thanks,
>
> Mathieu
Maxim Cournoyer wrote on 20 Jun 2022 04:39
(name . Tom Fitzhenry)(address . tom@tom-fitzhenry.me.uk)
878rps83ec.fsf@gmail.com
Hi Mathieu!

[...]

> A few issues remain for aarch64:
>
> * grunewald and kreuzberg are not on <https://ci.guix.gnu.org/workers>.
> Perhaps they were taken down while the substitute ratio was low to
> avoid each worker independently recompiling expensive toolchains?
> * rust@1.39.0 (and thus all of Rust) is missing from ci and bordeaux. I
> had expected this would have been working. I'll take a look and raise
> a separate issue.

That's a known issue with mrustc; it only succeeds with x86_64; the
other architectures have problems. That's a bug the mrustc author would
like to fix, so perhaps in time it will improve (especially if
interested parties can lend a hand).

There was also an attempt to cross-compile a rust/cargo bootstrap seed
for other architectures (branch: wip-cross-built-rust) but due to
complications with building rust as a static archive (it relies on
dynamic linking for its macro expand crates), the effort stalled.

Thanks,

Maxim
Tom Fitzhenry wrote on 20 Jun 2022 04:44
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)
173fb399-9db1-40de-b8bc-662f1f1736d2@www.fastmail.com
On Mon, 20 Jun 2022, at 12:39 PM, Maxim Cournoyer wrote:
> That's a known issue with mrustc; it only succeeds with x86_64; the
> other architectures have problems. That's a bug the mrustc author would
> like to fix, so perhaps in time it will improve (especially if
> interested parties can lend a hand).

mrustc was fixed on aarch64 in https://issues.guix.gnu.org/54580 on staging, which was recently merged to master.

I had verified that mrustc and rust-1.39 compile on aarch64 on staging, but now I observe rust-1.39 failing.

I'll take a closer look, maybe I'm missing something.
Maxime Devos wrote on 20 Jun 2022 15:02
40c9de93c11d0b93a2df2b23ef6d1a4b56eeac0b.camel@telenet.be
Maxim Cournoyer schreef op zo 19-06-2022 om 22:39 [-0400]:
> There was also an attempt to cross-compile a rust/cargo bootstrap seed
> for other architectures (branch: wip-cross-built-rust) but due to
> complications with building rust as a static archive (it relies on
> dynamic linking for its macro expand crates), the effort stalled.

FWIW, has it been considered to cross-compile rust non-statically
(not as a seed, just as an input cross-compiled from another system)?
Doesn't help for people that cannot offload to x86_64 and don't have
substitutes from ci.guix.gnu.org or such enabled, but could still be an
improvement.

Greetings,
Maxime.


Maxim Cournoyer wrote on 21 Jun 2022 07:32
(name . Maxime Devos)(address . maximedevos@telenet.be)
87zgi67f9q.fsf@gmail.com
Hi Maxime,

Maxime Devos <maximedevos@telenet.be> writes:

> Maxim Cournoyer schreef op zo 19-06-2022 om 22:39 [-0400]:
>> There was also an attempt to cross-compile a rust/cargo bootstrap seed
>> for other architectures (branch: wip-cross-built-rust) but due to
>> complications with building rust as a static archive (it relies on
>> dynamic linking for its macro expand crates), the effort stalled.
>
> FWIW, has it been considered to cross-compile rust non-statically
> (not as a seed, just as an input cross-compiled from another system)?
> Doesn't help for people that cannot offload to x86_64 and don't have
> substitutes from ci.guix.gnu.org or such enabled, but could still be an
> improvement.

This already works, on the branch. One of the patches carried there
that made it possible has been merged upstream too. The issue is that
to offer a useful cross-compiled rust on non-x86_64 systems, you need to
move it from system domains; the clean way to do this is to archive a
static binary that depends on nothing else somewhere, and extract it in
a package for the target architecture.

Currently it's not cleanly self-contained because it still references
GCC libraries.

Maxim
Ludovic Courtès wrote on 10 Aug 2022 15:28
control message for bug #55848
(address . control@debbugs.gnu.org)
87iln01b4g.fsf@gnu.org
retitle 55848 AArch64 Honeycomb builders are inactive
quit
Ludovic Courtès wrote on 10 Aug 2022 15:46
AArch64 honeycomb machines aren’t building stuff
(name . Tom Fitzhenry)(address . tom@tom-fitzhenry.me.uk)
87wnbgyzwl.fsf_-_@gnu.org
Hi!

Ludovic Courtès <ludo@gnu.org> skribis:

> guix-daemon is configured to use the default substitute URLs,
> https://ci.guix.gnu.org and https://bordeaux.guix.gnu.org, which we know
> are unreachable.
>
> I’ve theoretically addressed this here:
>
> https://git.savannah.gnu.org/cgit/guix/maintenance.git/commit/?id=99bd9dc9001d6bea7480a7ce0e0e10ff78adb787
> https://git.savannah.gnu.org/cgit/guix/maintenance.git/commit/?id=b0661cc7d6dd74b0aeac3b052a80a8a2fef2af9c
>
> I tried to reconfigure those boxes with ‘guix deploy’, but this is
> currently on hold because ci.guix has run out of inodes…

Time passed and I had kinda forgotten about it, but the problem remains.

I’m currently reconfiguring pankow and grunewald⁰ from berlin with ‘guix
deploy’ to include the fix above¹, but it’s gonna take a while as it’s
currently building GCC…

To do that, I had to ‘herd stop guix-daemon’ (thereby stopping
‘cuirass-remote worker’ as well) and run guix-daemon by hand with
‘--substitute-urls=http://10.0.0.1’.
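
Spelled out as commands, that manual step might look roughly like the sketch below. The `manual_daemon` wrapper and its dry-run default are hypothetical, and the `guixbuild` build-users group is an assumption (the usual default), not something stated in this thread. By default it only prints the commands; setting RUN=1 executes them, which requires root and leaves guix-daemon running in the foreground.

```shell
#!/bin/sh
# manual_daemon URL: print (or, with RUN=1, execute) the manual restart
# described above. Hypothetical wrapper, for illustration only.
manual_daemon() {
    url=$1
    run() {
        # Dry run unless RUN=1 is set in the environment.
        if [ "${RUN:-0}" = 1 ]; then "$@"; else echo "$*"; fi
    }
    # Stopping guix-daemon also stops the Cuirass remote worker.
    run herd stop guix-daemon
    # "guixbuild" is an assumed group name, not taken from the thread.
    run guix-daemon --build-users-group=guixbuild "--substitute-urls=$url"
}

# Dry run with the address used above:
#   manual_daemon http://10.0.0.1
```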

While doing that with Guix 9e4632081ff31bf0d1715edd66f514614c6dc4bb, I
found another bug² (yup, it does look like an endless quest, even more
so that I’ll soon be going AFK and it’s not clear that things will be
settled by then!).

Cheers,
Ludo’, aka. el Quijote.

⁰ More on kreuzberg in a separate message…

¹ For the record, previously ‘guix deploy’ had a bug whereby running it
from an x86_64 box like berlin would lead it to send x86_64 binaries
(instead of AArch64 binaries) to the machines. This was fixed in
7046e777212233b89df68379c270b448c45195ce:

Ricardo Wurmus wrote on 10 Aug 2022 19:55
(name . Ludovic Courtès)(address . ludo@gnu.org)
87czd8ugl7.fsf@elephly.net
Hi,

Ludovic Courtès <ludo@gnu.org> writes:

> Ludovic Courtès <ludo@gnu.org> skribis:
>
>> guix-daemon is configured to use the default substitute URLs,
>> https://ci.guix.gnu.org and https://bordeaux.guix.gnu.org, which we know
>> are unreachable.
>>
>> I’ve theoretically addressed this here:
>>
>> https://git.savannah.gnu.org/cgit/guix/maintenance.git/commit/?id=99bd9dc9001d6bea7480a7ce0e0e10ff78adb787
>> https://git.savannah.gnu.org/cgit/guix/maintenance.git/commit/?id=b0661cc7d6dd74b0aeac3b052a80a8a2fef2af9c
>>
>> I tried to reconfigure those boxes with ‘guix deploy’, but this is
>> currently on hold because ci.guix has run out of inodes…
>
> Time passed and I had kinda forgotten about it, but the problem remains.

I wrote this earlier:

> They should be using the local IP instead of routing through the
> internet, so /etc/hosts should contain an entry for
>
> 141.80.167.131 ci.guix.gnu.org

So running the daemon with “--substitute-urls=http://10.0.0.1” should
not be necessary.

--
Ricardo
Ludovic Courtès wrote on 11 Aug 2022 16:06
Re: bug#55848: [cuirass] workers stalled
(name . Ricardo Wurmus)(address . rekado@elephly.net)
87h72iub65.fsf_-_@gnu.org
Hi,

Ricardo Wurmus <rekado@elephly.net> skribis:

> Ludovic Courtès <ludo@gnu.org> writes:
>
>> Ludovic Courtès <ludo@gnu.org> skribis:
>>
>>> guix-daemon is configured to use the default substitute URLs,
>>> https://ci.guix.gnu.org and https://bordeaux.guix.gnu.org, which we know
>>> are unreachable.
>>>
>>> I’ve theoretically addressed this here:
>>>
>>> https://git.savannah.gnu.org/cgit/guix/maintenance.git/commit/?id=99bd9dc9001d6bea7480a7ce0e0e10ff78adb787
>>> https://git.savannah.gnu.org/cgit/guix/maintenance.git/commit/?id=b0661cc7d6dd74b0aeac3b052a80a8a2fef2af9c
>>>
>>> I tried to reconfigure those boxes with ‘guix deploy’, but this is
>>> currently on hold because ci.guix has run out of inodes…
>>
>> Time passed and I had kinda forgotten about it, but the problem remains.
>
> I wrote this earlier:
>
>> They should be using the local IP instead of routing through the
>> internet, so /etc/hosts should contain an entry for
>>
>> 141.80.167.131 ci.guix.gnu.org
>
> So running the daemon with “--substitute-urls=http://10.0.0.1” should
> not be necessary.

Oh my bad, sorry for overlooking your message.

Explicitly going through http://10.0.0.1 is still desirable I think,
because that way we avoid HTTPS altogether.

‘guix deploy’ is still running on berlin.guix and building things;
unfortunately I’m going AFK for a bit. I’ll pick it up later unless
someone takes care of it by then.

Thanks,
Ludo’.
Ludovic Courtès wrote on 29 Aug 2022 15:30
Re: bug#55848: AArch64 Honeycomb builders are inactive
(address . 55848@debbugs.gnu.org)
874jxv6uu8.fsf_-_@gnu.org
Hello!

Ludovic Courtès <ludo@gnu.org> skribis:

> I’m currently reconfiguring pankow and grunewald⁰ from berlin with ‘guix
> deploy’ to include the fix above¹, but it’s gonna take a while as it’s
> currently building GCC…

That eventually succeeded and pankow is now reconfigured with the right
daemon settings. \o/

In the meantime, grunewald went off-line so I can’t tell if it’s
properly reconfigured, and kreuzberg is still running the old config I
believe (I cannot log in).

Ricardo, do you have access to these two?

Cheers,
Ludo’.