[version 1.4.0rc1] Retrying a failed install fails

  • Done
  • quality assurance status badge
Details
3 participants
  • Ludovic Courtès
  • Maxime Devos
  • pelzflorian (Florian Pelz)
Owner
unassigned
Submitted by
pelzflorian (Florian Pelz)
Severity
normal
P
P
pelzflorian (Florian Pelz) wrote on 2 Dec 2022 18:52
(address . bug-guix@gnu.org)
87y1rpy9p4.fsf@pelzflorian.de
I aborted graphical system installation (Ctrl-C), retried the
installation and got this:

shepherd: Service guix-daemon has been stopped.
shepherd: Service guix-daemon has been started.
guix system: Fehler: opening file `/gnu/store/4z81a7njyvnwa4kn46ad6vhvi0lcnrhh-shadow-4.9.drv': No such file or directory
Befehl ("guix" "system" "init" "--fallback" "/mnt/etc/config.scm" "/mnt") hat mit Exit-Code 1 geendet
Drücken Sie die Eingabetaste, um fortzufahren.

(It told me to press Enter to continue.) I did so; retried; but again
it did not really retry the installation, I always get this same error
message.

Sorry in case this is a duplicate bug.

Regards,
Florian
L
L
Ludovic Courtès wrote on 9 Dec 2022 10:42
(name . pelzflorian (Florian Pelz))(address . pelzflorian@pelzflorian.de)(address . 59784@debbugs.gnu.org)
87mt7wap6v.fsf@gnu.org
Hi,

"pelzflorian (Florian Pelz)" <pelzflorian@pelzflorian.de> skribis:

Toggle quote (13 lines)
> I aborted graphical system installation (Ctrl-C), retried the
> installation and got this:
>
> shepherd: Service guix-daemon has been stopped.
> shepherd: Service guix-daemon has been started.
> guix system: Fehler: opening file `/gnu/store/4z81a7njyvnwa4kn46ad6vhvi0lcnrhh-shadow-4.9.drv': No such file or directory
> Befehl ("guix" "system" "init" "--fallback" "/mnt/etc/config.scm" "/mnt") hat mit Exit-Code 1 geendet
> Drücken Sie die Eingabetaste, um fortzufahren.
>
> (It told me to press Enter to continue.) I did so; retried; but again
> it did not really retry the installation, I always get this same error
> message.

Related to that, I found this old bug:


I tried to reproduce it:

0. I chose a basic installation to a fully-encrypted disk with a
single partition.

1. I hit Ctrl-C while ‘guix system init’ was downloading substitutes.

2. That led me to a confusing error screen says “Command cryptsetup
failed” with Ignore/Abort/Retry buttons. This should have been
“Command guix system init” failed no?

3. I resumed starting with the “Configuration File” step, and there
‘guix system init’ ran to completion just fine.

Maybe the difference is that you hit Ctrl-C when ‘guix system init’ had
already started copying stuff to /mnt?

Thanks,
Ludo’.
L
L
Ludovic Courtès wrote on 9 Dec 2022 12:11
(name . pelzflorian (Florian Pelz))(address . pelzflorian@pelzflorian.de)(address . 59784@debbugs.gnu.org)
878rjgal1s.fsf@gnu.org
Ludovic Courtès <ludo@gnu.org> skribis:

Toggle quote (3 lines)
> 2. That led me to a confusing error screen says “Command cryptsetup
> failed” with Ignore/Abort/Retry buttons.

Actually it’s “External command ("cryptsetup" "close" "cryptroot")
exited with code 5” and “cryptroot device is busy”.

Ludo’.
P
P
pelzflorian (Florian Pelz) wrote on 10 Dec 2022 09:39
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 59784@debbugs.gnu.org)
87y1rf649x.fsf@pelzflorian.de
Ludovic Courtès <ludo@gnu.org> writes:
Toggle quote (14 lines)
> I tried to reproduce it:
>
> 0. I chose a basic installation to a fully-encrypted disk with a
> single partition.
>
> 1. I hit Ctrl-C while ‘guix system init’ was downloading substitutes.
>
> 2. That led me to a confusing error screen says “Command cryptsetup
> failed” with Ignore/Abort/Retry buttons. This should have been
> “Command guix system init” failed no?
>
> 3. I resumed starting with the “Configuration File” step, and there
> ‘guix system init’ ran to completion just fine.

Yes, these were the steps, except I did not do encryption. But I had
not told the whole story … Sorry!

So what was missing is that the reason I pressed Ctrl-C was a rare
dropout by my Ethernet controller. Because it is so rare and has not
happened anymore since, as a substitute, for reproducing, I did as
follows:

0. Use Ethernet for the installation.

1. During substitute downloading, pull the Ethernet plug.

2. Get lucky so the installation will crash with an error and not just
pause. Otherwise, if no crash, repeat.

3. Press Ctrl-C.

4. Resume the installation from the last step.

5. It will fail now.

I now uploaded an installer-dump-bade9971 of me reproducing the issue.

Toggle quote (3 lines)
> Maybe the difference is that you hit Ctrl-C when ‘guix system init’ had
> already started copying stuff to /mnt?

No, like you, I was in the substitute downloading step.

This issue is much rarer than I thought.

Thank you for investigating.

Regards,
Florian
P
P
pelzflorian (Florian Pelz) wrote on 12 Dec 2022 13:07
(address . 59784@debbugs.gnu.org)
87edt4su3y.fsf@pelzflorian.de
"pelzflorian (Florian Pelz)" <pelzflorian@pelzflorian.de> writes:
Toggle quote (7 lines)
> shepherd: Service guix-daemon has been stopped.
> shepherd: Service guix-daemon has been started.
> guix system: Fehler: opening file
> `/gnu/store/4z81a7njyvnwa4kn46ad6vhvi0lcnrhh-shadow-4.9.drv': No such
> file or directory
> Befehl ("guix" "system" "init" "--fallback" "/mnt/etc/config.scm" "/mnt") hat mit Exit-Code 1 geendet

Still happens with 1.4.0rc2. I guess install-system in
gnu/installer/final.scm does not sync the disk on failure?

Regards,
Florian
L
L
Ludovic Courtès wrote on 13 Dec 2022 10:40
(name . pelzflorian (Florian Pelz))(address . pelzflorian@pelzflorian.de)
87pmcny73z.fsf@gnu.org
Hi,

"pelzflorian (Florian Pelz)" <pelzflorian@pelzflorian.de> skribis:

Toggle quote (2 lines)
> I now uploaded an installer-dump-bade9971 of me reproducing the issue.

Here’s the relevant syslog excerpt (this was with 1.4.0rc1) where we can
see the point where you unplugged the Ethernet connection:

Toggle snippet (45 lines)
Dec 10 09:07:29 localhost installer[399]: running command ("guix" "system" "init" "--fallback" "/mnt/etc/config.scm" "/mnt")
Dec 10 09:07:48 localhost installer[399]: ^[[1m10.3 MB will be downloaded^M
Dec 10 09:07:49 localhost installer[399]: ^[[0m^M^[[K^M^[[K utf8proc-2.5.0 52KiB 716KiB/s 00:00 [##################] 100.0%^M^[[K utf8proc-2.5.0 52KiB 594KiB/s 00:00 [##################] 100.0%^M

[...]

Dec 10 09:08:48 localhost installer[399]: ^[[0m^M^[[Kretrying download of '/gnu/store/8zigz7afvz2rjrvrh7zq1d389qbl2684-dbus-1.12.20' with other substitute URLs...^M
Dec 10 09:08:48 localhost installer[399]: guix substitute: warning: bordeaux.guix.gnu.org: host not found: Name or service not known^M
Dec 10 09:08:48 localhost installer[399]: guix substitute: error: failed to find alternative substitute for '/gnu/store/8zigz7afvz2rjrvrh7zq1d389qbl2684-dbus-1.12.20'^M
Dec 10 09:08:48 localhost installer[399]: ^[[31;1msubstitution of /gnu/store/8zigz7afvz2rjrvrh7zq1d389qbl2684-dbus-1.12.20 failed^[[0m^M
Dec 10 09:08:49 localhost installer[399]: ^M^[[K^M^[[Kretrying download of '/gnu/store/mzfkrxd4w8vqrmyrx169wj8wyw7r8i37-bash' with other substitute URLs...^M
Dec 10 09:08:49 localhost installer[399]: guix substitute: warning: bordeaux.guix.gnu.org: host not found: Name or service not known^M
Dec 10 09:08:49 localhost installer[399]: guix substitute: error: failed to find alternative substitute for '/gnu/store/mzfkrxd4w8vqrmyrx169wj8wyw7r8i37-bash'^M
Dec 10 09:08:49 localhost installer[399]: ^[[31;1msubstitution of /gnu/store/mzfkrxd4w8vqrmyrx169wj8wyw7r8i37-bash failed^[[0m^M
Dec 10 09:08:49 localhost installer[399]: guix system: ^[[1;31merror: ^[[0mcorrupt input while restoring archive from #<closed: file 7fa02f84d4d0>^M
Dec 10 09:08:49 localhost installer[399]: command ("guix" "system" "init" "--fallback" "/mnt/etc/config.scm" "/mnt") exited with value 1
Dec 10 09:08:58 localhost vmunix: [ 1220.571986] r8169 0000:02:00.0 enp2s0: Link is Up - 1Gbps/Full - flow control off

[...]

Dec 10 09:09:12 localhost shepherd[1]: Service guix-daemon has been stopped.
Dec 10 09:09:12 localhost shepherd[1]: Service guix-daemon has been started.
Dec 10 09:09:17 localhost installer[274]: unmounting "/mnt/"
Dec 10 09:09:17 localhost vmunix: [ 1239.111442] EXT4-fs (sda3): unmounting filesystem.
Dec 10 09:09:19 localhost installer[274]: running form #<newt-form 2499c90> ("Installation menu") with 0 clients
Dec 10 09:09:22 localhost installer[274]: running step 'final'
Dec 10 09:09:22 localhost installer[274]: proceeding with final step
Dec 10 09:09:23 localhost installer[274]: mounting "/dev/sda3" on "/mnt/"
Dec 10 09:09:23 localhost vmunix: [ 1245.890840] EXT4-fs (sda3): mounted filesystem with ordered data mode. Quota mode: none.
Dec 10 09:09:23 localhost vmunix: [ 1245.893304] Adding 3905532k swap on /dev/sda2. Priority:-2 extents:1 across:3905532k SSFS
Dec 10 09:09:23 localhost installer[274]: running form #<newt-form 248c440> ("Configuration file") with 0 clients
Dec 10 09:09:29 localhost installer[437]: install supported locale en_US.utf8.
Dec 10 09:09:29 localhost shepherd[1]: Service guix-daemon has been stopped.
Dec 10 09:09:29 localhost shepherd[1]: Service guix-daemon has been started.
Dec 10 09:09:29 localhost installer[437]: running command ("guix" "system" "init" "--fallback" "/mnt/etc/config.scm" "/mnt")
Dec 10 09:09:54 localhost installer[437]: ^[[1m60.8 MB will be downloaded^M
Dec 10 09:09:54 localhost installer[437]: ^[[0mguix system: ^[[1;31merror: ^[[0mopening file `/gnu/store/igxf1b1l2b19h7mx2s6r117270dbi6iq-guix-1.4.0rc1.drv': No such file or directory^M
Dec 10 09:09:54 localhost installer[437]: command ("guix" "system" "init" "--fallback" "/mnt/etc/config.scm" "/mnt") exited with value 1
Dec 10 09:10:21 localhost shepherd[1]: Service guix-daemon has been stopped.
Dec 10 09:10:21 localhost shepherd[1]: Service guix-daemon has been started.
Dec 10 09:10:21 localhost installer[274]: unmounting "/mnt/"
Dec 10 09:10:21 localhost vmunix: [ 1303.398583] EXT4-fs (sda3): unmounting filesystem.
Dec 10 09:10:28 localhost installer[274]: crashing due to uncaught exception: %exception (#<&user-abort-error>)

It looks like the store is in a broken state, with its database not
matching its actual contents. The ‘install-system’ procedure is
supposed to protect against that by making a backup of the database
before starting the installation and restoring it afterwards. (It
apparently worked for me when I interrupted ‘guix system init’ by
hitting C-c.)

I wonder how that failed here. Mathieu, ideas?

Thanks,
Ludo’.
L
L
Ludovic Courtès wrote on 13 Dec 2022 10:48
(name . pelzflorian (Florian Pelz))(address . pelzflorian@pelzflorian.de)
87k02vy6pw.fsf@gnu.org
Hi again,

Ludovic Courtès <ludo@gnu.org> skribis:

Toggle quote (7 lines)
> It looks like the store is in a broken state, with its database not
> matching its actual contents. The ‘install-system’ procedure is
> supposed to protect against that by making a backup of the database
> before starting the installation and restoring it afterwards. (It
> apparently worked for me when I interrupted ‘guix system init’ by
> hitting C-c.)

Actually, look at the excerpt from final.scm:

;; Restart guix-daemon so that it does no keep the MNT namespace
;; alive.
(restart-service 'guix-daemon)
(copy-file saved-database database-file)

We’re restarting the daemon *before* we have restored the database,
which is wrong: depending on how lucky you are, guix-daemon might load
the old database (all this depends on what exactly happens when sqlite
opens the database, but I think there’s a possibility that it will load
or cache a few things and thus fail to see the changes ‘copy-file’
introduces.)

So my guess is that things will be much better if we swap these two
lines.

Florian, it would be great if you could try that and run a new image
generated version ‘version-1.4.0’ with these two lines changed. To
produce the image, run:

./pre-inst-env guix system image -t iso9660 --label=Guix \
gnu/system/install.scm

Ludo’.
P
P
pelzflorian (Florian Pelz) wrote on 13 Dec 2022 23:22
(name . Ludovic Courtès)(address . ludo@gnu.org)
87ilifeyg1.fsf@pelzflorian.de
Hi again.

Ludovic Courtès <ludo@gnu.org> writes:
Toggle quote (3 lines)
> So my guess is that things will be much better if we swap these two
> lines.

This was helpful, but not enough.

Swapping them may have improved the likelihood of being able to retry,
but the issue is still there. I uploaded as installer-dump-5f9f8dbe,
but it is pretty much the same as the previous dump.

Tomorrow, I will try to add an fsync call in between the two lines.

Toggle quote (3 lines)
> ./pre-inst-env guix system image -t iso9660 --label=Guix \
> gnu/system/install.scm

Additionally, I had to do “GUIX_ALLOW_ME_TO_USE_PRIVATE_COMMIT=y
make update-guix-package”. Or else the installer was using a Guix that
did not have the lines swapped.

Also before I did the GPG authorization dance (my x86 machine isn’t
worth getting my actual commiter GPG keys, so I make sure its dummy GPG
key is in the keyring branch, .guix-authorizations file, that
guix/channels.scm’s default guix channel points to the url
/home/florian/src/guix and to the commit with the new authorization).
Then I guix pulled. So that building the installer succeeds. I did
*not* use ./pre-inst-env.

Regards,
Florian
L
L
Ludovic Courtès wrote on 14 Dec 2022 00:16
(name . pelzflorian (Florian Pelz))(address . pelzflorian@pelzflorian.de)
87h6xyvqr6.fsf@gnu.org
"pelzflorian (Florian Pelz)" <pelzflorian@pelzflorian.de> skribis:

Toggle quote (6 lines)
> Ludovic Courtès <ludo@gnu.org> writes:
>> So my guess is that things will be much better if we swap these two
>> lines.
>
> This was helpful, but not enough.

Sorry, I think I wasn’t thinking at full speed. There needs to be zero
daemons running while we copy the database. So the real fix is more
like this:
Toggle diff (20 lines)
diff --git a/gnu/installer/final.scm b/gnu/installer/final.scm
index 044f79372b..9a6bbad122 100644
--- a/gnu/installer/final.scm
+++ b/gnu/installer/final.scm
@@ -213,10 +213,13 @@ (define (assert-exit x)
(set! ret (run-command install-command #:tty? #t)))
(lambda ()
- ;; Restart guix-daemon so that it does no keep the MNT namespace
+ ;; Stop guix-daemon so that it does no keep the MNT namespace
;; alive.
- (restart-service 'guix-daemon)
+ (stop-service 'guix-daemon)
+
+ ;; Restore the database and restart it.
(copy-file saved-database database-file)
+ (start-service 'guix-daemon)
;; Finally umount the cow-store and exit the container.
(unmount-cow-store (%installer-target-dir) backing-directory)
Toggle quote (7 lines)
>> ./pre-inst-env guix system image -t iso9660 --label=Guix \
>> gnu/system/install.scm
>
> Additionally, I had to do “GUIX_ALLOW_ME_TO_USE_PRIVATE_COMMIT=y
> make update-guix-package”. Or else the installer was using a Guix that
> did not have the lines swapped.

Hmm this is surprising because we’re already using (current-guix) in
(gnu installer).

Toggle quote (8 lines)
> Also before I did the GPG authorization dance (my x86 machine isn’t
> worth getting my actual commiter GPG keys, so I make sure its dummy GPG
> key is in the keyring branch, .guix-authorizations file, that
> guix/channels.scm’s default guix channel points to the url
> /home/florian/src/guix and to the commit with the new authorization).
> Then I guix pulled. So that building the installer succeeds. I did
> *not* use ./pre-inst-env.

Ah yes, apologies. You should be able to disable authentication with
this:
Toggle diff (12 lines)
diff --git a/gnu/packages/package-management.scm b/gnu/packages/package-management.scm
index 5a09b1fcf8..374b187d8c 100644
--- a/gnu/packages/package-management.scm
+++ b/gnu/packages/package-management.scm
@@ -625,6 +625,7 @@ (define-public current-guix-package
(inherit guix)
(source source)
(build-system channel-build-system)
+ (arguments '(#:authenticate? #f))
(inputs '())
(native-inputs '())
(propagated-inputs '())))
Thanks a lot for patiently testing, this is very helpful!

Ludo’.
P
P
pelzflorian (Florian Pelz) wrote on 14 Dec 2022 14:36
(name . Ludovic Courtès)(address . ludo@gnu.org)
878rja150v.fsf@pelzflorian.de
Eventual success, partially.

First of all:

Ludovic Courtès <ludo@gnu.org> writes:
Toggle quote (7 lines)
> "pelzflorian (Florian Pelz)" <pelzflorian@pelzflorian.de> skribis:
>> Additionally, I had to do “GUIX_ALLOW_ME_TO_USE_PRIVATE_COMMIT=y
>> make update-guix-package”. Or else the installer was using a Guix that
>> did not have the lines swapped.
> Hmm this is surprising because we’re already using (current-guix) in
> (gnu installer).

Apparently no. If I commit only those two diffs from your mail, with
`./pre-inst-env guix system image -t iso9660 --label=Guix
gnu/system/install.scm`, then

guix gc --references /gnu/store/*-installer-real

prints a Guix package that does not contain any of the changes to
gnu/installer/final.scm.

Nonetheless I used it and ran the installer with surprising failures
that make me doubt either the health of my USB drive: `guix system
init --fallback` did not download substitutes but said ACL seems to be
uninitialized and fell back to downloading/building the tar.xz
sources. I pulled the Ethernet plug, resumed the installer to run
`guix system init` again, but this now complains that nss-certs is an
unknown package. Sending a dump crashed the installer. On TTY3, `ls
/tmp` tells me '-bash: ls: command not found'.

Another USB drive, another try, the installer again says there's no
ACL and downloads tar.xz, but otherwise behaves as rc2 and sometimes
bugs out when pulling Ethernet; final.scm does not contain the patch.

Is that second diff of yours perhaps really about ACLs?

I do the authorization dance, commit the diff about 'stop-service' and
the update-guix-package, then pull --branch=version-1.4.0. I can now
resume happily, when pulling the Ethernet and even when pressing
Ctrl-C just for fun.

Except it is necessary to resume twice. The first resume always fails
and the second resume resumes. Does it confuse the two databases?

Except after a large number of resumes, not even the second resume
resumes anymore. I sent a installer-dump-c82c7abf.

I shall try with fsync now.

Regards,
Florian
P
P
pelzflorian (Florian Pelz) wrote on 14 Dec 2022 22:47
(name . Ludovic Courtès)(address . ludo@gnu.org)
87h6xxpsil.fsf@pelzflorian.de
"pelzflorian (Florian Pelz)" <pelzflorian@pelzflorian.de> writes:
Toggle quote (2 lines)
> I shall try with fsync now.

fsyncing the database had no effect. (In addition to Ludo’s
'stop-service', I had done
Toggle diff (22 lines)
diff --git a/gnu/installer/final.scm b/gnu/installer/final.scm
index ef487805f0..13deffef85 100644
--- a/gnu/installer/final.scm
+++ b/gnu/installer/final.scm
@@ -217,8 +217,16 @@ (define (assert-exit x)
;; alive.
(stop-service 'guix-daemon)
- ;; Restore the database and restart it.
+ ;; Restore the database.
(copy-file saved-database database-file)
+
+ ;; Sync it to the filesystem.
+ (let* ((flags O_RDONLY)
+ (fd (open database-file flags)))
+ (fsync fd)
+ (close fd))
+
+ ;; And restart guix-daemon.
(start-service 'guix-daemon)
;; Finally umount the cow-store and exit the container.
The same two problems:

* If I resume a crashed installer, I need to resume twice because the
first resume always fails immediately.

* With bad luck, it permanently fails, even a second, third, fourth,
fifth time fail.

This is the same as without the fsync. Fsync had no effect. Still I
uploaded installer-dump-194618fa.

Regards,
Florian
L
L
Ludovic Courtès wrote on 15 Dec 2022 00:50
(name . pelzflorian (Florian Pelz))(address . pelzflorian@pelzflorian.de)
87sfhhr1du.fsf@gnu.org
Grrr, I’m really silly: we have the same problem (copying the database
before the daemon has been stopped) just a few lines above.

How about this:
Toggle diff (47 lines)
diff --git a/gnu/installer/final.scm b/gnu/installer/final.scm
index 044f79372b..360b34d8cb 100644
--- a/gnu/installer/final.scm
+++ b/gnu/installer/final.scm
@@ -1,6 +1,6 @@
;;; GNU Guix --- Functional package management for GNU
;;; Copyright © 2018, 2020 Mathieu Othacehe <m.othacehe@gmail.com>
-;;; Copyright © 2019, 2020 Ludovic Courtès <ludo@gnu.org>
+;;; Copyright © 2019, 2020, 2022 Ludovic Courtès <ludo@gnu.org>
;;;
;;; This file is part of GNU Guix.
;;;
@@ -196,14 +196,15 @@ (define (assert-exit x)
;; the loaded cow-store locale files will prevent umounting.
(install-locale locale)
- ;; Save the database, so that it can be restored once the
- ;; cow-store is umounted.
+ ;; Stop the daemon and save the database, so that it can be
+ ;; restored once the cow-store is umounted.
+ (stop-service 'guix-daemon)
(copy-file database-file saved-database)
+
(mount-cow-store (%installer-target-dir) backing-directory))
(lambda ()
;; We need to drag the guix-daemon to the container MNT
;; namespace, so that it can operate on the cow-store.
- (stop-service 'guix-daemon)
(start-service 'guix-daemon (list (number->string (getpid))))
(setvbuf (current-output-port) 'none)
@@ -213,10 +214,13 @@ (define (assert-exit x)
(set! ret (run-command install-command #:tty? #t)))
(lambda ()
- ;; Restart guix-daemon so that it does no keep the MNT namespace
+ ;; Stop guix-daemon so that it does no keep the MNT namespace
;; alive.
- (restart-service 'guix-daemon)
+ (stop-service 'guix-daemon)
+
+ ;; Restore the database and restart it.
(copy-file saved-database database-file)
+ (start-service 'guix-daemon)
;; Finally umount the cow-store and exit the container.
(unmount-cow-store (%installer-target-dir) backing-directory)
?

This time, I believe we only ever copy the database when we’re sure no
guix-daemon process is accessing it.

Ludo’.
P
P
pelzflorian (Florian Pelz) wrote on 15 Dec 2022 18:46
(name . Ludovic Courtès)(address . ludo@gnu.org)
871qp05zmf.fsf@pelzflorian.de
Hi Ludo…

Ludovic Courtès <ludo@gnu.org> writes:
Toggle quote (3 lines)
> This time, I believe we only ever copy the database when we’re sure no
> guix-daemon process is accessing it.

Failure. In addition to your partially helpful patch from before
(with which a second resume now works most of the time), I now tried
further the new change:

Toggle diff (37 lines)
diff --git a/gnu/installer/final.scm b/gnu/installer/final.scm
index 044f79372b..360b34d8cb 100644
--- a/gnu/installer/final.scm
+++ b/gnu/installer/final.scm
@@ -196,14 +196,15 @@ (define (assert-exit x)
;; the loaded cow-store locale files will prevent umounting.
(install-locale locale)

- ;; Save the database, so that it can be restored once the
- ;; cow-store is umounted.
+ ;; Stop the daemon and save the database, so that it can be
+ ;; restored once the cow-store is umounted.
+ (stop-service 'guix-daemon)
(copy-file database-file saved-database)
+
(mount-cow-store (%installer-target-dir) backing-directory))
(lambda ()
;; We need to drag the guix-daemon to the container MNT
;; namespace, so that it can operate on the cow-store.
- (stop-service 'guix-daemon)
(start-service 'guix-daemon (list (number->string (getpid))))

(setvbuf (current-output-port) 'none)


No additional effect. :( Perhaps at that time, the guix-daemon isnt
doing anything anyway (though the addition makes sense in general and
may help some users). There are the same two problems, needing to
resume twice each time and eventually not being able to resume at all
(perhaps some multi-core issue?). I sent installer-dump-89be04d5.

I tried interrupting the Ethernet on the same machine but with an
installed 1.4.0rc2 Guix System during `guix system reconfigure`.
This has no issues… There must be corruption in the installer.

Regards,
Florian
P
P
pelzflorian (Florian Pelz) wrote on 15 Dec 2022 21:44
(name . Ludovic Courtès)(address . ludo@gnu.org)
87v8mcz9ai.fsf@pelzflorian.de
Desperately I tried also adding fsync, to no avail, both issues remain.
Non-working patch attached.

Maybe dynamic-wind is an inappropriate pattern here?

If I interrupt installation using Ctrl-C (which I normally don’t,
instead I unplug Ethernet), then I have to press Ctrl-C twice. Maybe
that could be related to why I need to resume twice?

I’m in the dark.

Regards,
Florian
does not help

Toggle diff (35 lines)
diff --git a/gnu/installer/final.scm b/gnu/installer/final.scm
index 5f720f6641..f5935a29c9 100644
--- a/gnu/installer/final.scm
+++ b/gnu/installer/final.scm
@@ -201,6 +201,12 @@ (define (assert-exit x)
(stop-service 'guix-daemon)
(copy-file database-file saved-database)
+ ;; Sync it to the filesystem.
+ (let* ((flags O_RDONLY)
+ (fd (open saved-database flags)))
+ (fsync fd)
+ (close fd))
+
(mount-cow-store (%installer-target-dir) backing-directory))
(lambda ()
;; We need to drag the guix-daemon to the container MNT
@@ -218,8 +224,16 @@ (define (assert-exit x)
;; alive.
(stop-service 'guix-daemon)
- ;; Restore the database and restart it.
+ ;; Restore the database.
(copy-file saved-database database-file)
+
+ ;; Sync it to the filesystem.
+ (let* ((flags O_RDONLY)
+ (fd (open database-file flags)))
+ (fsync fd)
+ (close fd))
+
+ ;; And restart guix-daemon.
(start-service 'guix-daemon)
;; Finally umount the cow-store and exit the container.
M
M
Maxime Devos wrote on 16 Dec 2022 14:55
229fab0a-644d-a69a-356e-0c5f074d978a@telenet.be
On 14-12-2022 22:47, pelzflorian (Florian Pelz) wrote:
Toggle quote (24 lines)
> fsyncing the database had no effect. (In addition to Ludo’s
> 'stop-service', I had done
>
>
> fsync.patch
>
> diff --git a/gnu/installer/final.scm b/gnu/installer/final.scm
> index ef487805f0..13deffef85 100644
> --- a/gnu/installer/final.scm
> +++ b/gnu/installer/final.scm
> @@ -217,8 +217,16 @@ (define (assert-exit x)
> ;; alive.
> (stop-service 'guix-daemon)
>
> - ;; Restore the database and restart it.
> + ;; Restore the database.
> (copy-file saved-database database-file)
> +
> + ;; Sync it to the filesystem.
> + (let* ((flags O_RDONLY)
> + (fd (open database-file flags)))
> + (fsync fd)
> + (close fd))
> +
So, I'm nominally 'on hiatus', but I noticed this mail, and noticed you
copied a file (and fsync'ed it), but forgot to fsync the directory it
was copied to -- from what I've read (but I don't recall the source),
fsyncing the contents of the file isn't enough, you also need to fsync
the directory such that the new file entry is in the directory after
crashing.
Greetings,
Maxime.
Attachment: OpenPGP_signature
L
L
Ludovic Courtès wrote on 16 Dec 2022 17:57
(name . pelzflorian (Florian Pelz))(address . pelzflorian@pelzflorian.de)
87pmcji8wn.fsf@gnu.org
Hi,

"pelzflorian (Florian Pelz)" <pelzflorian@pelzflorian.de> skribis:

Toggle quote (9 lines)
> Desperately I tried also adding fsync, to no avail, both issues remain.
> Non-working patch attached.
>
> Maybe dynamic-wind is an inappropriate pattern here?
>
> If I interrupt installation using Ctrl-C (which I normally don’t,
> instead I unplug Ethernet), then I have to press Ctrl-C twice. Maybe
> that could be related to why I need to resume twice?

One finding: when hitting C-c, the dynamic-wind exit handler (the one
that restores the database and umounts the cow store) is *not* executed.

This is because ‘call-with-mnt-container’ sets a SIGINT handler that
terminates that process with SIGKILL (I’m not entirely sure of the
rationale, but said process cannot handle signals in Scheme while it’s
in ‘waitpid’, called from ‘run-command’).

I did reproduce the issue in a VM by running “ifconfig ens3 down” in a
tty, or by killing the ‘guix substitute’ process, to cause failure of
‘guix system init’. In that case the database is indeed restored, but I
occasionally get errors like “/gnu/store/….drv: No such file or
directory”.

Ludo’.
P
P
pelzflorian (Florian Pelz) wrote on 16 Dec 2022 21:17
(name . Maxime Devos)(address . maximedevos@telenet.be)
875yeb85nd.fsf@pelzflorian.de
Maxime Devos <maximedevos@telenet.be> writes:
Toggle quote (7 lines)
> So, I'm nominally 'on hiatus', but I noticed this mail, and noticed
> you copied a file (and fsync'ed it), but forgot to fsync the directory
> it was copied to -- from what I've read (but I don't recall the
> source), fsyncing the contents of the file isn't enough, you also need
> to fsync the directory such that the new file entry is in the
> directory after crashing.

Ohh indeed! The Linux manpage on fsync confirms it. That invalidates
my fsync testing. Which was on a codepath that, as Ludo found out, did
not even run. But I will remember to fsync the directory in the future.

Thank you very much Maxime!

Regards,
Florian
P
P
pelzflorian (Florian Pelz) wrote on 16 Dec 2022 21:28
(name . Ludovic Courtès)(address . ludo@gnu.org)
871qoz855e.fsf@pelzflorian.de
Ludovic Courtès <ludo@gnu.org> writes:
Toggle quote (3 lines)
> One finding: when hitting C-c, the dynamic-wind exit handler (the one
> that restores the database and umounts the cow store) is *not* executed.

Impressive findings.

Now that you found the dynamic-wind’s out-guard does not even run: Uhh I
had misdiagnosed when I thought your 'stop-service' patch had made a
difference and caused a second resume to work. Second resume was
already possible on rc2. Except eventually resume stops working and on
some install attempts with rc2, resume stops working right away.

After seeing that you opened a bug#60116 on setsid(), I tested removing
the setsid call and it had no effect, but if the dynamic-wind’s
out-guard does not even run, that is to be expected.


Toggle quote (6 lines)
> I did reproduce the issue in a VM by running “ifconfig ens3 down” in a
> tty, or by killing the ‘guix substitute’ process, to cause failure of
> ‘guix system init’. In that case the database is indeed restored, but I
> occasionally get errors like “/gnu/store/….drv: No such file or
> directory”.

Yes, this is the error message that I get on failing resumes.

Regards,
Florian
L
L
Ludovic Courtès wrote on 17 Dec 2022 12:01
(name . pelzflorian (Florian Pelz))(address . pelzflorian@pelzflorian.de)
87k02qi99r.fsf@gnu.org
Moin!

"pelzflorian (Florian Pelz)" <pelzflorian@pelzflorian.de> skribis:

Toggle quote (8 lines)
> Ludovic Courtès <ludo@gnu.org> writes:
>> One finding: when hitting C-c, the dynamic-wind exit handler (the one
>> that restores the database and umounts the cow store) is *not* executed.
>
> Impressive findings.
>
> Now that you found the dynamic-wind’s out-guard does not even run:

It does not run on C-c, but it does run in other cases, typically if you
just press Enter after reading the message that says “command failed,
press Enter”.

I don’t see how to address the C-c issue so we’ll have to live with it.

Longer-term we may have to find a different strategy than the
‘call-with-mnt-container’ trick, but that’s difficult.

Toggle quote (4 lines)
> After seeing that you opened a bug#60116 on setsid(), I tested removing
> the setsid call and it had no effect, but if the dynamic-wind’s
> out-guard does not even run, that is to be expected.

Right; #60116 is related, and it’s not great but it’s not critical.

Ludo’.
L
L
Ludovic Courtès wrote on 17 Dec 2022 17:15
(name . pelzflorian (Florian Pelz))(address . pelzflorian@pelzflorian.de)
87v8magg6e.fsf@gnu.org
Ludovic Courtès <ludo@gnu.org> skribis:

Toggle quote (6 lines)
> I did reproduce the issue in a VM by running “ifconfig ens3 down” in a
> tty, or by killing the ‘guix substitute’ process, to cause failure of
> ‘guix system init’. In that case the database is indeed restored, but I
> occasionally get errors like “/gnu/store/….drv: No such file or
> directory”.

The error message that’s haunting us:

opening file `/gnu/store/….drv': No such file or directory

comes from guix-daemon. It happens while the client is doing an
‘add-text-to-store’ RPC to add that .drv to the store.
‘LocalStore::addTextToStore’ supposedly creates the .drv file in
/gnu/store and then reads it back (‘registerValidPath’ -> ‘addValidPath’
-> ‘readDerivation’ -> ‘readFile’): this is where it gets ENOENT.

It would suggest that the database is consistent, but that somehow
writes don’t go through the overlay FS.

More investigation is needed, but we may have to live with this bug in
1.4.0.

Ludo’.
P
P
pelzflorian (Florian Pelz) wrote on 17 Dec 2022 20:27
(name . Ludovic Courtès)(address . ludo@gnu.org)
87zgblhlu8.fsf@pelzflorian.de
Ludovic Courtès <ludo@gnu.org> writes:
Toggle quote (13 lines)
> The error message that’s haunting us:
>
> opening file `/gnu/store/….drv': No such file or directory
>
> comes from guix-daemon. It happens while the client is doing an
> ‘add-text-to-store’ RPC to add that .drv to the store.
> ‘LocalStore::addTextToStore’ supposedly creates the .drv file in
> /gnu/store and then reads it back (‘registerValidPath’ -> ‘addValidPath’
> -> ‘readDerivation’ -> ‘readFile’): this is where it gets ENOENT.
>
> It would suggest that the database is consistent, but that somehow
> writes don’t go through the overlay FS.

Most interesting.

I saw a comment
Toggle quote (7 lines)
> void LocalStore::registerValidPaths(const ValidPathInfos & infos)
> {
> /* SQLite will fsync by default, but the new valid paths may not be fsync-ed.
> * So some may want to fsync them before registering the validity, at the
> * expense of some speed of the path registering operation. */
> if (settings.syncBeforeRegistering) sync();

In vain, I therefore tried
Toggle diff (13 lines)
diff --git a/nix/libstore/globals.cc b/nix/libstore/globals.cc
index d4f9a46a74..5f8a3a3031 100644
--- a/nix/libstore/globals.cc
+++ b/nix/libstore/globals.cc
@@ -40,7 +40,7 @@ Settings::Settings()
reservedSize = 8 * 1024 * 1024;
fsyncMetadata = true;
useSQLiteWAL = true;
- syncBeforeRegistering = false;
+ syncBeforeRegistering = true;
useSubstitutes = true;
useChroot = false;
impersonateLinux26 = false;
But it changes nothing.

Regards,
Florian
P
P
pelzflorian (Florian Pelz) wrote on 17 Dec 2022 20:36
(name . Ludovic Courtès)(address . ludo@gnu.org)
87v8m9hlg4.fsf@pelzflorian.de
Ahoi. :)

Ludovic Courtès <ludo@gnu.org> writes:
Toggle quote (5 lines)
>> Now that you found the dynamic-wind’s out-guard does not even run:
> It does not run on C-c, but it does run in other cases, typically if you
> just press Enter after reading the message that says “command failed,
> press Enter”.

Ahh. Then would it be good if you at least pushed the partial fix about
replacing 'restart' with 'stop-service'? I’m unsure now if it has an
effect on the likelihood that a second resume works again. But maybe it
does. And is closer to correct.


Toggle quote (2 lines)
> I don’t see how to address the C-c issue so we’ll have to live with it.

Yes. Thank you for all investigations!

Regards,
Florian
L
L
Ludovic Courtès wrote on 17 Dec 2022 22:30
(name . pelzflorian (Florian Pelz))(address . pelzflorian@pelzflorian.de)
87len5hg5h.fsf@gnu.org
"pelzflorian (Florian Pelz)" <pelzflorian@pelzflorian.de> skribis:

Toggle quote (10 lines)
> I saw a comment
>> void LocalStore::registerValidPaths(const ValidPathInfos & infos)
>> {
>> /* SQLite will fsync by default, but the new valid paths may not be fsync-ed.
>> * So some may want to fsync them before registering the validity, at the
>> * expense of some speed of the path registering operation. */
>> if (settings.syncBeforeRegistering) sync();
>
> In vain, I therefore tried

Yeah, I don’t think this has much to do with syncing data on disk. It’s
an inconsistency between the store database and the actual store.

Ludo’.
L
L
Ludovic Courtès wrote on 18 Dec 2022 01:23
(name . pelzflorian (Florian Pelz))(address . pelzflorian@pelzflorian.de)
87cz8hh85l.fsf@gnu.org
After spending a few more hours on this, I got convinced that upon
restarting guix-daemon, even though we had restored
/var/guix/db/db.sqlite, the presence of stale db.sqlite-{wal,shm} files
could lead sqlite to do as if transactions in the WAL file had been
committed.

Commit 495c50008be91429ebea3805e161a1e385a2a572 deletes these two
files, and it appears to solve the problem for me.

I also pushed the patch previously shared in this thread, to make sure
db.sqlite is only copied when guix-daemon is stopped.

So we have this:

495c50008b installer: final: Delete SQLite WAL and shm files upon completion.
9b6703eabe installer: final: Stop guix-daemon before accessing store database.

I’ll go ahead and prepare for the release as planned, to be published on Monday.

Ludo’.
P
P
pelzflorian (Florian Pelz) wrote on 18 Dec 2022 17:41
(name . Ludovic Courtès)(address . ludo@gnu.org)
874jtsvf42.fsf@pelzflorian.de
"pelzflorian (Florian Pelz)" <pelzflorian@pelzflorian.de> writes:
Toggle quote (3 lines)
> * If I resume a crashed installer, I need to resume twice because the
> first resume always fails immediately.

Hooray, you fixed it. Ludo, your debugging speed is miraculous. I did
not know SQLite uses multiple files per database.


Toggle quote (3 lines)
> * With bad luck, it permanently fails, even a second, third, fourth,
> fifth time fail.

It can still permanently fail to resume, e.g. sometimes when doing
Ctrl-c during download of a substitue, it will continue to say nss-certs
is an unknown package, but that may be too rare to happen by chance and
is not what this bug was about.

Closing!

Regards,
Florian
Closed
?