gforge.inria.fr to be taken off-line in Dec. 2020

Open. Submitted by Ludovic Courtès.
Details
11 participants
  • Andreas Enge
  • Dr. Arne Babenhauserheide
  • Bengt Richter
  • Ludovic Courtès
  • Ludovic Courtès
  • Christopher Baines
  • Maxim Cournoyer
  • raingloom
  • Ricardo Wurmus
  • Timothy Sample
  • zimoun
Owner
unassigned
Severity
important
Ludovic Courtès wrote on 2 Jul 2020 09:29
(address . bug-guix@gnu.org)
87mu4iv0gc.fsf@inria.fr
Hello!

The hosting site gforge.inria.fr will be taken off-line in December
2020. This GForge instance hosts source code as tarballs, Subversion
repos, and Git repos. Users have been invited to migrate to
gitlab.inria.fr, which is Git only. It seems that Software Heritage
hasn’t archived (yet) all of gforge.inria.fr. Let’s keep track of the
situation in this issue.

The following packages have their source on gforge.inria.fr:

scheme@(guile-user)> ,pp packages-on-gforge
$7 = (#<package r-spams@2.6-2017-03-22 gnu/packages/statistics.scm:3931 7f632401a640>
#<package ocaml-cudf@0.9 gnu/packages/ocaml.scm:295 7f63235eb3c0>
#<package ocaml-dose3@5.0.1 gnu/packages/ocaml.scm:357 7f63235eb280>
#<package mpfi@1.5.4 gnu/packages/multiprecision.scm:158 7f632ee3adc0>
#<package pt-scotch@6.0.6 gnu/packages/maths.scm:2920 7f632d832640>
#<package scotch@6.0.6 gnu/packages/maths.scm:2774 7f632d832780>
#<package pt-scotch32@6.0.6 gnu/packages/maths.scm:2944 7f632d8325a0>
#<package scotch32@6.0.6 gnu/packages/maths.scm:2873 7f632d8326e0>
#<package gf2x@1.2 gnu/packages/algebra.scm:103 7f6323ea1280>
#<package gmp-ecm@7.0.4 gnu/packages/algebra.scm:658 7f6323eb4960>
#<package cmh@1.0 gnu/packages/algebra.scm:322 7f6323eb4dc0>)

‘isl’ (a dependency of GCC) has its source on gforge.inria.fr but it’s
also mirrored at gcc.gnu.org apparently.

Of these, the following are available on Software Heritage:

scheme@(guile-user)> ,pp archived-source
$8 = (#<package ocaml-cudf@0.9 gnu/packages/ocaml.scm:295 7f63235eb3c0>
#<package ocaml-dose3@5.0.1 gnu/packages/ocaml.scm:357 7f63235eb280>
#<package pt-scotch@6.0.6 gnu/packages/maths.scm:2920 7f632d832640>
#<package scotch@6.0.6 gnu/packages/maths.scm:2774 7f632d832780>
#<package pt-scotch32@6.0.6 gnu/packages/maths.scm:2944 7f632d8325a0>
#<package scotch32@6.0.6 gnu/packages/maths.scm:2873 7f632d8326e0>
#<package isl@0.18 gnu/packages/gcc.scm:925 7f632dc82320>
#<package isl@0.11.1 gnu/packages/gcc.scm:939 7f632dc82280>)

So we’ll be missing these:

scheme@(guile-user)> ,pp (lset-difference eq? $7 $8)
$11 = (#<package r-spams@2.6-2017-03-22 gnu/packages/statistics.scm:3931 7f632401a640>
#<package mpfi@1.5.4 gnu/packages/multiprecision.scm:158 7f632ee3adc0>
#<package gf2x@1.2 gnu/packages/algebra.scm:103 7f6323ea1280>
#<package gmp-ecm@7.0.4 gnu/packages/algebra.scm:658 7f6323eb4960>
#<package cmh@1.0 gnu/packages/algebra.scm:322 7f6323eb4dc0>)

Attached the code I used for this.

Thanks,
Ludo’.
(use-modules (guix) (gnu)
             (guix svn-download)
             (guix git-download)
             (guix swh)
             (ice-9 match)
             (srfi srfi-1)
             (srfi srfi-26))

(define (gforge? package)
  (define (gforge-string? str)
    (string-contains str "gforge.inria.fr"))

  (match (package-source package)
    ((? origin? o)
     (match (origin-uri o)
       ((? string? url)
        (gforge-string? url))
       (((? string? urls) ...)
        (any gforge-string? urls))                ;or 'find'
       ((? git-reference? ref)
        (gforge-string? (git-reference-url ref)))
       ((? svn-reference? ref)
        (gforge-string? (svn-reference-url ref)))
       (_ #f)))
    (_ #f)))

(define packages-on-gforge
  (fold-packages (lambda (package result)
                   (if (gforge? package)
                       (cons package result)
                       result))
                 '()))

(define archived-source
  (filter (lambda (package)
            (let* ((origin (package-source package))
                   (hash  (origin-hash origin)))
              (lookup-content (content-hash-value hash)
                              (symbol->string
                               (content-hash-algorithm hash)))))
          packages-on-gforge))
zimoun wrote on 2 Jul 2020 10:50
(name . Maurice Brémond)(address . Maurice.Bremond@inria.fr)
86h7uq8fmk.fsf@gmail.com
Hi Ludo,

On Thu, 02 Jul 2020 at 09:29, Ludovic Courtès <ludovic.courtes@inria.fr> wrote:

> The hosting site gforge.inria.fr will be taken off-line in December
> 2020. This GForge instance hosts source code as tarballs, Subversion
> repos, and Git repos. Users have been invited to migrate to
> gitlab.inria.fr, which is Git only. It seems that Software Heritage
> hasn’t archived (yet) all of gforge.inria.fr. Let’s keep track of the
> situation in this issue.

[...]

> --8<---------------cut here---------------start------------->8---
> scheme@(guile-user)> ,pp (lset-difference eq? $7 $8)
> $11 = (#<package r-spams@2.6-2017-03-22 gnu/packages/statistics.scm:3931 7f632401a640>
> #<package mpfi@1.5.4 gnu/packages/multiprecision.scm:158 7f632ee3adc0>
> #<package gf2x@1.2 gnu/packages/algebra.scm:103 7f6323ea1280>
> #<package gmp-ecm@7.0.4 gnu/packages/algebra.scm:658 7f6323eb4960>
> #<package cmh@1.0 gnu/packages/algebra.scm:322 7f6323eb4dc0>)
> --8<---------------cut here---------------end--------------->8---

All 5 use 'url-fetch', so we can expect that sources.json will be up
before the shutdown in December. :-)

Then, all 14 packages we have from gforge.inria.fr will be
git-fetch, right? So should we ask upstream to let us know when they
switch? Then we can adapt the origin.

> (use-modules (guix) (gnu)
> (guix svn-download)
> (guix git-download)
> (guix swh)

It does not work properly unless I replace it with

((guix swh) #:hide (origin?))

Well, I have not investigated further.

> (ice-9 match)
> (srfi srfi-1)
> (srfi srfi-26))

[...]

> (define archived-source
> (filter (lambda (package)
> (let* ((origin (package-source package))
> (hash (origin-hash origin)))
> (lookup-content (content-hash-value hash)
> (symbol->string
> (content-hash-algorithm hash)))))
> packages-on-gforge))

I am a bit lost about the other discussion on falling back for tarball.
But that's another story. :-)


Cheers,
simon
Ludovic Courtès wrote on 2 Jul 2020 12:03
(name . zimoun)(address . zimon.toutoune@gmail.com)
87d05etero.fsf@gnu.org
zimoun <zimon.toutoune@gmail.com> skribis:

> On Thu, 02 Jul 2020 at 09:29, Ludovic Courtès <ludovic.courtes@inria.fr> wrote:
>
>> The hosting site gforge.inria.fr will be taken off-line in December
>> 2020. This GForge instance hosts source code as tarballs, Subversion
>> repos, and Git repos. Users have been invited to migrate to
>> gitlab.inria.fr, which is Git only. It seems that Software Heritage
>> hasn’t archived (yet) all of gforge.inria.fr. Let’s keep track of the
>> situation in this issue.
>
> [...]
>
>> --8<---------------cut here---------------start------------->8---
>> scheme@(guile-user)> ,pp (lset-difference eq? $7 $8)
>> $11 = (#<package r-spams@2.6-2017-03-22 gnu/packages/statistics.scm:3931 7f632401a640>
>> #<package mpfi@1.5.4 gnu/packages/multiprecision.scm:158 7f632ee3adc0>
>> #<package gf2x@1.2 gnu/packages/algebra.scm:103 7f6323ea1280>
>> #<package gmp-ecm@7.0.4 gnu/packages/algebra.scm:658 7f6323eb4960>
>> #<package cmh@1.0 gnu/packages/algebra.scm:322 7f6323eb4dc0>)
>> --8<---------------cut here---------------end--------------->8---
>
> All the 5 are 'url-fetch' so we can expect that sources.json will be up
> before the shutdown on December. :-)

Unfortunately, it won’t help for tarballs:


There’s this other discussion you mentioned, which I hope will have a
positive outcome:

https://forge.softwareheritage.org/T2430

>> (use-modules (guix) (gnu)
>> (guix svn-download)
>> (guix git-download)
>> (guix swh)
>
> It does not work properly if I do not replace by
>
> ((guix swh) #:hide (origin?))

Oh right, I had overlooked this as I played at the REPL.

Thanks,
Ludo’.
Ludovic Courtès wrote on 10 Jul 2020 00:30
control message for bug #42162
(address . control@debbugs.gnu.org)
877dvc9v9o.fsf@gnu.org
severity 42162 important
quit
Ludovic Courtès wrote on 11 Jul 2020 17:50
Recovering source tarballs
(name . zimoun)(address . zimon.toutoune@gmail.com)
87r1tit5j6.fsf_-_@gnu.org
Hi,

Ludovic Courtès <ludo@gnu.org> skribis:

> There’s this other discussion you mentioned, which I hope will have a
> positive outcome:
>
> https://forge.softwareheritage.org/T2430

This discussion as well as discussions on #swh-devel have made it clear
that SWH will not archive raw tarballs, at least not in the foreseeable
future. Instead, it will keep archiving the contents of tarballs, as it
has always done—that’s already a huge service.

Not storing raw tarballs makes sense from an engineering perspective,
but it does mean that we cannot rely on SWH as a content-addressed
mirror for tarballs. (In fact, some raw tarballs are available on SWH,
but that’s mostly “by chance”, for instance because they appear as-is in
a Git repo that was ingested.) In fact this is one of the challenges
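mentioned in
<https://guix.gnu.org/blog/2019/connecting-reproducible-deployment-to-a-long-term-source-code-archive/>.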

So we need a solution for now (and quite urgently), and a solution for
the future.

For the now, since 70% of our packages use ‘url-fetch’, we need to be
able to fetch or to reconstruct tarballs. There’s no way around it.

In the short term, we should arrange so that the build farm keeps GC
roots on source tarballs for an indefinite amount of time. Cuirass
jobset? Mcron job to preserve GC roots? Ideas?

For the future, we could store nar hashes of unpacked tarballs instead
of hashes over tarballs. But that raises two questions:

• If we no longer deal with tarballs but upstreams keep signing
tarballs (not raw directory hashes), how can we authenticate our
code after the fact?

• SWH internally stores Git-tree hashes, not nar hashes, so we still
wouldn’t be able to fetch our unpacked trees from SWH.

(Both issues were previously discussed at
<https://sympa.inria.fr/sympa/arc/swh-devel/2016-07/>.)

So for the medium term, and perhaps for the future, a possible option
would be to preserve tarball metadata so we can reconstruct them:

tarball = metadata + tree

After all, tarballs are byproducts and should be no exception: we should
build them from source. :-)

pristine-tar does almost that, but not quite: it stores a binary
delta between a tarball and a tree.


I think we should have something more transparent than a binary delta.

The code below can “disassemble” and “assemble” a tar. When it
disassembles it, it generates metadata like this:

(tar-source
(version 0)
(headers
(("guile-3.0.4/"
(mode 493)
(size 0)
(mtime 1593007723)
(chksum 3979)
(typeflag #\5))
("guile-3.0.4/m4/"
(mode 493)
(size 0)
(mtime 1593007720)
(chksum 4184)
(typeflag #\5))
("guile-3.0.4/m4/pipe2.m4"
(mode 420)
(size 531)
(mtime 1536050419)
(chksum 4812)
(hash (sha256
"arx6n2rmtf66yjlwkgwp743glcpdsfzgjiqrqhfegutmcwvwvsza")))
("guile-3.0.4/m4/time_h.m4"
(mode 420)
(size 5471)
(mtime 1536050419)
(chksum 4974)
(hash (sha256
"z4py26rmvsk4st7db6vwziwwhkrjjrwj7nra4al6ipqh2ms45kka")))
[…]
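
As an illustration of how such metadata can be consumed, here is a
minimal sketch in plain Guile (it is not part of the attached code) that
lists the entries carrying a content hash, that is, the ones an
assembler would have to look up. The name ‘hashed-entries’ is made up
for this example and ‘example-source’ is a shortened copy of the sexp
above. Note that modes are printed in decimal: 493 is #o755 and 420 is
#o644.

```scheme
(use-modules (ice-9 match)
             (srfi srfi-1))

;; A shortened copy of the metadata shown above (illustration only).
(define example-source
  '(tar-source
    (version 0)
    (headers
     (("guile-3.0.4/"
       (mode 493) (size 0) (mtime 1593007723) (chksum 3979)
       (typeflag #\5))
      ("guile-3.0.4/m4/pipe2.m4"
       (mode 420) (size 531) (mtime 1536050419) (chksum 4812)
       (hash (sha256
              "arx6n2rmtf66yjlwkgwp743glcpdsfzgjiqrqhfegutmcwvwvsza")))))))

(define (hashed-entries source)
  "Return a list of (NAME ALGORITHM HASH) for each entry of SOURCE that
carries a content hash; entries without a hash, such as directories, are
skipped."
  (match source
    (('tar-source ('version 0) ('headers headers) _ ...)
     (filter-map (match-lambda
                   ((name . properties)
                    (match (assq-ref properties 'hash)
                      (((algorithm hash))
                       (list name algorithm hash))
                      (_ #f))))
                 headers))))
```

With the data above, (hashed-entries example-source) returns a
one-element list for ‘pipe2.m4’; the directory entry is dropped because
it has no ‘hash’ property.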

The ‘assemble-archive’ procedure consumes that, looks up file contents
by hash on SWH, and reconstructs the original tarball…

… at least in theory, because in practice we hit the SWH rate limit
after looking up a few files:


So it’s a bit ridiculous, but we may have to store a SWH “dir”
identifier for the whole extracted tree—a Git-tree hash—since that would
allow us to retrieve the whole thing in a single HTTP request.

Besides, we’ll also have to handle compression: storing gzip/xz headers
and compression levels.
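
To make the gzip part concrete, here is a rough sketch, assuming we only
care about the fixed 10-byte header defined by RFC 1952. The procedure
name and the shape of the resulting sexp are invented for this example.
It does not handle the optional fields (original file name, comment,
extra data) that the flags byte may announce, and xz uses a different
container format altogether. The extra-flags byte is only a hint about
the compression level, so more information would likely be needed to
reproduce a compressed stream bit for bit.

```scheme
(use-modules (rnrs bytevectors))

(define (gzip-header->sexp bv)
  "Return an sexp describing the fixed 10-byte gzip header at the start of
BV, or #f if BV is too short or lacks the gzip magic bytes."
  (and (>= (bytevector-length bv) 10)
       (= #x1f (bytevector-u8-ref bv 0))          ;ID1
       (= #x8b (bytevector-u8-ref bv 1))          ;ID2
       `(gzip-header
         (method ,(bytevector-u8-ref bv 2))       ;8 means "deflate"
         (flags ,(bytevector-u8-ref bv 3))
         (mtime ,(bytevector-u32-ref bv 4 (endianness little)))
         (extra-flags ,(bytevector-u8-ref bv 8))  ;2 = best, 4 = fastest
         (os ,(bytevector-u8-ref bv 9)))))        ;3 = Unix
```

Such an sexp could sit next to the tar metadata, for instance under a
hypothetical ‘compression’ field.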


How would we put that in practice? Good question. :-)

I think we’d have to maintain a database that maps tarball hashes to
metadata (!). A simple version of it could be a Git repo where, say,
‘sha256/0mq9fc0ig0if5x9zjrs78zz8gfzczbvykj2iwqqd6salcqdgdwhk’ would
contain the metadata above. The nice thing is that the Git repo itself
could be archived by SWH. :-)
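
A minimal sketch of that layout, assuming the database is a plain
checkout on disk with one file per tarball hash. Both procedure names
are hypothetical and nothing here talks to Git or SWH:

```scheme
(define (metadata-file-name algorithm hash)
  "Return the file name, relative to the database root, that holds the
metadata of the tarball identified by ALGORITHM (a symbol) and HASH (a
string)."
  (string-append (symbol->string algorithm) "/" hash))

(define (lookup-metadata root algorithm hash)
  "Return the sexp stored under ROOT for the given tarball hash, or #f if
the database has no entry for it."
  (let ((file (string-append root "/"
                             (metadata-file-name algorithm hash))))
    (and (file-exists? file)
         (call-with-input-file file read))))
```

For the hash mentioned above, ‘metadata-file-name’ returns
"sha256/0mq9fc0ig0if5x9zjrs78zz8gfzczbvykj2iwqqd6salcqdgdwhk".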

Thus, if a tarball vanishes, we’d look it up in the database and
reconstruct it from its metadata plus content stored in SWH.

Thoughts?

Anyhow, we should team up with fellow NixOS and SWH hackers to address
this, and with developers of other distros as well—this problem is not
just that of the functional deployment geeks, is it?

Ludo’.
;;; GNU Guix --- Functional package management for GNU
;;; Copyright © 2020 Ludovic Courtès <ludo@gnu.org>
;;;
;;; This file is part of GNU Guix.
;;;
;;; GNU Guix is free software; you can redistribute it and/or modify it
;;; under the terms of the GNU General Public License as published by
;;; the Free Software Foundation; either version 3 of the License, or (at
;;; your option) any later version.
;;;
;;; GNU Guix is distributed in the hope that it will be useful, but
;;; WITHOUT ANY WARRANTY; without even the implied warranty of
;;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
;;; GNU General Public License for more details.
;;;
;;; You should have received a copy of the GNU General Public License
;;; along with GNU Guix. If not, see <http://www.gnu.org/licenses/>.

(define-module (tar)
  #:use-module (ice-9 match)
  #:use-module (ice-9 binary-ports)
  #:use-module (rnrs bytevectors)
  #:use-module (srfi srfi-1)
  #:use-module (srfi srfi-9)
  #:use-module (srfi srfi-26)
  #:use-module (gcrypt hash)
  #:use-module (guix base16)
  #:use-module (guix base32)
  ;; 'dump-port' is used by 'assemble-archive' below.
  #:use-module ((guix build utils) #:select (dump-port))
  #:use-module ((ice-9 rdelim) #:select ((read-string . get-string-all)))
  #:use-module (web client)
  #:use-module (web response)
  #:export (disassemble-archive
            assemble-archive))


;;;
;;; Tar.
;;;

(define %TMAGIC "ustar\0")
(define %TVERSION "00")

(define-syntax-rule (define-field-type type type-size read-proc write-proc)
  "Define TYPE as a ustar header field type of TYPE-SIZE bytes. READ-PROC is
the procedure to obtain the value of an object of this type from a bytevector,
and WRITE-PROC writes it to a bytevector."
  (define-syntax type
    (syntax-rules (read write size)
      ((_ size)  type-size)
      ((_ read)  read-proc)
      ((_ write) write-proc))))

(define (sub-bytevector bv offset size)
  (let ((sub (make-bytevector size)))
    (bytevector-copy! bv offset sub 0 size)
    sub))

(define (read-integer bv offset len)
  (string->number (read-string bv offset len) 8))

(define read-integer12 (cut read-integer <> <> 12))
(define read-integer8 (cut read-integer <> <> 8))

(define (read-string bv offset max-len)
  (define len
    (let loop ((len 0))
      (cond ((= len max-len)
             len)
            ((zero? (bytevector-u8-ref bv (+ offset len)))
             len)
            (else
             (loop (+ 1 len))))))

  (utf8->string (sub-bytevector bv offset len)))

(define read-string155 (cut read-string <> <> 155))
(define read-string100 (cut read-string <> <> 100))
(define read-string32 (cut read-string <> <> 32))
(define read-string6 (cut read-string <> <> 6))
(define read-string2 (cut read-string <> <> 2))

(define (read-character bv offset)
  (integer->char (bytevector-u8-ref bv offset)))

(define (read-padding12 bv offset)
  (bytevector-uint-ref bv offset (endianness big) 12))

(define (write-integer! bv offset value len)
  (let ((str (string-pad (number->string value 8) (- len 1) #\0)))
    (write-string! bv offset str len)))

(define write-integer12! (cut write-integer! <> <> <> 12))
(define write-integer8! (cut write-integer! <> <> <> 8))

(define (write-string! bv offset str len)
  (let* ((str (string-pad-right str len #\nul))
         (buf (string->utf8 str)))
    (bytevector-copy! buf 0 bv offset (bytevector-length buf))))

(define write-string155! (cut write-string! <> <> <> 155))
(define write-string100! (cut write-string! <> <> <> 100))
(define write-string32! (cut write-string! <> <> <> 32))
(define write-string6! (cut write-string! <> <> <> 6))
(define write-string2! (cut write-string! <> <> <> 2))

(define (write-character! bv offset value)
  (bytevector-u8-set! bv offset (char->integer value)))

(define (write-padding12! bv offset value)
  (bytevector-uint-set! bv offset value (endianness big) 12))

(define-field-type integer12 12 read-integer12 write-integer12!)
(define-field-type integer8 8 read-integer8 write-integer8!)
(define-field-type character 1 read-character write-character!)
(define-field-type string155 155 read-string155 write-string155!)
(define-field-type string100 100 read-string100 write-string100!)
(define-field-type string32 32 read-string32 write-string32!)
(define-field-type string6 6 read-string6 write-string6!)
(define-field-type string2 2 read-string2 write-string2!)
(define-field-type padding12 12 read-padding12 write-padding12!)

(define-syntax define-pack
  (syntax-rules ()
    ((_ type ctor pred
        write-header read-header
        (field-names field-types field-getters) ...)
     (begin
       (define-record-type type
         (ctor field-names ...)
         pred
         (field-names field-getters) ...)

       (define (read-header port)
         "Return the ustar header read from PORT."
         (set-port-encoding! port "ISO-8859-1")

         (let ((bv (get-bytevector-n port (+ (field-types size) ...))))
           (letrec-syntax ((build
                            (syntax-rules ()
                              ((_ bv () offset (fields (... ...)))
                               (ctor fields (... ...)))
                              ((_ bv (type0 types (... ...))
                                  offset (fields (... ...)))
                               (build bv
                                      (types (... ...))
                                      (+ offset (type0 size))
                                      (fields (... ...)
                                              ((type0 read) bv offset)))))))
             (build bv (field-types ...) 0 ()))))

       (define (write-header header port)
         "Serialize HEADER, a <ustar-header> record, to PORT."
         (let* ((len (+ (field-types size) ...))
                (bv  (make-bytevector len)))
           (match header
             (($ type field-names ...)
              (letrec-syntax ((write!
                               (syntax-rules ()
                                 ((_ () offset)
                                  #t)
                                 ((_ ((type value) rest (... ...)) offset)
                                  (begin
                                    ((type write) bv offset value)
                                    (write! (rest (... ...))
                                            (+ offset (type size))))))))
                (write! ((field-types field-names) ...) 0)
                (put-bytevector port bv)))))))))))

;; The ustar header. See <tar.h>.
(define-pack <ustar-header>
  %make-ustar-header ustar-header?
  write-ustar-header read-ustar-header
  (name          string100   ustar-header-name)   ;NUL-terminated if NUL fits
  (mode          integer8    ustar-header-mode)
  (uid           integer8    ustar-header-uid)
  (gid           integer8    ustar-header-gid)
  (size          integer12   ustar-header-size)
  (mtime         integer12   ustar-header-mtime)
  (chksum        integer8    ustar-header-checksum)
  (typeflag      character   ustar-header-type-flag)
  (linkname      string100   ustar-header-link-name)
  (magic         string6     ustar-header-magic)   ;must be TMAGIC
  (version       string2     ustar-header-version) ;must be TVERSION
  (uname         string32    ustar-header-uname)   ;NUL-terminated
  (gname         string32    ustar-header-gname)   ;NUL-terminated
  (devmajor      integer8    ustar-header-device-major)
  (devminor      integer8    ustar-header-device-minor)
  (prefix        string155   ustar-header-prefix) ;NUL-terminated if NUL fits
  (padding       padding12   ustar-header-padding))

(define* (make-ustar-header name
                            #:key
                            (mode 0) (uid 0) (gid 0) (size 0)
                            (mtime 0) (checksum 0) (type-flag 0)
                            (link-name "")
                            (magic %TMAGIC) (version %TVERSION)
                            (uname "") (gname "")
                            (device-major 0) (device-minor 0)
                            (prefix "") (padding 0))
  (%make-ustar-header name mode uid gid size mtime checksum
                      type-flag link-name magic version uname gname
                      device-major device-minor prefix padding))

(define %zero-header
  ;; The all-zeros header, which marks the end of stream.
  (read-ustar-header (open-bytevector-input-port
                      (make-bytevector 512 0))))

(define (consumer port)
  "Return a procedure that consumes or skips the given number of bytes from
PORT."
  (if (false-if-exception (seek port 0 SEEK_CUR))
      (lambda (len)
        (seek port len SEEK_CUR))
      (lambda (len)
        (define bv (make-bytevector 8192))
        (let loop ((len len))
          (define block (min len (bytevector-length bv)))
          (unless (or (zero? block)
                      (eof-object? (get-bytevector-n! port bv 0 block)))
            (loop (- len block)))))))

(define (fold-archive proc seed port)
  "Read ustar headers from PORT; for each header, call PROC."
  (define skip
    (consumer port))

  (let loop ((result seed))
    (define header
      (read-ustar-header port))

    (if (equal? header %zero-header)
        result
        (let* ((result    (proc header port result))
               (size      (ustar-header-size header))
               (remainder (modulo size 512)))
          ;; It's up to PROC to consume the SIZE bytes of data corresponding
          ;; to HEADER. Here we consume padding.
          (unless (zero? remainder)
            (skip (- 512 remainder)))
          (loop result)))))


;;;
;;; Disassembling/assembling an archive.
;;;

(define (dump in out size)
  "Copy SIZE bytes from IN to OUT."
  (define buf-size 65536)
  (define buf (make-bytevector buf-size))

  (let loop ((left size))
    (if (<= left 0)
        0
        (let ((read (get-bytevector-n! in buf 0 (min left buf-size))))
          (if (eof-object? read)
              left
              (begin
                (put-bytevector out buf 0 read)
                (loop (- left read))))))))

(define* (disassemble-archive port #:optional
                              (algorithm (hash-algorithm sha256)))
  "Read tar archive from PORT and return an sexp representing its metadata,
including individual file hashes with ALGORITHM."
  (define headers+hashes
    (fold-archive (lambda (header port result)
                    (if (zero? (ustar-header-size header))
                        (alist-cons header #f result)
                        (let ()
                          (define-values (hash-port get-hash)
                            (open-hash-port algorithm))

                          (dump port hash-port
                                (ustar-header-size header))
                          (close-port hash-port)
                          (alist-cons header (get-hash) result))))
                  '()
                  port))

  (define header+hash->sexp
    (match-lambda
      ((header . hash)
       (letrec-syntax ((serialize (syntax-rules ()
                                    ((_) '())
                                    ((_ (tag get default) rest ...)
                                     (let ((value (get header)))
                                       (append (if (equal? default value)
                                                   '()
                                                   `((tag ,value)))
                                               (serialize rest ...))))
                                    ((_ (tag get) rest ...)
                                     (append `((tag ,(get header)))
                                             (serialize rest ...))))))
         `(,(ustar-header-name header)
           ,@(serialize (mode ustar-header-mode)
                        (uid ustar-header-uid 0)
                        (gid ustar-header-gid 0)
                        (size ustar-header-size)
                        (mtime ustar-header-mtime)
                        (chksum ustar-header-checksum)
                        (typeflag ustar-header-type-flag #\nul)
                        (linkname ustar-header-link-name "")
                        (magic ustar-header-magic "")
                        (version ustar-header-version "")
                        (uname ustar-header-uname "")
                        (gname ustar-header-gname "")
                        (devmajor ustar-header-device-major 0)
                        (devminor ustar-header-device-minor 0)
                        (prefix ustar-header-prefix "")
                        (padding ustar-header-padding 0)
                        (hash (lambda (_)
                                (and hash
                                     `(,(hash-algorithm-name algorithm)
                                       ,(bytevector->base32-string hash))))
                              #f)))))))

  `(tar-source
    (version 0)
    (headers ,(map header+hash->sexp (reverse headers+hashes)))))

(define (fetch-from-swh algorithm hash)
  (define url
    (string-append "https://archive.softwareheritage.org/api/1/content/"
                   (symbol->string algorithm) ":"
                   (bytevector->base16-string hash) "/raw/"))

  (define-values (response port)
    (http-get url #:streaming? #t #:verify-certificate? #f))

  (if (= 200 (response-code response))
      port
      (throw 'swh-fetch-error url (get-string-all port))))

(define* (assemble-archive source port
                           #:optional (fetch-data fetch-from-swh))
  "Assemble archive from SOURCE, an sexp as returned by
'disassemble-archive'."
  (define sexp->header
    (match-lambda
      ((name . properties)
       (let ((ref (lambda (field)
                    (and=> (assq-ref properties field) car))))
         (make-ustar-header name
                            #:mode (ref 'mode)
                            #:uid (or (ref 'uid) 0)
                            #:gid (or (ref 'gid) 0)
                            #:size (ref 'size)
                            #:mtime (ref 'mtime)
                            #:checksum (ref 'chksum)
                            #:type-flag (or (ref 'typeflag) #\nul)
                            #:link-name (or (ref 'linkname) "")
                            #:magic (or (ref 'magic) "")
                            #:version (or (ref 'version) "")
                            #:uname (or (ref 'uname) "")
                            #:gname (or (ref 'gname) "")
                            #:device-major (or (ref 'devmajor) 0)
                            #:device-minor (or (ref 'devminor) 0)
                            #:prefix (or (ref 'prefix) "")
                            #:padding (or (ref 'padding) 0))))))

  (define sexp->data
    (match-lambda
      ((name . properties)
       (match (assq-ref properties 'hash)
         (((algorithm (= base32-string->bytevector hash)) _ ...)
          (fetch-data algorithm hash))
         (#f
          (open-input-string ""))))))

  (match source
    (('tar-source ('version 0) ('headers headers) _ ...)
     (for-each (lambda (sexp)
                 (let ((header (sexp->header sexp))
                       (data   (sexp->data sexp)))
                   (write-ustar-header header port)
                   (dump-port data port)
                   (close-port data)))
               headers))))
Christopher Baines wrote on 13 Jul 2020 21:20
(name . Ludovic Courtès)(address . ludo@gnu.org)
87a703jk78.fsf@cbaines.net
Ludovic Courtès <ludo@gnu.org> writes:

> Hi,
>
> Ludovic Courtès <ludo@gnu.org> skribis:
>
>> There’s this other discussion you mentioned, which I hope will have a
>> positive outcome:
>>
>> https://forge.softwareheritage.org/T2430
>
> This discussion as well as discussions on #swh-devel have made it clear
> that SWH will not archive raw tarballs, at least not in the foreseeable
> future. Instead, it will keep archiving the contents of tarballs, as it
> has always done—that’s already a huge service.
>
> Not storing raw tarballs makes sense from an engineering perspective,
> but it does mean that we cannot rely on SWH as a content-addressed
> mirror for tarballs. (In fact, some raw tarballs are available on SWH,
> but that’s mostly “by chance”, for instance because they appear as-is in
> a Git repo that was ingested.) In fact this is one of the challenges
> mentioned in
> <https://guix.gnu.org/blog/2019/connecting-reproducible-deployment-to-a-long-term-source-code-archive/>.
>
> So we need a solution for now (and quite urgently), and a solution for
> the future.
>
> For the now, since 70% of our packages use ‘url-fetch’, we need to be
> able to fetch or to reconstruct tarballs. There’s no way around it.
>
> In the short term, we should arrange so that the build farm keeps GC
> roots on source tarballs for an indefinite amount of time. Cuirass
> jobset? Mcron job to preserve GC roots? Ideas?

Going forward, being methodical as a project about storing the tarballs
and source material for the packages is probably the way to ensure it's
available for the future. I'm not sure the data storage cost is
significant; the cost of doing this is probably in working out what to
store, doing so in a redundant manner, and making the data available.

The Guix Data Service knows about fixed output derivations, so it might
be possible to backfill such a store by just attempting to build those
derivations. It might also be possible to use the Guix Data Service to
work out what's available, and what tarballs are missing.

Chris

zimoun wrote on 15 Jul 2020 18:55
Re: Recovering source tarballs
(name . Ludovic Courtès)(address . ludo@gnu.org)
CAJ3okZ2iesMKLLD29qrOJzNBjV=haoPFtpRg9T=0aTA8ZxLQMA@mail.gmail.com
Hi Ludo,

Well, you enlarge the discussion to more than the issue of the 5
url-fetch packages on gforge.inria.fr :-)


First of all, you wrote [1] ``Migration away from tarballs is already
happening as more and more software is distributed straight from
content-addressed VCS repositories, though progress has been relatively
slow since we first discussed it in 2016.'' but on the other hand Guix
uses "url-fetch" more often than not [2], even when "git-fetch" is
available upstream. In other words, I am not convinced the migration is
really happening...

The issue would be mitigated if Guix transitions from "url-fetch" to
"git-fetch" when possible.



Second, trying to do some stats about the SWH coverage, I note that
a non-negligible number of "url-fetch" sources are reachable by
"lookup-content". Measuring the coverage is not straightforward because
of the rate limit of 120 requests per hour and unexpected server errors.
Another story.

Well, I would like to have numbers because I do not know concretely what
the issue is: how many "url-fetch" packages are reachable? And if they
are unreachable, is it because they are not archived yet? Or is it
because Guix does not have enough info to look them up?


On Sat, 11 Jul 2020 at 17:50, Ludovic Courtès <ludo@gnu.org> wrote:

> For the now, since 70% of our packages use ‘url-fetch’, we need to be
> able to fetch or to reconstruct tarballs. There’s no way around it.

Yes, but for example all the packages in gnu/packages/bioconductor.scm
could be "git-fetch". Today the source is fetched with url-fetch but it
could come from git@git.bioconductor.org:packages/flowCore.

Another example is the packages in gnu/packages/emacs-xyz.scm and the
ones from elpa.gnu.org are "url-fetch" and could be "git-fetch", for
example using

So I would be more reserved about the "no way around it". :-) I mean
the 70% could be a bit mitigated.


> In the short term, we should arrange so that the build farm keeps GC
> roots on source tarballs for an indefinite amount of time. Cuirass
> jobset? Mcron job to preserve GC roots? Ideas?

Yes, preserving source tarballs for an indefinite amount of time will
help. At least for all the packages where "lookup-content" returns #f,
which means they are either not in SWH or unreachable -- both are
equivalent from Guix's side.

What about also pushing to IPFS? Feasible? Lookup issue?

> For the future, we could store nar hashes of unpacked tarballs instead
> of hashes over tarballs. But that raises two questions:
>
> • If we no longer deal with tarballs but upstreams keep signing
> tarballs (not raw directory hashes), how can we authenticate our
> code after the fact?

Does Guix automatically authenticate code using signed tarballs?


> • SWH internally store Git-tree hashes, not nar hashes, so we still
> wouldn’t be able to fetch our unpacked trees from SWH.
>
> (Both issues were previously discussed at
> <https://sympa.inria.fr/sympa/arc/swh-devel/2016-07/>.)
>
> So for the medium term, and perhaps for the future, a possible option
> would be to preserve tarball metadata so we can reconstruct them:
>
> tarball = metadata + tree

There are different issues at different levels:

1. how to lookup? what information do we need to keep/store to be able
to query SWH?
2. how to check the integrity? what information do we need to
keep/store to be able to verify that SWH returns what Guix expects?
3. how to authenticate? where the tarball metadata has to be stored if
SWH removes it?

Basically, the git-fetch source stores 3 identifiers:

- upstream url
- commit / tag
- integrity (sha256)

Fetching from SWH requires the commit only (lookup-revision) or the
tag+url (lookup-origin-revision) then from the returned revision, the
integrity of the downloaded data is checked using the sha256, right?

Therefore, one way to fix lookup of the url-fetch source is to add an
extra field mimicking the commit role.

The easiest is to store a SWHID or an identifier from which the SWHID
can be deduced.
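
For reference, a core SWHID has the form swh:1:<type>:<40 hex digits>,
where the type is one of snp, rel, rev, dir, or cnt. Here is a small
sketch of a parser in plain Guile, assuming we ignore the optional
qualifiers that may follow a ";". The procedure name is made up and the
identifier in the usage note is a placeholder, not a real archived
object:

```scheme
(use-modules (ice-9 regex))

(define %swhid-rx
  ;; Core identifier only; qualifiers (";origin=…" and so on) are
  ;; deliberately not supported in this sketch.
  (make-regexp "^swh:1:(snp|rel|rev|dir|cnt):([0-9a-f]{40})$"))

(define (parse-swhid str)
  "Return a pair (TYPE . HEX) for the core SWHID in STR, where TYPE is a
symbol and HEX is the 40-digit hexadecimal hash, or #f if STR is not a
well-formed core SWHID."
  (let ((m (regexp-exec %swhid-rx str)))
    (and m
         (cons (string->symbol (match:substring m 1))
               (match:substring m 2)))))
```

For example, (parse-swhid (string-append "swh:1:dir:" (make-string 40 #\0)))
returns a pair whose car is the symbol ‘dir’, while a malformed string
yields #f.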

I have not checked the code, but something like this:


and at package time, this identifier is added, similarly to integrity.

Aside, does Guix use the authentication metadata that tarballs provide?


(BTW, I failed [3,4] to package swh.model, so if someone wants to give
it a try...)


> After all, tarballs are byproducts and should be no exception: we should
> build them from source. :-)

[...]

> The code below can “disassemble” and “assemble” a tar. When it
> disassembles it, it generates metadata like this:

[...]

> The ’assemble-archive’ procedure consumes that, looks up file contents
> by hash on SWH, and reconstructs the original tarball…

Where do you plan to store the "disassembled" metadata?
And where do you plan to "assemble-archive"?

I mean,

What is pushed to SWH? And how?
What is fetched from SWH? And how?

(Well, answer below. :-))

> … at least in theory, because in practice we hit the SWH rate limit
> after looking up a few files:

Yes, it is 120 requests per hour and 10 saves per hour. Well, I do not
think they will increase these numbers much in general. However,
they seem open to doing so for specific machines. So, I do not want to
speak for them, but we could ask for a higher rate limit for
ci.guix.gnu.org, for example. Then we need to distinguish between source
substitutes and binary substitutes. And basically, when a user runs
"guix build foo", if the source is available neither upstream nor
on ci.guix.gnu.org, then ci.guix.gnu.org fetches the missing sources
from SWH and delivers them to the user.


>
> So it’s a bit ridiculous, but we may have to store a SWH “dir”
> identifier for the whole extracted tree—a Git-tree hash—since that would
> allow us to retrieve the whole thing in a single HTTP request.

Well, the limited resources of SWH are an issue, but SWH is an archive,
not a mirror. :-)

And as I wrote above, we could ask SWH to increase the rate limit for
specific machines such as ci.guix.gnu.org.


> I think we’d have to maintain a database that maps tarball hashes to
> metadata (!). A simple version of it could be a Git repo where, say,
> ‘sha256/0mq9fc0ig0if5x9zjrs78zz8gfzczbvykj2iwqqd6salcqdgdwhk’ would
> contain the metadata above. The nice thing is that the Git repo itself
> could be archived by SWH. :-)

How this database that maps tarball hashes to metadata should be
maintained? Git push hook? Cron task?

What about foreign channels? Should they maintain their own map?

To summarize, it would work like this, right?

at package time:
- store an integrity identifier (today sha256-nix-base32)
- disassemble the tarball
- commit to another repo the metadata using the path (address)
sha256/base32/<identifier>
- push to packages-repo *and* metadata-database-repo

at future time: (upstream has disappeared, say!)
- use the integrity identifier to query the database repo
- lookup the SWHID from the database repo
- fetch the data from SWH
- or lookup the IPFS identifier from the database repo and fetch the
data from IPFS, for another example
- re-assemble the tarball using the metadata from the database repo
- check integrity, authentication, etc.
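
To make the two phases above a bit more concrete, here is a minimal sketch
of such a lookup table, written in Python only for illustration. Everything
in it is an assumption: the function names `store_metadata` and
`lookup_metadata`, the JSON encoding, the metadata keys, and the use of a
hexadecimal digest in the path instead of nix-base32.

```python
import hashlib
import json
from pathlib import Path


def store_metadata(db_root, tarball_bytes, metadata):
    """Package time: file the metadata under sha256/<digest of the tarball>."""
    digest = hashlib.sha256(tarball_bytes).hexdigest()
    path = Path(db_root) / "sha256" / digest
    path.parent.mkdir(parents=True, exist_ok=True)
    # Sorted keys keep the file stable, which is friendlier to a Git repo.
    path.write_text(json.dumps(metadata, sort_keys=True, indent=1))
    return digest


def lookup_metadata(db_root, digest):
    """Future time: map the integrity hash back to the stored metadata.

    Returns None when the tarball was never recorded in the database.
    """
    path = Path(db_root) / "sha256" / digest
    if not path.exists():
        return None
    return json.loads(path.read_text())
```

The metadata returned by the lookup would then carry whatever a backend
needs (a SWHID for SWH, some other identifier for IPFS), and the final
integrity check still uses the hash already stored in the package.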

Well, right, this is better than only adding an identifier for lookup
as I described above, because it is more general and flexible than having
only SWH as a fall-back.

The format of metadata (disassemble) that you propose is schemish
(obviously! :-)) but we could propose something more JSON-like.


All the best,
simon
Ludovic Courtès wrote on 20 Jul 2020 10:39
(name . zimoun)(address . zimon.toutoune@gmail.com)
87365mzil1.fsf@gnu.org
Hi!

There are many many comments in your message, so I took the liberty to
reply only to the essence of it. :-)

zimoun <zimon.toutoune@gmail.com> skribis:

> On Sat, 11 Jul 2020 at 17:50, Ludovic Courtès <ludo@gnu.org> wrote:
>
>> For the now, since 70% of our packages use ‘url-fetch’, we need to be
>> able to fetch or to reconstruct tarballs. There’s no way around it.
>
> Yes, but for example all the packages in gnu/packages/bioconductor.scm
> could be "git-fetch". Today the source is over url-fetch but it could
> be over git-fetch with https://git.bioconductor.org/packages/flowCore or
> git@git.bioconductor.org:packages/flowCore.
>
> Another example is the packages in gnu/packages/emacs-xyz.scm and the
> ones from elpa.gnu.org are "url-fetch" and could be "git-fetch", for
> example using
> http://git.savannah.gnu.org/gitweb/?p=emacs/elpa.git;a=tree;f=packages/ace-window;h=71d3eb7bd2efceade91846a56b9937812f658bae;hb=HEAD
>
> So I would be more reserved about the "no way around it". :-) I mean
> the 70% could be a bit mitigated.

The “no way around it” was about the situation today: it’s a fact that
70% of packages are built from tarballs, so we need to be able to fetch
them or reconstruct them.

However, the two examples above are good ideas as to the way forward: we
could start a url-fetch-to-git-fetch migration in these two cases, and
perhaps more.

>> In the short term, we should arrange so that the build farm keeps GC
>> roots on source tarballs for an indefinite amount of time. Cuirass
>> jobset? Mcron job to preserve GC roots? Ideas?
>
> Yes, preserving source tarballs for an indefinite amount of time will
> help. At least all the packages where "lookup-content" returns #f,
> which means they are not in SWH or they are unreachable -- both is
> equivalent from Guix side.
>
> What about in addition push to IPFS? Feasible? Lookup issue?

Lookup issue. :-) The hash in a CID is not just a raw blob hash.
Files are typically chunked beforehand, assembled as a Merkle tree, and
the CID is roughly the hash to the tree root. So it would seem we can’t
use IPFS as-is for tarballs.
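
A toy illustration of why the two hashes differ, in Python. This is not the
real IPFS encoding (which wraps chunks in its own object format and uses
multihash identifiers); the 256 KiB chunk size and the function names are
assumptions made only to show that hashing a tree of chunks does not give
the hash of the raw blob.

```python
import hashlib


def blob_hash(data):
    """Plain SHA-256 over the whole file, like a tarball hash."""
    return hashlib.sha256(data).hexdigest()


def chunked_root_hash(data, chunk_size=256 * 1024):
    """Hash each chunk, then hash the concatenated chunk hashes.

    A one-level stand-in for a Merkle tree: the result identifies the
    root node, not the raw bytes.
    """
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    if not chunks:
        chunks = [b""]
    leaves = [hashlib.sha256(chunk).digest() for chunk in chunks]
    return hashlib.sha256(b"".join(leaves)).hexdigest()
```

Since the root hash depends on the chunking parameters, the tarball hash
alone is not enough to compute the identifier; a table mapping one to the
other would be needed.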

>> For the future, we could store nar hashes of unpacked tarballs instead
>> of hashes over tarballs. But that raises two questions:
>>
>> • If we no longer deal with tarballs but upstreams keep signing
>> tarballs (not raw directory hashes), how can we authenticate our
>> code after the fact?
>
> Does Guix automatically authenticate code using signed tarballs?

Not automatically; packagers are supposed to authenticate code when they
add a package (‘guix refresh -u’ does that automatically).

>> • SWH internally store Git-tree hashes, not nar hashes, so we still
>> wouldn’t be able to fetch our unpacked trees from SWH.
>>
>> (Both issues were previously discussed at
>> <https://sympa.inria.fr/sympa/arc/swh-devel/2016-07/>.)
>>
>> So for the medium term, and perhaps for the future, a possible option
>> would be to preserve tarball metadata so we can reconstruct them:
>>
>> tarball = metadata + tree
>
> There is different issues at different levels:
>
> 1. how to lookup? what information do we need to keep/store to be able
> to query SWH?
> 2. how to check the integrity? what information do we need to
> keep/store to be able to verify that SWH returns what Guix expects?
> 3. how to authenticate? where the tarball metadata has to be stored if
> SWH removes it?
>
> Basically, the git-fetch source stores 3 identifiers:
>
> - upstream url
> - commit / tag
> - integrity (sha256)
>
> Fetching from SWH requires the commit only (lookup-revision) or the
> tag+url (lookup-origin-revision) then from the returned revision, the
> integrity of the downloaded data is checked using the sha256, right?

Yes.

> Therefore, one way to fix lookup of the url-fetch source is to add an
> extra field mimicking the commit role.

But today, we store tarball hashes, not directory hashes.

> The easiest is to store a SWHID or an identifier allowing to deduce the
> SWHID.
>
> I have not checked the code, but something like this:
>
> https://pypi.org/project/swh.model/
> https://forge.softwareheritage.org/source/swh-model/
>
> and at package time, this identifier is added, similarly to integrity.

I’m skeptical about adding a field that is practically never used.

[...]

>> The code below can “disassemble” and “assemble” a tar. When it
>> disassembles it, it generates metadata like this:
>
> [...]
>
>> The ’assemble-archive’ procedure consumes that, looks up file contents
>> by hash on SWH, and reconstructs the original tarball…
>
> Where do you plan to store the "disassembled" metadata?
> And where do you plan to "assemble-archive"?

We’d have a repo/database containing metadata indexed by tarball sha256.

> How this database that maps tarball hashes to metadata should be
> maintained? Git push hook? Cron task?

Yes, something like that. :-)

> What about foreign channels? Should they maintain their own map?

Yes, presumably.

> To summarize, it would work like this, right?
>
> at package time:
> - store an integrity identifier (today sha256-nix-base32)
> - disassemble the tarball
> - commit to another repo the metadata using the path (address)
> sha256/base32/<identifier>
> - push to packages-repo *and* metadata-database-repo
>
> at future time: (upstream has disappeared, say!)
> - use the integrity identifier to query the database repo
> - lookup the SWHID from the database repo
> - fetch the data from SWH
> - or lookup the IPFS identifier from the database repo and fetch the
> data from IPFS, for another example
> - re-assemble the tarball using the metadata from the database repo
> - check integrity, authentication, etc.

That’s the idea.

> The format of metadata (disassemble) that you propose is schemish
> (obviously! :-)) but we could propose something more JSON-like.

Sure, if that helps get other people on-board, why not (though sexps
have lived much longer than JSON and XML together :-)).

Thanks,
Ludo’.
zimoun wrote on 20 Jul 2020 17:52
(name . Ludovic Courtès)(address . ludo@gnu.org)
CAJ3okZ0iMNjv93MM1FkEB3_zXA48Rq3rKXhwwug85fNRRc41Mg@mail.gmail.com
Hi,

On Mon, 20 Jul 2020 at 10:39, Ludovic Courtès <ludo@gnu.org> wrote:
> zimoun <zimon.toutoune@gmail.com> skribis:
> > On Sat, 11 Jul 2020 at 17:50, Ludovic Courtès <ludo@gnu.org> wrote:

> There are many many comments in your message, so I took the liberty to
> reply only to the essence of it. :-)

Many comments because many open topics. ;-)


> However, the two examples above are good ideas as to the way forward: we
> could start a url-fetch-to-git-fetch migration in these two cases, and
> perhaps more.

Well, to be honest, I have tried to probe such migration when I opened
this thread:


and I have tried to summarize the pros/cons arguments here:



> > What about in addition push to IPFS? Feasible? Lookup issue?
>
> Lookup issue. :-) The hash in a CID is not just a raw blob hash.
> Files are typically chunked beforehand, assembled as a Merkle tree, and
> the CID is roughly the hash to the tree root. So it would seem we can’t
> use IPFS as-is for tarballs.

Using the Git-repo map/table, then it becomes an option, right?
Well, SWH would be a backend and IPFS could be another one. Or any
"cloudy" storage system that could appear in the future, right?


> >> • If we no longer deal with tarballs but upstreams keep signing
> >> tarballs (not raw directory hashes), how can we authenticate our
> >> code after the fact?
> >
> > Does Guix automatically authenticate code using signed tarballs?
>
> Not automatically; packagers are supposed to authenticate code when they
> add a package (‘guix refresh -u’ does that automatically).

So I miss the point of having this authentication information in the
future where upstream has disappeared.
The authentication is done at packaging time. So once it is done,
merged into master and then pushed to SWH, being able to authenticate
again does not really matter.

And if it matters, all should be updated each time vulnerabilities are
discovered and so I am not sure SWH makes sense for this use-case.


> But today, we store tarball hashes, not directory hashes.

We store what "guix hash" returns. ;-)
So it is easy to migrate from tarball hashes to whatever else. :-)
I mean, it is "(sha256 (base32" and it is easy to have also
"(sha256-tree (base32" or something like that.

In the case where the integrity is also used as lookup key.

> > The format of metadata (disassemble) that you propose is schemish
> > (obviously! :-)) but we could propose something more JSON-like.
>
> Sure, if that helps get other people on-board, why not (though sexps
> have lived much longer than JSON and XML together :-)).

Lived much longer and still less less less used than JSON or XML alone. ;-)


I have not yet done a proper back-of-the-envelope computation. Roughly,
there are ~23 commits on average per day updating packages, so if 70%
of them are url-fetch, that is ~16 new tarballs per day, on average.
How will the model using a Git repo scale? Naively, the
output of "disassemble-archive" in full text (pretty-print format) for
hello-2.10.tar is 120KB, so 16*365*120KB = ~700MB per year,
without considering all the Git internals. Obviously, it depends on
the number of files, and I do not know if hello is a representative
example.

And I do not know how Git handles binary files, should the disassembled
tarball be stored as a .go file or in any other binary format.


All the best,
simon

PS: In case someone wants to check where my estimates come from:

for ci in $(git log --after=v1.0.0 --oneline \
            | grep "gnu:" | grep -E "(Add|Update)" \
            | cut -f1 -d' ')
do
  git --no-pager log -1 $ci --format="%cs"
done | uniq -c > /tmp/commits

guix environment --ad-hoc r-minimal \
     -- R -e 'summary(read.table("/tmp/commits"))'

gzip -dc < $(guix build -S hello) > /tmp/hello.tar
guix repl -L /tmp/tar/

scheme@(guix-user)> (call-with-input-file "hello.tar"
                      (lambda (port)
                        (disassemble-archive port)))
Dr. Arne Babenhauserheide wrote on 20 Jul 2020 19:05
Re: bug#42162: Recovering source tarballs
(name . zimoun)(address . zimon.toutoune@gmail.com)
87wo2ynml7.fsf@web.de
zimoun <zimon.toutoune@gmail.com> writes:
>> > The format of metadata (disassemble) that you propose is schemish
>> > (obviously! :-)) but we could propose something more JSON-like.
>>
>> Sure, if that helps get other people on-board, why not (though sexps
>> have lived much longer than JSON and XML together :-)).
>
> Lived much longer and still less less less used than JSON or XML alone. ;-)

Though this is likely not a function of the format, but of the
popularity of both JavaScript and Java.

JSON isn’t a well-defined format for arbitrary data (try to store
numbers as keys and reason about what you get as return values), and
XML is a monster of complexity.

Best wishes,
Arne
--
Being apolitical
means being political
without noticing it

zimoun wrote on 20 Jul 2020 21:59
(name . Dr. Arne Babenhauserheide)(address . arne_bab@web.de)
CAJ3okZ2ndtsn5t38t+C_odoYDa-m8cdpFG9tnKC8FoKuoHXveA@mail.gmail.com
On Mon, 20 Jul 2020 at 19:05, Dr. Arne Babenhauserheide <arne_bab@web.de> wrote:
> zimoun <zimon.toutoune@gmail.com> writes:
> >> > The format of metadata (disassemble) that you propose is schemish
> >> > (obviously! :-)) but we could propose something more JSON-like.
> >>
> >> Sure, if that helps get other people on-board, why not (though sexps
> >> have lived much longer than JSON and XML together :-)).
> >
> > Lived much longer and still less less less used than JSON or XML alone. ;-)
>
> Though this is likely not a function of the format, but of the
> popularity of both Javascript and Java.

Well, popularity matters for attracting a broad audience and maybe getting
other people on board, if that is the aim.
JSON seems to be the de facto format, even if it has flaws. And zillions of
parsers for all the languages are floating around, which is not the
case for sexps, even if they are easier to parse.

And JSON is already used in Guix, see [1] for an example.


However, I am not convinced that JSON, or similarly sexps, will scale
well from a Tarball Heritage perspective.

All the best,
simon
zimoun wrote on 20 Jul 2020 23:27
865zahev23.fsf@gmail.com
Hi Chris,

On Mon, 13 Jul 2020 at 20:20, Christopher Baines <mail@cbaines.net> wrote:

> Going forward, being methodical as a project about storing the tarballs
> and source material for the packages is probably the way to ensure it's
> available for the future. I'm not sure the data storage cost is
> significant, the cost of doing this is probably in working out what to
> store, doing so in a redundant manner, and making the data available.

A really rough estimate is 120KB of metadata on average* per tarball. So
if we consider 14000 packages, 70% of which are url-fetch, that leads to
14k*0.7*120KB = ~1.2GB, which is not significant. Moreover, if we
extrapolate the numbers, between v1.0.0 and now there are 23 commits per
day modifying gnu/packages/, so 0.7*23*120KB*365 = ~700MB per year.
However, the 120KB of metadata needed to re-assemble the tarball has to be
compared to the 712KB of the raw compressed tarball; both numbers are for
the hello package.

*Based on the hello package; it depends on the number of files in
the tarball. The metadata is stored uncompressed, as a plain sexp.
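
For anyone who wants to replay the arithmetic, here it is spelled out as a
small Python sketch. The constants are the ones quoted in this message; the
variable and function names are made up for the example, and decimal units
(1MB = 1000KB) are assumed.

```python
PACKAGES = 14_000          # packages in Guix, as quoted above
URL_FETCH_SHARE = 0.7      # share of packages using url-fetch
METADATA_KB = 120          # disassembled hello-2.10.tar, plain sexp
COMMITS_PER_DAY = 23       # average commits touching gnu/packages/
RAW_TARBALL_KB = 712       # compressed hello tarball


def current_total_gb():
    """Metadata size for every url-fetch package in Guix today."""
    return PACKAGES * URL_FETCH_SHARE * METADATA_KB / 1_000_000


def yearly_growth_mb():
    """Metadata added per year at the current commit rate."""
    return URL_FETCH_SHARE * COMMITS_PER_DAY * METADATA_KB * 365 / 1_000


def metadata_ratio():
    """Metadata size relative to the compressed tarball it describes."""
    return METADATA_KB / RAW_TARBALL_KB
```

With these inputs the total is a little under 1.2GB, the yearly growth is
about 705MB, and the metadata weighs roughly 17% of the compressed hello
tarball.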


Therefore, in addition to what to store, redundancy and availability,
one question is how to store? Git-repo? SQL database? etc.



> The Guix Data Service knows about fixed output derivations, so it might
> be possible to backfill such a store by just attempting to build those
> derivations. It might also be possible to use the Guix Data Service to
> work out what's available, and what tarballs are missing.

Missing from where? The substitutes farm or SWH?


Cheers,
simon
Ludovic Courtès wrote on 21 Jul 2020 23:22
Re: Recovering source tarballs
(name . zimoun)(address . zimon.toutoune@gmail.com)
87k0ywlg1z.fsf@gnu.org
Hi!

zimoun <zimon.toutoune@gmail.com> skribis:

> On Mon, 20 Jul 2020 at 10:39, Ludovic Courtès <ludo@gnu.org> wrote:
>> zimoun <zimon.toutoune@gmail.com> skribis:
>> > On Sat, 11 Jul 2020 at 17:50, Ludovic Courtès <ludo@gnu.org> wrote:
>
>> There are many many comments in your message, so I took the liberty to
>> reply only to the essence of it. :-)
>
> Many comments because many open topics. ;-)

Understood, and they’re very valuable but (1) I choose not to just do
email :-), and (2) I like to separate issues in reasonable chunks rather
than long threads addressing all the problems we’ll have to deal with.

I think it really helps keep things tractable!

>> Lookup issue. :-) The hash in a CID is not just a raw blob hash.
>> Files are typically chunked beforehand, assembled as a Merkle tree, and
>> the CID is roughly the hash to the tree root. So it would seem we can’t
>> use IPFS as-is for tarballs.
>
> Using the Git-repo map/table, then it becomes an option, right?
> Well, SWH would be a backend and IPFS could be another one. Or any
> "cloudy" storage system that could appear in the future, right?

Sure, why not.

>> >> • If we no longer deal with tarballs but upstreams keep signing
>> >> tarballs (not raw directory hashes), how can we authenticate our
>> >> code after the fact?
>> >
>> > Does Guix automatically authenticate code using signed tarballs?
>>
>> Not automatically; packagers are supposed to authenticate code when they
>> add a package (‘guix refresh -u’ does that automatically).
>
> So I miss the point of having this authentication information in the
> future where upstream has disappeared.

What I meant above, is that often, what we have is things like detached
signatures of raw tarballs, or documents referring to a tarball hash:

https://sympa.inria.fr/sympa/arc/swh-devel/2016-07/msg00009.html

>> But today, we store tarball hashes, not directory hashes.
>
> We store what "guix hash" returns. ;-)
> So it is easy to migrate from tarball hashes to whatever else. :-)

True, but that other thing, as it stands, would be a nar hash (like for
‘git-fetch’), not a Git-tree hash (what SWH uses).

> I mean, it is "(sha256 (base32" and it is easy to have also
> "(sha256-tree (base32" or something like that.

Right, but that first and foremost requires daemon support.

It’s doable, but migration would have to take a long time, since this is
touching core parts of the “protocol”.

> I have not yet done a proper back-of-the-envelope computation. Roughly,
> there are ~23 commits on average per day updating packages, so if 70%
> of them are url-fetch, that is ~16 new tarballs per day, on average.
> How will the model using a Git repo scale? Naively, the
> output of "disassemble-archive" in full text (pretty-print format) for
> hello-2.10.tar is 120KB, so 16*365*120KB = ~700MB per year,
> without considering all the Git internals. Obviously, it depends on
> the number of files, and I do not know if hello is a representative
> example.

Interesting, thanks for making that calculation! We could make the
format more compact if needed.

Thanks,
Ludo’.
zimoun wrote on 22 Jul 2020 02:27
(name . Ludovic Courtès)(address . ludo@gnu.org)
86o8o81jic.fsf@gmail.com
Hi!

On Tue, 21 Jul 2020 at 23:22, Ludovic Courtès <ludo@gnu.org> wrote:

>>> >> • If we no longer deal with tarballs but upstreams keep signing
>>> >> tarballs (not raw directory hashes), how can we authenticate our
>>> >> code after the fact?
>>> >
>>> > Does Guix automatically authenticate code using signed tarballs?
>>>
>>> Not automatically; packagers are supposed to authenticate code when they
>>> add a package (‘guix refresh -u’ does that automatically).
>>
>> So I miss the point of having this authentication information in the
>> future where upstream has disappeared.
>
> What I meant above, is that often, what we have is things like detached
> signatures of raw tarballs, or documents referring to a tarball hash:
>
> https://sympa.inria.fr/sympa/arc/swh-devel/2016-07/msg00009.html

I still miss why it matters to store detached signature of raw tarballs.

The authentication is done now (at package time and/or at inclusion in the
lookup table, per the proposal). I do not see why we would have to
re-authenticate later.

IMHO, having a lookup table that returns the signatures from a tarball
hash, or an archive of all the OpenPGP keys ever published, is another
topic.


>>> But today, we store tarball hashes, not directory hashes.
>>
>> We store what "guix hash" returns. ;-)
>> So it is easy to migrate from tarball hashes to whatever else. :-)
>
> True, but that other thing, as it stands, would be a nar hash (like for
> ‘git-fetch’), not a Git-tree hash (what SWH uses).

Ok, now I am totally convinced that a lookup table is The Right Thing™. :-)

>> I mean, it is "(sha256 (base32" and it is easy to have also
>> "(sha256-tree (base32" or something like that.
>
> Right, but that first and foremost requires daemon support.
>
> It’s doable, but migration would have to take a long time, since this is
> touching core parts of the “protocol”.

Doable but not necessarily tractable. :-)


>> I have not yet done a proper back-of-the-envelope computation. Roughly,
>> there are ~23 commits on average per day updating packages, so if 70%
>> of them are url-fetch, that is ~16 new tarballs per day, on average.
>> How will the model using a Git repo scale? Naively, the
>> output of "disassemble-archive" in full text (pretty-print format) for
>> hello-2.10.tar is 120KB, so 16*365*120KB = ~700MB per year,
>> without considering all the Git internals. Obviously, it depends on
>> the number of files, and I do not know if hello is a representative
>> example.
>
> Interesting, thanks for making that calculation! We could make the
> format more compact if needed.

Compressing should help.

Considering 14000 packages, based on this 120KB estimation, it leads to:
0.7*14k*120K= ~1.2GB for the Git-repo of the current Guix.

Cheers,
simon
Ludovic Courtès wrote on 22 Jul 2020 12:28
(name . zimoun)(address . zimon.toutoune@gmail.com)
875zafkfml.fsf@gnu.org
Hello!

zimoun <zimon.toutoune@gmail.com> skribis:

> On Tue, 21 Jul 2020 at 23:22, Ludovic Courtès <ludo@gnu.org> wrote:
>
>>>> >> • If we no longer deal with tarballs but upstreams keep signing
>>>> >> tarballs (not raw directory hashes), how can we authenticate our
>>>> >> code after the fact?
>>>> >
>>>> > Does Guix automatically authenticate code using signed tarballs?
>>>>
>>>> Not automatically; packagers are supposed to authenticate code when they
>>>> add a package (‘guix refresh -u’ does that automatically).
>>>
>>> So I miss the point of having this authentication information in the
>>> future where upstream has disappeared.
>>
>> What I meant above, is that often, what we have is things like detached
>> signatures of raw tarballs, or documents referring to a tarball hash:
>>
>> https://sympa.inria.fr/sympa/arc/swh-devel/2016-07/msg00009.html
>
> I still miss why it matters to store detached signature of raw tarballs.

I’m not saying we (Guix) should store signatures; I’m just saying that
developers typically sign raw tarballs. It’s a general statement to
explain why storing or being able to reconstruct tarballs matters.

Thanks,
Ludo’.
Timothy Sample wrote on 30 Jul 2020 19:36
Re: bug#42162: Recovering source tarballs
(name . Ludovic Courtès)(address . ludo@gnu.org)
875za4ykej.fsf@ngyro.com
Hi Ludovic,

Ludovic Courtès <ludo@gnu.org> writes:

> Hi,
>
> Ludovic Courtès <ludo@gnu.org> skribis:
>
> [...]
>
> So for the medium term, and perhaps for the future, a possible option
> would be to preserve tarball metadata so we can reconstruct them:
>
> tarball = metadata + tree
>
> After all, tarballs are byproducts and should be no exception: we should
> build them from source. :-)
>
> In <https://forge.softwareheritage.org/T2430>, Stefano mentioned
> pristine-tar, which does almost that, but not quite: it stores a binary
> delta between a tarball and a tree:
>
> https://manpages.debian.org/unstable/pristine-tar/pristine-tar.1.en.html
>
> I think we should have something more transparent than a binary delta.
>
> The code below can “disassemble” and “assemble” a tar. When it
> disassembles it, it generates metadata like this:
>
> (tar-source
>   (version 0)
>   (headers
>     (("guile-3.0.4/"
>       (mode 493)
>       (size 0)
>       (mtime 1593007723)
>       (chksum 3979)
>       (typeflag #\5))
>      ("guile-3.0.4/m4/"
>       (mode 493)
>       (size 0)
>       (mtime 1593007720)
>       (chksum 4184)
>       (typeflag #\5))
>      ("guile-3.0.4/m4/pipe2.m4"
>       (mode 420)
>       (size 531)
>       (mtime 1536050419)
>       (chksum 4812)
>       (hash (sha256
>               "arx6n2rmtf66yjlwkgwp743glcpdsfzgjiqrqhfegutmcwvwvsza")))
>      ("guile-3.0.4/m4/time_h.m4"
>       (mode 420)
>       (size 5471)
>       (mtime 1536050419)
>       (chksum 4974)
>       (hash (sha256
>               "z4py26rmvsk4st7db6vwziwwhkrjjrwj7nra4al6ipqh2ms45kka")))
>      […]
>
> The ’assemble-archive’ procedure consumes that, looks up file contents
> by hash on SWH, and reconstructs the original tarball…
>
> … at least in theory, because in practice we hit the SWH rate limit
> after looking up a few files:
>
> https://archive.softwareheritage.org/api/#rate-limiting
>
> So it’s a bit ridiculous, but we may have to store a SWH “dir”
> identifier for the whole extracted tree—a Git-tree hash—since that would
> allow us to retrieve the whole thing in a single HTTP request.
>
> Besides, we’ll also have to handle compression: storing gzip/xz headers
> and compression levels.

This jumped out at me because I have been working with compression and
tarballs for the bootstrapping effort. I started pulling some threads
and doing some research, and ended up prototyping an end-to-end solution
for decomposing a Gzip’d tarball into Gzip metadata, tarball metadata,
and an SWH directory ID. It can even put them back together! :) There
are a bunch of problems still, but I think this project is doable in the
short-term. I’ve tested 100 arbitrary Gzip’d tarballs from Guix, and
found and fixed a bunch of little gaffes. There’s a ton of work to do,
of course, but here’s another small step.

I call the thing “Disarchive” as in “disassemble a source code archive”.
You can find it at https://git.ngyro.com/disarchive/. It has a simple
command-line interface so you can do

$ disarchive save software-1.0.tar.gz

which serializes a disassembled version of “software-1.0.tar.gz” to the
database (which is just a directory) specified by the “DISARCHIVE_DB”
environment variable. Next, you can run

$ disarchive load hash-of-something-in-the-db

which will recover an original file from its metadata (stored in the
database) and data retrieved from the SWH archive or taken from a cache
(again, just a directory) specified by “DISARCHIVE_DIRCACHE”.

Now some implementation details. The way I’ve set it up is that all of
the assembly happens through Guix. Each step in recreating a compressed
tarball is a fixed-output derivation: the download from SWH, the
creation of the tarball, and the compression. I wanted an easy way to
build and verify things according to a dependency graph without writing
any code. Hi Guix Daemon! I’m not sure if this is a good long-term
approach, though. It could work well for reproducibility, but it might
be easier to let some external service drive my code as a Guix package.
Either way, it was an easy way to get started.

For disassembly, it takes a Gzip file (containing a single member) and
breaks it down like this:

(gzip-member
  (version 0)
  (name "hungrycat-0.4.1.tar.gz")
  (input (sha256
           "1ifzck1b97kjm567qb0prnqag2d01x0v8lghx98w1h2gzwsmxgi1"))
  (header
    (mtime 0)
    (extra-flags 2)
    (os 3))
  (footer
    (crc 3863610951)
    (isize 194560))
  (compressor gnu-best)
  (digest
    (sha256
      "03fc1zsrf99lvxa7b4ps6pbi43304wbxh1f6ci4q0vkal370yfwh")))

The header and footer are read directly from the file. Finding the
compressor is harder. I followed the approach taken by the pristine-tar
project. That is, try a bunch of compressors and hope for a match.
Currently, I have:

• gnu-best
• gnu-best-rsync
• gnu
• gnu-rsync
• gnu-fast
• gnu-fast-rsync
• zlib-best
• zlib
• zlib-fast
• zlib-best-perl
• zlib-perl
• zlib-fast-perl
• gnu-best-rsync-1.4
• gnu-rsync-1.4
• gnu-fast-rsync-1.4

This list is inspired by pristine-tar. The first couple GNU compressors
use modern Gzip from Guix. The zlib and rsync-1.4 ones use the Gzip and
zlib wrapper from pristine-tar called “zgz”. The 100 Gzip files I
looked at use “gnu”, “gnu-best”, “gnu-best-rsync-1.4”, “zlib”,
“zlib-best”, and “zlib-fast-perl”.
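
To illustrate the "try a bunch of compressors and hope for a match" idea,
here is a rough Python sketch. It is not how Disarchive or pristine-tar do
it: it only tries the nine zlib levels with default settings, so it cannot
recognize GNU Gzip or the rsyncable variants from the list above. The
function names are made up for the example; the header layout follows the
Gzip file format (RFC 1952).

```python
import struct
import zlib


def split_gzip_member(raw):
    """Split a single-member Gzip file into (header, deflate body, footer)."""
    magic, method, flags, mtime, xfl, os_id = struct.unpack("<HBBIBB", raw[:10])
    if magic != 0x8B1F or method != 8:
        raise ValueError("not a deflate-compressed Gzip member")
    pos = 10
    if flags & 0x04:                       # FEXTRA: 2-byte length + payload
        (xlen,) = struct.unpack("<H", raw[pos:pos + 2])
        pos += 2 + xlen
    if flags & 0x08:                       # FNAME: NUL-terminated string
        pos = raw.index(b"\0", pos) + 1
    if flags & 0x10:                       # FCOMMENT: NUL-terminated string
        pos = raw.index(b"\0", pos) + 1
    if flags & 0x02:                       # FHCRC: 2-byte header checksum
        pos += 2
    crc, isize = struct.unpack("<II", raw[-8:])
    header = {"mtime": mtime, "extra-flags": xfl, "os": os_id}
    footer = {"crc": crc, "isize": isize}
    return header, raw[pos:-8], footer


def guess_zlib_level(raw):
    """Return a zlib level that reproduces the deflate body, or None."""
    _, body, _ = split_gzip_member(raw)
    data = zlib.decompress(body, wbits=-15)        # raw deflate stream
    for level in range(1, 10):
        compressor = zlib.compressobj(level, zlib.DEFLATED, -15)
        if compressor.compress(data) + compressor.flush() == body:
            return level
    return None
```

Several levels can produce the same bytes for small inputs, so the level
returned is only one that works, not necessarily the one originally used.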

(As an aside, I had a way to decompose multi-member Gzip files, but it
was much, much slower. Since I doubt they exist in the wild, I removed
that code.)

The “input” field likely points to a tarball, which looks like this:

(tarball
  (version 0)
  (name "hungrycat-0.4.1.tar")
  (input (sha256
           "02qg3z5cvq6dkdc0mxz4sami1ys668lddggf7bjhszk23xpfjm5r"))
  (default-header)
  (headers
    ((name "hungrycat-0.4.1/")
     (mode 493)
     (mtime 1513360022)
     (chksum 5058)
     (typeflag 53))
    ((name "hungrycat-0.4.1/configure")
     (mode 493)
     (size 130263)
     (mtime 1513360022)
     (chksum 6043))
    ...)
  (padding 3584)
  (digest
    (sha256
      "1ifzck1b97kjm567qb0prnqag2d01x0v8lghx98w1h2gzwsmxgi1")))

Originally, I used your code, but I ran into some problems. Namely,
real tarballs are not well-behaved. I wrote new code to keep track of
subtle things like the formatting of the octal values. Even though they
are not well-behaved, they are usually self-consistent, so I introduced
the “default-header” field to set default values for all headers. Any
omitted fields in the headers use the value from the default header, and
the default header takes defaults from a “default default header”
defined in the code. Here’s a default header from a different tarball:

(default-header
  (uid 1199)
  (gid 30)
  (magic "ustar ")
  (version " \x00")
  (uname "cagordon")
  (gname "lhea")
  (devmajor-format (width 0))
  (devminor-format (width 0)))

These default values are computed to minimize the noise in the
serialized form. Here we see for example that each header should have
UID 1199 unless otherwise specified. We also see that the device fields
should be null strings instead of octal zeros. Another good example
here is that the magic field has a space after “ustar”, which is not
what modern POSIX says to do.
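The default-header idea can be sketched roughly as follows, in Python rather than Scheme. This is an assumption-laden illustration, not the Disarchive algorithm: it assumes every header carries the same set of fields, represents headers as plain dictionaries, and picks the most frequent value of each field as the default.

```python
from collections import Counter


def split_defaults(headers):
    """Split header dicts into (default_header, trimmed_headers).

    Assumes every header has the same fields.  A value becomes the default
    for its field only if it occurs more than once.
    """
    default = {}
    fields = sorted({field for header in headers for field in header})
    for field in fields:
        counts = Counter(header[field] for header in headers)
        value, occurrences = counts.most_common(1)[0]
        if occurrences > 1:
            default[field] = value
    trimmed = [
        {k: v for k, v in header.items() if default.get(k) != v}
        for header in headers
    ]
    return default, trimmed


def merge_defaults(default, trimmed):
    """Inverse of split_defaults: fill omitted fields from the default."""
    return [{**default, **header} for header in trimmed]
```

With this split, a field such as the UID is stored once in the default header and repeated only in the entries that deviate from it.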

My tarball reader has minimal support for extended headers, but they are
not serialized cleanly (they survive the round-trip, but they are not
human-readable).

Finally, the “input” field here points to an “swh-directory” object. It
looks like this:

(swh-directory
(version 0)
(name "hungrycat-0.4.1")
(id "0496abd5a2e9e05c9fe20ae7684f48130ef6124a")
(digest
(sha256
"02qg3z5cvq6dkdc0mxz4sami1ys668lddggf7bjhszk23xpfjm5r")))

I have a little module for computing the directory hash like SWH does
(which is in turn like what Git does). I did not verify that the 100
packages were in the SWH archive. I did verify a couple of packages,
but I hit the rate limit and decided to avoid it for now.
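For readers unfamiliar with how Git identifies a directory, here is a minimal Python sketch of the Git tree hash that the directory ID is modelled on. It is a simplification under stated assumptions: the tree is an in-memory dictionary (not a directory on disk), every file is a regular non-executable file, and symlinks are not handled.

```python
import hashlib


def git_object_id(kind: str, body: bytes) -> str:
    """SHA-1 of a Git object: '<kind> <length>' + NUL + body."""
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def tree_id(tree: dict) -> str:
    """Hash a tree mapping names to bytes (file) or dict (subdirectory)."""
    entries = []
    for name, node in tree.items():
        raw_name = name.encode()
        if isinstance(node, dict):
            # Git sorts directories as if their name ended with a slash.
            mode, oid, sort_key = b"40000", tree_id(node), raw_name + b"/"
        else:
            mode, oid, sort_key = b"100644", git_object_id("blob", node), raw_name
        entry = mode + b" " + raw_name + b"\0" + bytes.fromhex(oid)
        entries.append((sort_key, entry))
    body = b"".join(entry for _, entry in sorted(entries))
    return git_object_id("tree", body)
```

Because the ID depends only on names, modes, and contents, two unpacked tarballs with identical trees get the same ID regardless of timestamps or ownership, which is why that metadata has to be kept separately.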

To avoid hitting the SWH archive at all, I introduced a directory cache
so that I can store the directories locally. If the directory cache is
available, directories are stored and retrieved from it.

> How would we put that in practice? Good question. :-)
>
> I think we’d have to maintain a database that maps tarball hashes to
> metadata (!). A simple version of it could be a Git repo where, say,
> ‘sha256/0mq9fc0ig0if5x9zjrs78zz8gfzczbvykj2iwqqd6salcqdgdwhk’ would
> contain the metadata above. The nice thing is that the Git repo itself
> could be archived by SWH. :-)

You mean like <https://git.ngyro.com/disarchive-db/>? :)

This was generated by a little script built on top of “fold-packages”.
It downloads Gzip’d tarballs used by Guix packages and passes them on to
Disarchive for disassembly. I limited the number to 100 because it’s
slow and because I’m sure there is a long tail of weird software
archives that are going to be hard to process. The metadata directory
ended up being 13M and the directory cache 2G.

> Thus, if a tarball vanishes, we’d look it up in the database and
> reconstruct it from its metadata plus content store in SWH.
>
> Thoughts?

Obviously I like the idea. ;)

Even with the code I have so far, I have a lot of questions. Mainly I’m
worried about keeping everything working into the future. It would be
easy to make incompatible changes. A lot of care would have to be
taken. Of course, keeping a Guix commit and a Disarchive commit might
be enough to make any assembling reproducible, but there’s a
chicken-and-egg problem there. What if a tarball from the closure of
one of the derivations is missing? I guess you could work around it, but
it would be tricky.

> Anyhow, we should team up with fellow NixOS and SWH hackers to address
> this, and with developers of other distros as well—this problem is not
> just that of the functional deployment geeks, is it?

I could remove most of the Guix stuff so that it would be easy to
package in Guix, Nix, Debian, etc. Then, someone™ could write a service
that consumes a “sources.json” file, adds the sources to a Disarchive
database, and pushes everything to a Git repo. I guess everyone who
cares has to produce a “sources.json” file anyway, so it will be very
little extra work. Other stuff like changing the serialization format
to JSON would be pretty easy, too. I’m not well connected to these
other projects, mind you, so I’m not really sure how to reach out.

Sorry about the big mess of code and ideas – I realize I may have taken
the “do-ocracy” approach a little far here. :) Even if this is not
“the” solution, hopefully it’s useful for discussion!


-- Tim
Ludovic Courtès wrote on 31 Jul 2020 16:41
(name . Timothy Sample)(address . samplet@ngyro.com)
87bljvu4p4.fsf@gnu.org
Hi Timothy!

Timothy Sample <samplet@ngyro.com> skribis:

> This jumped out at me because I have been working with compression and
> tarballs for the bootstrapping effort. I started pulling some threads
> and doing some research, and ended up prototyping an end-to-end solution
> for decomposing a Gzip’d tarball into Gzip metadata, tarball metadata,
> and an SWH directory ID. It can even put them back together! :) There
> are a bunch of problems still, but I think this project is doable in the
> short-term. I’ve tested 100 arbitrary Gzip’d tarballs from Guix, and
> found and fixed a bunch of little gaffes. There’s a ton of work to do,
> of course, but here’s another small step.
>
> I call the thing “Disarchive” as in “disassemble a source code archive”.
> You can find it at <https://git.ngyro.com/disarchive/>. It has a simple
> command-line interface so you can do
>
> $ disarchive save software-1.0.tar.gz
>
> which serializes a disassembled version of “software-1.0.tar.gz” to the
> database (which is just a directory) specified by the “DISARCHIVE_DB”
> environment variable. Next, you can run
>
> $ disarchive load hash-of-something-in-the-db
>
> which will recover an original file from its metadata (stored in the
> database) and data retrieved from the SWH archive or taken from a cache
> (again, just a directory) specified by “DISARCHIVE_DIRCACHE”.

Wooohoo! Is it that time of the year when people give presents to one
another? I can’t believe it. :-)

> Now some implementation details. The way I’ve set it up is that all of
> the assembly happens through Guix. Each step in recreating a compressed
> tarball is a fixed-output derivation: the download from SWH, the
> creation of the tarball, and the compression. I wanted an easy way to
> build and verify things according to a dependency graph without writing
> any code. Hi Guix Daemon! I’m not sure if this is a good long-term
> approach, though. It could work well for reproducibility, but it might
> be easier to let some external service drive my code as a Guix package.
> Either way, it was an easy way to get started.
>
> For disassembly, it takes a Gzip file (containing a single member) and
> breaks it down like this:
>
> (gzip-member
> (version 0)
> (name "hungrycat-0.4.1.tar.gz")
> (input (sha256
> "1ifzck1b97kjm567qb0prnqag2d01x0v8lghx98w1h2gzwsmxgi1"))
> (header
> (mtime 0)
> (extra-flags 2)
> (os 3))
> (footer
> (crc 3863610951)
> (isize 194560))
> (compressor gnu-best)
> (digest
> (sha256
> "03fc1zsrf99lvxa7b4ps6pbi43304wbxh1f6ci4q0vkal370yfwh")))

Awesome.

> The header and footer are read directly from the file. Finding the
> compressor is harder. I followed the approach taken by the pristine-tar
> project. That is, try a bunch of compressors and hope for a match.
> Currently, I have:
>
> • gnu-best
> • gnu-best-rsync
> • gnu
> • gnu-rsync
> • gnu-fast
> • gnu-fast-rsync
> • zlib-best
> • zlib
> • zlib-fast
> • zlib-best-perl
> • zlib-perl
> • zlib-fast-perl
> • gnu-best-rsync-1.4
> • gnu-rsync-1.4
> • gnu-fast-rsync-1.4

I would have used the integers that zlib supports, but I guess that
doesn’t capture this whole gamut of compression setups. And yeah, it’s
not great that we actually have to try and find the right compression
levels, but there’s no way around it, it seems, and as you write, we can
expect a couple of variants to be the most commonly used ones.

> The “input” field likely points to a tarball, which looks like this:
>
> (tarball
> (version 0)
> (name "hungrycat-0.4.1.tar")
> (input (sha256
> "02qg3z5cvq6dkdc0mxz4sami1ys668lddggf7bjhszk23xpfjm5r"))
> (default-header)
> (headers
> ((name "hungrycat-0.4.1/")
> (mode 493)
> (mtime 1513360022)
> (chksum 5058)
> (typeflag 53))
> ((name "hungrycat-0.4.1/configure")
> (mode 493)
> (size 130263)
> (mtime 1513360022)
> (chksum 6043))
> ...)
> (padding 3584)
> (digest
> (sha256
> "1ifzck1b97kjm567qb0prnqag2d01x0v8lghx98w1h2gzwsmxgi1")))
>
> Originally, I used your code, but I ran into some problems. Namely,
> real tarballs are not well-behaved. I wrote new code to keep track of
> subtle things like the formatting of the octal values.

Yeah I guess I was too optimistic. :-) I wanted to have the
serialization/deserialization code automatically generated by that
macro, but yeah, it doesn’t capture enough details for real-world
tarballs.

Do you know how frequently you get “weird” tarballs? I was thinking
about having something that works for plain GNU tar, but it’s even
better to have something that works with “unusual” tarballs!

(BTW the code I posted or the one in Disarchive could perhaps replace
the one in Gash-Utils. I was frustrated to not see a ‘fold-archive’
procedure there, notably.)

> Even though they are not well-behaved, they are usually
> self-consistent, so I introduced the “default-header” field to set
> default values for all headers. Any omitted fields in the headers use
> the value from the default header, and the default header takes
> defaults from a “default default header” defined in the code. Here’s
> a default header from a different tarball:
>
> (default-header
> (uid 1199)
> (gid 30)
> (magic "ustar ")
> (version " \x00")
> (uname "cagordon")
> (gname "lhea")
> (devmajor-format (width 0))
> (devminor-format (width 0)))

Very nice.

> Finally, the “input” field here points to an “swh-directory” object. It
> looks like this:
>
> (swh-directory
> (version 0)
> (name "hungrycat-0.4.1")
> (id "0496abd5a2e9e05c9fe20ae7684f48130ef6124a")
> (digest
> (sha256
> "02qg3z5cvq6dkdc0mxz4sami1ys668lddggf7bjhszk23xpfjm5r")))

Yay!

> I have a little module for computing the directory hash like SWH does
> (which is in turn like what Git does). I did not verify that the 100
> packages were in the SWH archive. I did verify a couple of packages,
> but I hit the rate limit and decided to avoid it for now.
>
> To avoid hitting the SWH archive at all, I introduced a directory cache
> so that I can store the directories locally. If the directory cache is
> available, directories are stored and retrieved from it.

I guess we can get back to them eventually to estimate our coverage ratio.

>> I think we’d have to maintain a database that maps tarball hashes to
>> metadata (!). A simple version of it could be a Git repo where, say,
>> ‘sha256/0mq9fc0ig0if5x9zjrs78zz8gfzczbvykj2iwqqd6salcqdgdwhk’ would
>> contain the metadata above. The nice thing is that the Git repo itself
>> could be archived by SWH. :-)
>
> You mean like <https://git.ngyro.com/disarchive-db/>? :)

Woow. :-)

We could actually have a CI job to create the database: it would
basically do ‘disarchive save’ for each tarball and store that using a
layout like the one you used. Then we could have a job somewhere that
periodically fetches that and adds it to the database. WDYT?

I think we should leave room for other hash algorithms (in the sexps
above too).

> This was generated by a little script built on top of “fold-packages”.
> It downloads Gzip’d tarballs used by Guix packages and passes them on to
> Disarchive for disassembly. I limited the number to 100 because it’s
> slow and because I’m sure there is a long tail of weird software
> archives that are going to be hard to process. The metadata directory
> ended up being 13M and the directory cache 2G.

Neat.

So it does mean that we could pretty much right away add a fall-back in
(guix download) that looks up tarballs in your database and uses
Disarchive to reconstruct it, right? I love solved problems. :-)

Of course we could improve Disarchive and the database, but it seems to
me that we already have enough to improve the situation. WDYT?

> Even with the code I have so far, I have a lot of questions. Mainly I’m
> worried about keeping everything working into the future. It would be
> easy to make incompatible changes. A lot of care would have to be
> taken. Of course, keeping a Guix commit and a Disarchive commit might
> be enough to make any assembling reproducible, but there’s a
> chicken-and-egg problem there.

The way I see it, Guix would always look up tarballs in the HEAD of the
database (no need to pick a specific commit). Worst that could happen
is we reconstruct a tarball that doesn’t match, and so the daemon errors
out.

Regarding future-proofness, I think we must be super careful about the
file formats (the sexps). You did pay attention to not having implicit
defaults, which is perfect. Perhaps one thing to change (or perhaps
it’s already there) is support for other hashes in those sexps: both
hash algorithms and directory hash methods (SWH dir/Git tree, nar, Git
tree with different hash algorithm, IPFS CID, etc.). Also the ability
to specify several hashes.

That way we could “refresh” the database anytime by adding the hash du
jour for already-present tarballs.

> What if a tarball from the closure of one of the derivations is missing?
> I guess you could work around it, but it would be tricky.

Well, more generally, we’ll have to monitor archive coverage. But I
don’t think the issue is specific to this method.

>> Anyhow, we should team up with fellow NixOS and SWH hackers to address
>> this, and with developers of other distros as well—this problem is not
>> just that of the functional deployment geeks, is it?
>
> I could remove most of the Guix stuff so that it would be easy to
> package in Guix, Nix, Debian, etc. Then, someone™ could write a service
> that consumes a “sources.json” file, adds the sources to a Disarchive
> database, and pushes everything to a Git repo. I guess everyone who
> cares has to produce a “sources.json” file anyway, so it will be very
> little extra work. Other stuff like changing the serialization format
> to JSON would be pretty easy, too. I’m not well connected to these
> other projects, mind you, so I’m not really sure how to reach out.

If you feel like it, you’re welcome to point them to your work in the
discussion at https://forge.softwareheritage.org/T2430. There’s one
person from NixOS (lewo) participating in the discussion and I’m sure
they’d be interested. Perhaps they’ll tell whether they care about
having it available as JSON.

> Sorry about the big mess of code and ideas – I realize I may have taken
> the “do-ocracy” approach a little far here. :) Even if this is not
> “the” solution, hopefully it’s useful for discussion!

You did great! I had a very rough sketch and you did the real thing,
that’s just awesome. :-)

Thanks a lot!

Ludo’.
Timothy Sample wrote on 3 Aug 2020 18:59
(name . Ludovic Courtès)(address . ludo@gnu.org)
87d047u0l3.fsf@ngyro.com
Hi Ludovic,

Ludovic Courtès <ludo@gnu.org> writes:

> Wooohoo! Is it that time of the year when people give presents to one
> another? I can’t believe it. :-)

Not to be too cynical, but I think it’s just the time of year that I get
frustrated with what I should be working on, and start fantasizing about
green-field projects. :p

> Timothy Sample <samplet@ngyro.com> skribis:
>
>> The header and footer are read directly from the file. Finding the
>> compressor is harder. I followed the approach taken by the pristine-tar
>> project. That is, try a bunch of compressors and hope for a match.
>> Currently, I have:
>>
>> • gnu-best
>> • gnu-best-rsync
>> • gnu
>> • gnu-rsync
>> • gnu-fast
>> • gnu-fast-rsync
>> • zlib-best
>> • zlib
>> • zlib-fast
>> • zlib-best-perl
>> • zlib-perl
>> • zlib-fast-perl
>> • gnu-best-rsync-1.4
>> • gnu-rsync-1.4
>> • gnu-fast-rsync-1.4
>
> I would have used the integers that zlib supports, but I guess that
> doesn’t capture this whole gamut of compression setups. And yeah, it’s
> not great that we actually have to try and find the right compression
> levels, but there’s no way around it, it seems, and as you write, we can
> expect a couple of variants to be the most commonly used ones.

My first instinct was “this is impossible – a DEFLATE compressor can do
just about whatever it wants!” Then I looked at pristine-tar and
realized that their hack probably works pretty well. If I had infinite
time, I would think about some kind of fully general, parameterized LZ77
algorithm that could describe any implementation. If I had a lot of
time I would peel back the curtain on Gzip and zlib and expose their
tuning parameters. That would be nicer, but keep in mind we will have
to cover XZ, bzip2, and ZIP, too! There’s a bit of balance between
quality and coverage. Any improvement to the representation of the
compression algorithm could be implemented easily: just replace the
names with their improved representation.

One thing pristine-tar does is reorder the compressor list based on the
input metadata. A Gzip member usually stores its compression level, so
it makes sense to try everything at that level first before moving on.
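A rough Python sketch of that reordering, using the compressor names from the list earlier in the thread. It relies on the "extra flags" (XFL) byte of the Gzip header, where 2 conventionally means maximum compression and 4 means fastest. The helper names and the name-matching rule are assumptions made for this illustration, not pristine-tar's or Disarchive's actual logic.

```python
import struct


def gzip_extra_flags(member: bytes) -> int:
    """Return the XFL byte from the fixed 10-byte Gzip member header."""
    magic, _method, _flags, _mtime, xfl, _os = struct.unpack(
        "<HBBIBB", member[:10]
    )
    if magic != 0x8B1F:
        raise ValueError("not a Gzip member")
    return xfl


def reorder(candidates, xfl):
    """Move candidates matching the level hinted by XFL to the front."""
    hint = {2: "best", 4: "fast"}.get(xfl)
    if hint is None:
        return list(candidates)
    # sorted() is stable, so the relative order within each group is kept.
    return sorted(candidates, key=lambda name: hint not in name.split("-"))
```

The hint only changes the order of the search, never the result, so a misleading header costs time but cannot produce a wrong match.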

>> Originally, I used your code, but I ran into some problems. Namely,
>> real tarballs are not well-behaved. I wrote new code to keep track of
>> subtle things like the formatting of the octal values.
>
> Yeah I guess I was too optimistic. :-) I wanted to have the
> serialization/deserialization code automatically generated by that
> macro, but yeah, it doesn’t capture enough details for real-world
> tarballs.

I enjoyed your implementation! I might even bring back its style. It
was a little stiff for trying to figure out exactly what I needed for
reproducing the tarballs.

> Do you know how frequently you get “weird” tarballs? I was thinking
> about having something that works for plain GNU tar, but it’s even
> better to have something that works with “unusual” tarballs!

I don’t have hard numbers, but I would say that a good handful (5–10%)
have “X-format” fields, meaning their octal formatting is unusual. (I’m
looking at “grep -A 10 default-header” over all the S-Exp files.) The
most charming thing is the “uname” and “gname” fields. For example,
“rtmidi-4.0.0” was made by “gary” from “staff”. :)
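To show why the octal formatting has to be recorded at all, here is a small Python sketch. The same number can be written into a tar header field in several ways, so a bit-for-bit reconstruction needs more than the value. The "canonical" form chosen here (zero-padded digits plus one NUL) and the dictionary representation are assumptions for the example; Disarchive's "X-format" fields are structured differently.

```python
def canonical(value: int, width: int) -> bytes:
    """Zero-padded octal digits followed by one NUL, filling the field."""
    return format(value, "o").zfill(width - 1).encode() + b"\0"


def parse_octal(field: bytes) -> int:
    """Read a numeric tar field, tolerating space and NUL padding."""
    digits = field.strip(b" \0")
    return int(digits, 8) if digits else 0


def describe(field: bytes) -> dict:
    """Keep the raw bytes only when they differ from the canonical form."""
    value = parse_octal(field)
    if field == canonical(value, len(field)):
        return {"value": value}
    return {"value": value, "raw": field}


def render(description: dict, width: int) -> bytes:
    """Inverse of describe for a field of the given width."""
    return description.get("raw", canonical(description["value"], width))
```

An all-NUL device field, for instance, parses as zero but does not match the canonical encoding of zero, so its raw form has to be kept.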

> (BTW the code I posted or the one in Disarchive could perhaps replace
> the one in Gash-Utils. I was frustrated to not see a ‘fold-archive’
> procedure there, notably.)

I really like “fold-archive”. One of the reasons I started doing this
is to possibly share code with Gash-Utils. It’s not as easy as I was
hoping, but I’m planning on improving things there based on my
experience here. I’ve now worked with four Scheme tar implementations,
maybe if I write a really good one I could cap that number at five!

>> To avoid hitting the SWH archive at all, I introduced a directory cache
>> so that I can store the directories locally. If the directory cache is
>> available, directories are stored and retrieved from it.
>
> I guess we can get back to them eventually to estimate our coverage ratio.

It would be nice to know, but pretty hard to find out with the rate
limit. I guess it will improve immensely when we set up a
“sources.json” file.

>
> Woow. :-)
>
> We could actually have a CI job to create the database: it would
> basically do ‘disarchive save’ for each tarball and store that using a
> layout like the one you used. Then we could have a job somewhere that
> periodically fetches that and adds it to the database. WDYT?

Maybe.... I assume that Disarchive would fail for a few of them. We
would need a plan for monitoring those failures so that Disarchive can
be improved. Also, unless I’m misunderstanding something, this means
building the whole database at every commit, no? That would take a lot
of time and space. On the other hand, it would be easy enough to try.
If it works, it’s a lot easier than setting up a whole other service.

> I think we should leave room for other hash algorithms (in the sexps
> above too).

It works for different hash algorithms, but not for different directory
hashing methods (like you mention below).

>> This was generated by a little script built on top of “fold-packages”.
>> It downloads Gzip’d tarballs used by Guix packages and passes them on to
>> Disarchive for disassembly. I limited the number to 100 because it’s
>> slow and because I’m sure there is a long tail of weird software
>> archives that are going to be hard to process. The metadata directory
>> ended up being 13M and the directory cache 2G.
>
> Neat.
>
> So it does mean that we could pretty much right away add a fall-back in
> (guix download) that looks up tarballs in your database and uses
> Disarchive to reconstruct it, right? I love solved problems. :-)
>
> Of course we could improve Disarchive and the database, but it seems to
> me that we already have enough to improve the situation. WDYT?

I would say that we are darn close! In theory it would work. It would
be much more practical if we had better coverage in the SWH archive
(i.e., “sources.json”) and a way to get metadata for a source archive
without downloading the entire Disarchive database. It’s 13M now, but
it will likely be 500M with all the Gzip’d tarballs from a recent commit
of Guix. It will only grow after that, too.

Of course those are not hard blockers, so ‘(guix download)’ could start
using Disarchive as soon as we package it. I’ve started looking into
it, but I’m confused about getting access to Disarchive from the
“out-of-band” download system. Would it have to become a dependency of
Guix?

>> Even with the code I have so far, I have a lot of questions. Mainly I’m
>> worried about keeping everything working into the future. It would be
>> easy to make incompatible changes. A lot of care would have to be
>> taken. Of course, keeping a Guix commit and a Disarchive commit might
>> be enough to make any assembling reproducible, but there’s a
>> chicken-and-egg problem there.
>
> The way I see it, Guix would always look up tarballs in the HEAD of the
> database (no need to pick a specific commit). Worst that could happen
> is we reconstruct a tarball that doesn’t match, and so the daemon errors
> out.

I was imagining an escape hatch beyond this, where one could look up a
provenance record from when Disarchive ingested and verified a source
code archive. The provenance record would tell you which version of
Guix was used when saving the archive, so you could try your luck with
using “guix time-machine” to reproduce Disarchive’s original
computation. If we perform database migrations, you would need to
travel back in time in the database, too. The idea is that you could
work around breakages in Disarchive automatically using the Power of
Guix™. Just a stray thought, really.

> Regarding future-proofness, I think we must be super careful about the
> file formats (the sexps). You did pay attention to not having implicit
> defaults, which is perfect. Perhaps one thing to change (or perhaps
> it’s already there) is support for other hashes in those sexps: both
> hash algorithms and directory hash methods (SWH dir/Git tree, nar, Git
> tree with different hash algorithm, IPFS CID, etc.). Also the ability
> to specify several hashes.
>
> That way we could “refresh” the database anytime by adding the hash du
> jour for already-present tarballs.

The hash algorithm is already configurable, but the directory hash
method is not. You’re right that it should be, and that there should be
support for multiple digests.
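As a sketch of what "multiple digests" could mean in practice, the following Python records several digests for the same bytes and verifies against whichever recorded algorithms are available. The function names and the hexadecimal encoding are assumptions for illustration only; the sexps in this thread use Guix's base32 encoding, and directory hashing methods are out of scope here.

```python
import hashlib


def digests(data: bytes, algorithms=("sha256", "sha512")) -> dict:
    """Compute one hex digest per requested algorithm."""
    return {name: hashlib.new(name, data).hexdigest() for name in algorithms}


def matches(data: bytes, recorded: dict) -> bool:
    """Check data against every recorded digest whose algorithm is known.

    Unknown algorithms are skipped; if none is known, nothing is verified
    and the result is False.
    """
    known = [name for name in recorded if name in hashlib.algorithms_available]
    if not known:
        return False
    return all(hashlib.new(n, data).hexdigest() == recorded[n] for n in known)
```

Recording more than one digest lets an old entry be checked with a newer algorithm later without discarding the original hash.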

>> What if a tarball from the closure of one of the derivations is missing?
>> I guess you could work around it, but it would be tricky.
>
> Well, more generally, we’ll have to monitor archive coverage. But I
> don’t think the issue is specific to this method.

Again, I’m thinking about the case where I want to travel back in time
to reproduce a Disarchive computation. It’s really an unlikely
scenario, I’m just trying to think of everything that could go wrong.

>>> Anyhow, we should team up with fellow NixOS and SWH hackers to address
>>> this, and with developers of other distros as well—this problem is not
>>> just that of the functional deployment geeks, is it?
>>
>> I could remove most of the Guix stuff so that it would be easy to
>> package in Guix, Nix, Debian, etc. Then, someone™ could write a service
>> that consumes a “sources.json” file, adds the sources to a Disarchive
>> database, and pushes everything to a Git repo. I guess everyone who
>> cares has to produce a “sources.json” file anyway, so it will be very
>> little extra work. Other stuff like changing the serialization format
>> to JSON would be pretty easy, too. I’m not well connected to these
>> other projects, mind you, so I’m not really sure how to reach out.
>
> If you feel like it, you’re welcome to point them to your work in the
> discussion at <https://forge.softwareheritage.org/T2430>. There’s one
> person from NixOS (lewo) participating in the discussion and I’m sure
> they’d be interested. Perhaps they’ll tell whether they care about
> having it available as JSON.

Good idea. I will work out a few more kinks and then bring it up there.
I’ve already rewritten the parts that used the Guix daemon. Disarchive
now only needs a handful of Guix modules ('base32', 'serialization', and
'swh' are the ones that would be hard to remove).

>> Sorry about the big mess of code and ideas – I realize I may have taken
>> the “do-ocracy” approach a little far here. :) Even if this is not
>> “the” solution, hopefully it’s useful for discussion!
>
> You did great! I had a very rough sketch and you did the real thing,
> that’s just awesome. :-)
>
> Thanks a lot!

My pleasure! Thanks for the feedback so far.


-- Tim
Ricardo Wurmus wrote on 3 Aug 2020 23:10
(name . zimoun)(address . zimon.toutoune@gmail.com)(address . 42162@debbugs.gnu.org)
87r1snfnb1.fsf@elephly.net
zimoun <zimon.toutoune@gmail.com> writes:

> Yes, but for example all the packages in gnu/packages/bioconductor.scm
> could be "git-fetch". Today the source is over url-fetch but it could
> be over git-fetch with https://git.bioconductor.org/packages/flowCore or
> git@git.bioconductor.org:packages/flowCore.

We should do that (and soon), especially because Bioconductor does not
keep an archive of old releases. We can discuss this on a separate
issue lest we derail the discussion at hand.

--
Ricardo
Ludovic Courtès wrote on 5 Aug 2020 19:14
(name . Timothy Sample)(address . samplet@ngyro.com)
87wo2dnhgb.fsf@gnu.org
Hello!

Timothy Sample <samplet@ngyro.com> skribis:

> Ludovic Courtès <ludo@gnu.org> writes:
>
>> Wooohoo! Is it that time of the year when people give presents to one
>> another? I can’t believe it. :-)
>
> Not to be too cynical, but I think it’s just the time of year that I get
> frustrated with what I should be working on, and start fantasizing about
> green-field projects. :p

:-)

>> Timothy Sample <samplet@ngyro.com> skribis:
>>
>>> The header and footer are read directly from the file. Finding the
>>> compressor is harder. I followed the approach taken by the pristine-tar
>>> project. That is, try a bunch of compressors and hope for a match.
>>> Currently, I have:
>>>
>>> • gnu-best
>>> • gnu-best-rsync
>>> • gnu
>>> • gnu-rsync
>>> • gnu-fast
>>> • gnu-fast-rsync
>>> • zlib-best
>>> • zlib
>>> • zlib-fast
>>> • zlib-best-perl
>>> • zlib-perl
>>> • zlib-fast-perl
>>> • gnu-best-rsync-1.4
>>> • gnu-rsync-1.4
>>> • gnu-fast-rsync-1.4
>>
>> I would have used the integers that zlib supports, but I guess that
>> doesn’t capture this whole gamut of compression setups. And yeah, it’s
>> not great that we actually have to try and find the right compression
>> levels, but there’s no way around it, it seems, and as you write, we can
>> expect a couple of variants to be the most commonly used ones.
>
> My first instinct was “this is impossible – a DEFLATE compressor can do
> just about whatever it wants!” Then I looked at pristine-tar and
> realized that their hack probably works pretty well. If I had infinite
> time, I would think about some kind of fully general, parameterized LZ77
> algorithm that could describe any implementation. If I had a lot of
> time I would peel back the curtain on Gzip and zlib and expose their
> tuning parameters. That would be nicer, but keep in mind we will have
> to cover XZ, bzip2, and ZIP, too! There’s a bit of balance between
> quality and coverage. Any improvement to the representation of the
> compression algorithm could be implemented easily: just replace the
> names with their improved representation.

Yup, it makes sense to not spend too much time on this bit. I guess
we’d already have good coverage with gzip and xz.

>> (BTW the code I posted or the one in Disarchive could perhaps replace
>> the one in Gash-Utils. I was frustrated to not see a ‘fold-archive’
>> procedure there, notably.)
>
> I really like “fold-archive”. One of the reasons I started doing this
> is to possibly share code with Gash-Utils. It’s not as easy as I was
> hoping, but I’m planning on improving things there based on my
> experience here. I’ve now worked with four Scheme tar implementations,
> maybe if I write a really good one I could cap that number at five!

Heh. :-) The needs are different anyway. In Gash-Utils the focus is
probably on simplicity/maintainability, whereas here you really want to
cover all the details of the wire representation.

>>> To avoid hitting the SWH archive at all, I introduced a directory cache
>>> so that I can store the directories locally. If the directory cache is
>>> available, directories are stored and retrieved from it.
>>
>> I guess we can get back to them eventually to estimate our coverage ratio.
>
> It would be nice to know, but pretty hard to find out with the rate
> limit. I guess it will improve immensely when we set up a
> “sources.json” file.

Note that we have https://guix.gnu.org/sources.json. Last I checked,
SWH was ingesting it in its “qualification” instance, so it should be
ingesting it for good real soon if it’s not doing it already.

>>
>> Woow. :-)
>>
>> We could actually have a CI job to create the database: it would
>> basically do ‘disarchive save’ for each tarball and store that using a
>> layout like the one you used. Then we could have a job somewhere that
>> periodically fetches that and adds it to the database. WDYT?
>
> Maybe.... I assume that Disarchive would fail for a few of them. We
> would need a plan for monitoring those failures so that Disarchive can
> be improved. Also, unless I’m misunderstanding something, this means
> building the whole database at every commit, no? That would take a lot
> of time and space. On the other hand, it would be easy enough to try.
> If it works, it’s a lot easier than setting up a whole other service.

One can easily write a procedure that takes a tarball and returns a
<computed-file> that builds its database entry. So at each commit, we’d
just rebuild things that have changed.

>> I think we should leave room for other hash algorithms (in the sexps
>> above too).
>
> It works for different hash algorithms, but not for different directory
> hashing methods (like you mention below).

OK.

[...]

>> So it does mean that we could pretty much right away add a fall-back in
>> (guix download) that looks up tarballs in your database and uses
>> Disarchive to reconstruct it, right? I love solved problems. :-)
>>
>> Of course we could improve Disarchive and the database, but it seems to
>> me that we already have enough to improve the situation. WDYT?
>
> I would say that we are darn close! In theory it would work. It would
> be much more practical if we had better coverage in the SWH archive
> (i.e., “sources.json”) and a way to get metadata for a source archive
> without downloading the entire Disarchive database. It’s 13M now, but
> it will likely be 500M with all the Gzip’d tarballs from a recent commit
> of Guix. It will only grow after that, too.

If we expose the database over HTTP (like over cgit), we can arrange so
that (guix download) simply GETs db.example.org/sha256/xyz. No need to
fetch the whole database.

It might be more reasonable to have a real database and a real service
around it, I’m sure Chris Baines would agree ;-), but we can choose URLs
that could easily be implemented by a “real” service instead of cgit in
the future.
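
As a rough sketch of that URL layout (in Python, for illustration only): the host db.example.org comes from the example above, while the function name and the choice to percent-encode each path component are assumptions.

```python
from urllib.parse import quote


def entry_url(base: str, algorithm: str, digest: str) -> str:
    """Return the URL of the metadata entry for DIGEST under BASE.

    The layout is BASE/ALGORITHM/DIGEST, so a static file server (such as
    cgit's plain view) and a future dedicated service can both serve it.
    """
    return "/".join([base.rstrip("/"),
                     quote(algorithm, safe=""),
                     quote(digest, safe="")])


print(entry_url("https://db.example.org", "sha256", "xyz"))
```

A client would then issue a single GET for that URL (for instance with urllib.request.urlopen) rather than cloning the whole repository.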

> Of course those are not hard blockers, so ‘(guix download)’ could start
> using Disarchive as soon as we package it. I’ve started looking into
> it, but I’m confused about getting access to Disarchive from the
> “out-of-band” download system. Would it have to become a dependency of
> Guix?

Yes. It could be a behind-the-scenes dependency of “builtin:download”;
it doesn’t have to be a dependency of each and every fixed-output
derivation.

> I was imagining an escape hatch beyond this, where one could look up a
> provenance record from when Disarchive ingested and verified a source
> code archive. The provenance record would tell you which version of
> Guix was used when saving the archive, so you could try your luck with
> using “guix time-machine” to reproduce Disarchive’s original
> computation. If we perform database migrations, you would need to
> travel back in time in the database, too. The idea is that you could
> work around breakages in Disarchive automatically using the Power of
> Guix™. Just a stray thought, really.

Seems to me it Shouldn’t Be Necessary? :-)

I mean, as long as the format is extensible and “future-proof”, we’ll
always be able to rebuild tarballs and then re-disassemble them if we
need to compute new hashes or whatever.

>> If you feel like it, you’re welcome to point them to your work in the
>> discussion at <https://forge.softwareheritage.org/T2430>. There’s one
>> person from NixOS (lewo) participating in the discussion and I’m sure
>> they’d be interested. Perhaps they’ll tell whether they care about
>> having it available as JSON.
>
> Good idea. I will work out a few more kinks and then bring it up there.
> I’ve already rewritten the parts that used the Guix daemon. Disarchive
> now only needs a handful of Guix modules ('base32', 'serialization', and
> 'swh' are the ones that would be hard to remove).

An option would be to use (gcrypt base64); another one would be to
bundle (guix base32).

I was thinking that it might be best to not use Guix for computations.
For example, have “disarchive save” not build derivations and instead do
everything “here and now”. That would make it easier for others to
adopt. Wait, looking at the Git history, it looks like you already
addressed that point, neat. :-)

Thank you!

Ludo’.
Timothy Sample wrote on 5 Aug 2020 20:57
(name . Ludovic Courtès)(address . ludo@gnu.org)
874kpgudic.fsf@ngyro.com
Hey,

Ludovic Courtès <ludo@gnu.org> writes:

> Note that we have https://guix.gnu.org/sources.json. Last I checked,
> SWH was ingesting it in its “qualification” instance, so it should be
> ingesting it for good real soon if it’s not doing it already.

Oh fantastic! I was going to volunteer to do it, so that’s one thing
off my list.

> One can easily write a procedure that takes a tarball and returns a
> <computed-file> that builds its database entry. So at each commit, we’d
> just rebuild things that have changed.

That makes more sense. I will give this a shot soon.

> If we expose the database over HTTP (like over cgit), we can arrange so
> that (guix download) simply GETs db.example.org/sha256/xyz. No need to
> fetch the whole database.
>
> It might be more reasonable to have a real database and a real service
> around it, I’m sure Chris Baines would agree ;-), but we can choose URLs
> that could easily be implemented by a “real” service instead of cgit in
> the future.

I got it working over cgit shortly after sending my last message. :) So
far, I am very much on team “good enough for now”.

> Timothy Sample <samplet@ngyro.com> skribis:
>
>> I was imagining an escape hatch beyond this, where one could look up a
>> provenance record from when Disarchive ingested and verified a source
>> code archive. The provenance record would tell you which version of
>> Guix was used when saving the archive, so you could try your luck with
>> using “guix time-machine” to reproduce Disarchive’s original
>> computation. If we perform database migrations, you would need to
>> travel back in time in the database, too. The idea is that you could
>> work around breakages in Disarchive automatically using the Power of
>> Guix™. Just a stray thought, really.
>
> Seems to me it Shouldn’t Be Necessary? :-)
>
> I mean, as long as the format is extensible and “future-proof”, we’ll
> always be able to rebuild tarballs and then re-disassemble them if we
> need to compute new hashes or whatever.

If Disarchive relies on external compressors, there’s an outside chance
that those compressors could change under our feet. In that case, one
would want to be able to track down exactly which version of XZ was used
when Disarchive verified that it could reassemble a given source
archive. Maybe I’m being paranoid, but if the database entries are
being computed by the CI infrastructure it would be pretty easy to note
the Guix commit just in case.

Toggle quote (6 lines)
> I was thinking that it might be best to not use Guix for computations.
> For example, have “disarchive save” not build derivations and instead do
> everything “here and now”. That would make it easier for others to
> adopt. Wait, looking at the Git history, it looks like you already
> addressed that point, neat. :-)

Since my last message I managed to remove Guix as a dependency completely.
Right now it loads ‘(guix swh)’ opportunistically, but I might just copy
the code in. Directory references now support multiple “addresses” so
that you could have Nix-style, SWH-style, IPFS-style, etc. Hopefully my
next message will have a WIP patch enabling Guix to use Disarchive!


-- Tim
Ludovic Courtès wrote on 23 Aug 2020 18:21
(name . Timothy Sample)(address . samplet@ngyro.com)
87r1rxbafi.fsf@gnu.org
Hello!

Timothy Sample <samplet@ngyro.com> skribis:

>> If we expose the database over HTTP (like over cgit), we can arrange so
>> that (guix download) simply GETs db.example.org/sha256/xyz. No need to
>> fetch the whole database.
>>
>> It might be more reasonable to have a real database and a real service
>> around it, I’m sure Chris Baines would agree ;-), but we can choose URLs
>> that could easily be implemented by a “real” service instead of cgit in
>> the future.
>
> I got it working over cgit shortly after sending my last message. :) So
> far, I am very much on team “good enough for now”.

Wonderful. :-)

>> Timothy Sample <samplet@ngyro.com> skribis:
>>
>>> I was imagining an escape hatch beyond this, where one could look up a
>>> provenance record from when Disarchive ingested and verified a source
>>> code archive. The provenance record would tell you which version of
>>> Guix was used when saving the archive, so you could try your luck with
>>> using “guix time-machine” to reproduce Disarchive’s original
>>> computation. If we perform database migrations, you would need to
>>> travel back in time in the database, too. The idea is that you could
>>> work around breakages in Disarchive automatically using the Power of
>>> Guix™. Just a stray thought, really.
>>
>> Seems to me it Shouldn’t Be Necessary? :-)
>>
>> I mean, as long as the format is extensible and “future-proof”, we’ll
>> always be able to rebuild tarballs and then re-disassemble them if we
>> need to compute new hashes or whatever.
>
> If Disarchive relies on external compressors, there’s an outside chance
> that those compressors could change under our feet. In that case, one
> would want to be able to track down exactly which version of XZ was used
> when Disarchive verified that it could reassemble a given source
> archive.

Oh, true. Gzip and bzip2 are more-or-less “set in stone”, but xz, lzip,
or zstd could change. Recording the exact version of the implementation
would be a good stopgap.

> Maybe I’m being paranoid, but if the database entries are being
> computed by the CI infrastructure it would be pretty easy to note the
> Guix commit just in case.

Yeah, that makes sense. At least we could have “notes” in the file
format to store that kind of info. Using CI is also a good idea.
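
Purely as an illustration of what such “notes” could hold, here is a small Python sketch. The field names ("tools", "guix-commit"), the version strings used with it, and both function names are hypothetical; nothing here reflects Disarchive's actual format.

```python
from typing import Dict, Optional


def provenance_note(tools: Dict[str, str],
                    guix_commit: Optional[str] = None) -> dict:
    """Record which tool versions verified the reassembly of an archive."""
    note = {"tools": dict(sorted(tools.items()))}
    if guix_commit is not None:
        note["guix-commit"] = guix_commit  # lets one try 'guix time-machine'
    return note


def same_toolchain(note: dict, current_tools: Dict[str, str]) -> bool:
    """True when every tool recorded in NOTE has the same version today."""
    return all(current_tools.get(name) == version
               for name, version in note["tools"].items())
```

If same_toolchain returns False for, say, xz, that is the cue to go back to the recorded version instead of trusting the compressor currently installed.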

>> I was thinking that it might be best to not use Guix for computations.
>> For example, have “disarchive save” not build derivations and instead do
>> everything “here and now”. That would make it easier for others to
>> adopt. Wait, looking at the Git history, it looks like you already
>> addressed that point, neat. :-)
>
> Since my last message I managed to remove Guix as a dependency completely.
> Right now it loads ‘(guix swh)’ opportunistically, but I might just copy
> the code in. Directory references now support multiple “addresses” so
> that you could have Nix-style, SWH-style, IPFS-style, etc. Hopefully my
> next message will have a WIP patch enabling Guix to use Disarchive!

Neat, looking forward to it!

Thank you,
Ludo’.
zimoun wrote on 26 Aug 2020 12:04
86blixyb7c.fsf@gmail.com
Dear Timothy,

On Thu, 30 Jul 2020 at 13:36, Timothy Sample <samplet@ngyro.com> wrote:

> I call the thing “Disarchive” as in “disassemble a source code archive”.
> You can find it at <https://git.ngyro.com/disarchive/>. It has a simple
> command-line interface so you can do
>
> $ disarchive save software-1.0.tar.gz
>
> which serializes a disassembled version of “software-1.0.tar.gz” to the
> database (which is just a directory) specified by the “DISARCHIVE_DB”
> environment variable. Next, you can run
>
> $ disarchive load hash-of-something-in-the-db
>
> which will recover an original file from its metadata (stored in the
> database) and data retrieved from the SWH archive or taken from a cache
> (again, just a directory) specified by “DISARCHIVE_DIRCACHE”.

Really nice! Thank you!


>> I think we’d have to maintain a database that maps tarball hashes to
>> metadata (!). A simple version of it could be a Git repo where, say,
>> ‘sha256/0mq9fc0ig0if5x9zjrs78zz8gfzczbvykj2iwqqd6salcqdgdwhk’ would
>> contain the metadata above. The nice thing is that the Git repo itself
>> could be archived by SWH. :-)
>
> You mean like <https://git.ngyro.com/disarchive-db/>? :)

[...]

> This was generated by a little script built on top of “fold-packages”.
> It downloads Gzip’d tarballs used by Guix packages and passes them on to
> Disarchive for disassembly. I limited the number to 100 because it’s
> slow and because I’m sure there is a long tail of weird software
> archives that are going to be hard to process. The metadata directory
> ended up being 13M and the directory cache 2G.

One question is how this database scales?

For example, a quick back-of-the-envelope estimation leads to ~1.2GB of
metadata for ~14k packages and then an increase of ~700MB per year, both
with Ludo’s code [1].

[1] <http://issues.guix.gnu.org/issue/42162#11>




> I could remove most of the Guix stuff so that it would be easy to
> package in Guix, Nix, Debian, etc. Then, someone™ could write a service
> that consumes a “sources.json” file, adds the sources to a Disarchive
> database, and pushes everything to a Git repo. I guess everyone who
> cares has to produce a “sources.json” file anyway, so it will be very
> little extra work. Other stuff like changing the serialization format
> to JSON would be pretty easy, too. I’m not well connected to these
> other projects, mind you, so I’m not really sure how to reach out.

This service could be really useful. Yes, it could be easy to update
the database each time Guix produces a new “sources.json”.

As mentioned [2], should this service be part of SWH (download cooking
task)? Or project side?

[2] <https://forge.softwareheritage.org/T2430#47486>



Thank you again for this piece of work.

All the best,
simon
Timothy Sample wrote on 26 Aug 2020 23:11
(name . zimoun)(address . zimon.toutoune@gmail.com)
87k0xlaz8p.fsf@ngyro.com
Hi zimoun,

zimoun <zimon.toutoune@gmail.com> writes:

> One question is how this database scales?
>
> For example, a quick back-of-the-envelope estimation leads to ~1.2GB metadata
> for ~14k packages and then an increase of ~700MB per year, both with the
> Ludo’s code [1].
>
> [1] <http://issues.guix.gnu.org/issue/42162#11>

It’s a good question. A good part of the size comes from the
representation rather than the data. Compression helps a lot here. I
have a database of 3,912 packages. It’s 295M uncompressed (which is a
little better than your estimation). If I pass each file through Lzip,
it shrinks down to 60M. That’s more like 15.5K per package, which is
almost an order of magnitude smaller than the estimation you used
(120K). I think that makes the numbers rather pleasant, but it comes at
the expense of easy storing in Git.
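
For what it’s worth, the arithmetic can be checked with a few lines of Python. The sizes and the package count are the ones quoted above; reading “M” as MiB and assuming the per-package size stays constant when extrapolating to ~14k packages are my assumptions.

```python
def per_package_kib(total_mib: float, packages: int) -> float:
    """Average metadata size per package, in KiB."""
    return total_mib * 1024 / packages


PACKAGES = 3912  # size of the test database

uncompressed = per_package_kib(295, PACKAGES)  # pretty-printed entries
compressed = per_package_kib(60, PACKAGES)     # each file passed through Lzip

# Naive extrapolation to ~14k packages, assuming a constant per-package size.
projected_mib = 14_000 * compressed / 1024

print(f"uncompressed: {uncompressed:.1f} KiB per package")
print(f"compressed:   {compressed:.1f} KiB per package")
print(f"projected:    {projected_mib:.0f} MiB for 14k packages")
```

Depending on whether one counts in powers of 10 or powers of 2, the compressed figure lands between roughly 15K and 16K per package, consistent with the number above.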

> As mentioned [2], should this service be part of SWH (download cooking
> task)? Or project side?
>
> [2] <https://forge.softwareheritage.org/T2430#47486>

It would be interesting to just have SWH absorb the project. Since
other distros already know how to produce a “sources.json” and how to
query the SWH archive, it would mean that they benefit for free (and so
would Guix, for that matter). I’m open to that, but right now having
the freedom to experiment is important.


-- Tim
zimoun wrote on 27 Aug 2020 11:41
(name . Timothy Sample)(address . samplet@ngyro.com)
86lfi0e88r.fsf@gmail.com
Hi,

On Wed, 26 Aug 2020 at 17:11, Timothy Sample <samplet@ngyro.com> wrote:
> zimoun <zimon.toutoune@gmail.com> writes:
>
>> One question is how this database scales?
>>
>> For example, a quick back-of-the-envelope estimation leads to ~1.2GB metadata
>> for ~14k packages and then an increase of ~700MB per year, both with the
>> Ludo’s code [1].
>>
>> [1] <http://issues.guix.gnu.org/issue/42162#11>
>
> It’s a good question. A good part of the size comes from the
> representation rather than the data. Compression helps a lot here. I
> have a database of 3,912 packages. It’s 295M uncompressed (which is a
> little better than your estimation). If I pass each file through Lzip,
> it shrinks down to 60M. That’s more like 15.5K per package, which is
> almost an order of magnitude smaller than the estimation you used
> (120K). I think that makes the numbers rather pleasant, but it comes at
> the expense of easy storing in Git.

Thank you for these numbers. Really interesting!

First, I do not know if the database needs to be stored with Git. What
would be the advantage? (naive question :-))


On SWH T2430 [1], you explain the “default-header” trick to cut down the
size. Nice!

Moreover, the format is a long list, e.g.,

(headers
((name "raptor2-2.0.15/")
(mode 493)
(mtime 1414909500)
(chksum 4225)
(typeflag 53))
((name "raptor2-2.0.15/build/")
(mode 493)
(mtime 1414909497)
(chksum 4797)
(typeflag 53))
((name "raptor2-2.0.15/build/ltversion.m4")
(size 690)
(mtime 1414908273)
(chksum 5958))

[…])

which is human-readable. Is it useful?


Instead, one could imagine shorter keywords:

((na "raptor2-2.0.15/")
(mo 493)
(mt 1414909500)
(ch 4225)
(ty 53))

which, using your database (commit fc50927), reduces the size from 295MB
to 279MB.

Or even plain list:

(\x00 "raptor2-2.0.15/" 493 1414909500 4225 53)
(\x01 "raptor2-2.0.15/build/ltversion.m4" 690 1414908273 5958)

where the first element provides the “type” of list to ease the reader.
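
To make the comparison concrete, here is a small Python sketch that serializes one header in the three styles. The header values are the ones from the example above; the serializer functions and the use of an integer tag in place of \x00 are assumptions made for illustration, not the Disarchive writer.

```python
HEADER = {"name": "raptor2-2.0.15/", "mode": 493, "mtime": 1414909500,
          "chksum": 4225, "typeflag": 53}

# Two-letter keywords, as in the "shorter keywords" variant.
SHORT = {"name": "na", "mode": "mo", "mtime": "mt",
         "chksum": "ch", "typeflag": "ty"}


def atom(value) -> str:
    """Write a string with quotes and a number as-is."""
    return f'"{value}"' if isinstance(value, str) else str(value)


def verbose(header: dict) -> str:
    return "(" + " ".join(f"({k} {atom(v)})" for k, v in header.items()) + ")"


def short(header: dict) -> str:
    return "(" + " ".join(f"({SHORT[k]} {atom(v)})" for k, v in header.items()) + ")"


def positional(header: dict, tag: int = 0) -> str:
    """Plain list: a leading tag tells the reader which fields follow."""
    return "(" + " ".join([str(tag)] + [atom(v) for v in header.values()]) + ")"


for encode in (verbose, short, positional):
    text = encode(HEADER)
    print(f"{encode.__name__:10} {len(text):3} bytes  {text}")
```

This only measures a single header; the 295MB to 279MB figure above is the measurement on the real database.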


Well, the 2 naive questions are: does it make sense to
- have the database stored under Git?
- have a human-readable format?


Thank you again for pushing forward this topic. :-)

All the best,
simon

[1] https://forge.softwareheritage.org/T2430#47522

Ludovic Courtès wrote on 27 Aug 2020 14:49
(name . zimoun)(address . zimon.toutoune@gmail.com)
87lfi0tfrk.fsf@gnu.org
Hi!

zimoun <zimon.toutoune@gmail.com> skribis:

> Moreover, the format is a long list, e.g.,
>
> (headers
> ((name "raptor2-2.0.15/")
> (mode 493)
> (mtime 1414909500)
> (chksum 4225)
> (typeflag 53))
> ((name "raptor2-2.0.15/build/")
> (mode 493)
> (mtime 1414909497)
> (chksum 4797)
> (typeflag 53))
> ((name "raptor2-2.0.15/build/ltversion.m4")
> (size 690)
> (mtime 1414908273)
> (chksum 5958))
>
> […])
>
> which is human-readable. Is it useful?
>
>
> Instead, one could imagine shorter keywords:
>
> ((na "raptor2-2.0.15/")
> (mo 493)
> (mt 1414909500)
> (ch 4225)
> (ty 53))
>
> which using your database (commit fc50927) reduces from 295MB to 279MB.

I think it’s nice, at least at this stage, that it’s
human-readable—“premature optimization is the root of all evil”. :-)

I guess it won’t be difficult to make the format more dense eventually
if that is deemed necessary, using ‘write’ instead of ‘pretty-print’,
using tricks like you write, or even going binary as a last resort.

Ludo’.
Bengt Richter wrote on 27 Aug 2020 20:06
(name . zimoun)(address . zimon.toutoune@gmail.com)
20200827180651.GA3255@LionPure
Hi,

On +2020-08-27 11:41:24 +0200, zimoun wrote:
> Hi,
>
> On Wed, 26 Aug 2020 at 17:11, Timothy Sample <samplet@ngyro.com> wrote:
> > zimoun <zimon.toutoune@gmail.com> writes:
> >
> >> One question is how this database scales?
> >>
> >> For example, a quick back-of-the-envelope estimation leads to ~1.2GB metadata
> >> for ~14k packages and then an increase of ~700MB per year, both with the
> >> Ludo’s code [1].
> >>
> >> [1] <http://issues.guix.gnu.org/issue/42162#11>
> >
> > It’s a good question. A good part of the size comes from the
> > representation rather than the data. Compression helps a lot here. I
> > have a database of 3,912 packages. It’s 295M uncompressed (which is a
> > little better than your estimation). If I pass each file through Lzip,
> > it shrinks down to 60M. That’s more like 15.5K per package, which is
> > almost an order of magnitude smaller than the estimation you used
> > (120K). I think that makes the numbers rather pleasant, but it comes at
> > the expense of easy storing in Git.
>
> Thank you for these numbers. Really interesting!
>
> First, I do not know if the database needs to be stored with Git. What
> should be the advantage? (naive question :-))
>
>
> On SWH T2430 [1], you explain the “default-header” trick to cut down the
> size. Nice!
>
> Moreover, the format is a long list, e.g.,
>
> --8<---------------cut here---------------start------------->8---
> (headers

How about
(X-v1-headers
(borrowing from rfc2045 MIME usage indicating as-yet-not-a-formal-standard)
The idea is to make it easy to script the change to "(headers" once
there is consensus for declaring a new standard. The "v1-" part could allow
a simultaneous "(X-v2-headers" alternative for zimoun's concise suggestion,
or even a base64 of a compressed format. There's lots that could be borrowed from
the MIME rfc's :)

6.3. New Content-Transfer-Encodings

Implementors may, if necessary, define private Content-Transfer-
Encoding values, but must use an x-token, which is a name prefixed by
"X-", to indicate its non-standard status, e.g., "Content-Transfer-
Encoding: x-my-new-encoding". Additional standardized Content-
Transfer-Encoding values must be specified by a standards-track RFC.
The requirements such specifications must meet are given in RFC 2048.
As such, all content-transfer-encoding namespace except that
beginning with "X-" is explicitly reserved to the IETF for future
use.

Unlike media types and subtypes, the creation of new Content-
Transfer-Encoding values is STRONGLY discouraged, as it seems likely
to hinder interoperability with little potential benefit
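
To show how little scripting the "X-vN-" convention would need, here is a minimal Python sketch. The tag syntax follows the "(X-v1-headers" suggestion above; the function names and the naive textual rewrite are assumptions, not an existing tool.

```python
import re
from typing import NamedTuple, Optional


class Tag(NamedTuple):
    name: str                # e.g. "headers"
    version: Optional[int]   # None for a standardized, unprefixed tag
    experimental: bool


_TAG = re.compile(r"^X-v(\d+)-(.+)$")


def parse_tag(symbol: str) -> Tag:
    """Split 'X-v1-headers' into its name, version and experimental flag."""
    match = _TAG.match(symbol)
    if match:
        return Tag(match.group(2), int(match.group(1)), True)
    return Tag(symbol, None, False)


def promote(text: str, version: int, name: str) -> str:
    """Rewrite '(X-vN-NAME' to '(NAME' once version N is declared standard.

    This is a plain textual replacement, so it assumes the tag never
    appears inside a string or as the prefix of a longer symbol.
    """
    return text.replace(f"(X-v{version}-{name}", f"({name}")
```

A reader could dispatch on parse_tag(...).version to accept several experimental layouts side by side.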

> ((name "raptor2-2.0.15/")
> (mode 493)
If you want to be more human-readable with mode, I would put
a chmod argument in place of 493 :)

$ printf "%o\n" 493
755
$

Hm, could this be a security risk??
I mean, could a mode typo here inadvertently open a door for a nasty mod
by opportunistic code buried in a later-executed apparently unrelated app?

> (mtime 1414909500)
One of these might be more human-recognizable :)
$ date --date='@1414909497' -Is
2014-11-02T07:24:57+01:00
$ date --date='@1414909497' -uIs
2014-11-02T06:24:57+00:00
$ TZ=America/Buenos_Aires date --date='@1414909497' -Is
2014-11-02T03:24:57-03:00
$
$ date --date='@1414909497' -u '+%Y%m%d_%H%M%S'
20141102_062457
# vs 1414909497, which, yes, costs 5 chars less
$
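
For a reader that wants to display these fields rather than store them differently, both conversions are also available from the Python standard library. This is only a sketch: the function name is made up, and treating every typeflag other than "5" (directory) as a regular file is a simplification.

```python
import stat
from datetime import datetime, timezone


def describe(mode: int, mtime: int, typeflag: int) -> str:
    """Render tar header fields the way 'ls -l' and 'date -uIs' would."""
    # typeflag is stored as a character code: 53 is ASCII "5", a directory.
    file_type = stat.S_IFDIR if chr(typeflag) == "5" else stat.S_IFREG
    perms = stat.filemode(file_type | mode)
    when = datetime.fromtimestamp(mtime, tz=timezone.utc).isoformat()
    return f"{perms} {oct(mode)} {when}"


print(describe(493, 1414909497, 53))
```

The stored values stay as plain integers; only the presentation changes.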

> (chksum 4225)
> (typeflag 53))
> ((name "raptor2-2.0.15/build/")
> (mode 493)
> (mtime 1414909497)
> (chksum 4797)
> (typeflag 53))
> ((name "raptor2-2.0.15/build/ltversion.m4")
> (size 690)
> (mtime 1414908273)
> (chksum 5958))
>
> […])
> --8<---------------cut here---------------end--------------->8---
>
> which is human-readable. Is it useful?
>
>
> Instead, one could imagine shorter keywords:
>
(X-v2-headers ;; ;-)
> ((na "raptor2-2.0.15/")
> (mo 493)
> (mt 1414909500)
> (ch 4225)
> (ty 53))
>
> which using your database (commit fc50927) reduces from 295MB to 279MB.
>
> Or even plain list:
>
(X-v3-headers
> (\x00 "raptor2-2.0.15/" 493 1414909500 4225 53)
> (\x01 "raptor2-2.0.15/build/ltversion.m4" 690 1414908273 5958)
>
> where the first element provides the “type” of list to ease the reader.
>
>
> Well, the 2 naive questions are: does it make sense to
> - have the database stored under Git?
> - have a human-readable format?
>
>
> Thank you again for pushing forward this topic. :-)
>
> All the best,
> simon
>
> [1] https://forge.softwareheritage.org/T2430#47522
>
>
>

Prefixing "X-" can obviously be used with any tentative name for anything.

I am suggesting it as a counter to premature (and likely clashing) bindings
of valuable names, which IMO is as bad as premature optimization :)

Naming is too important to be defined by first-user flag-planting, ISTM.
--
Regards,
Bengt Richter
Ludovic Courtès wrote on 3 Nov 2020 15:26
(name . Timothy Sample)(address . samplet@ngyro.com)
87pn4ucy94.fsf@gnu.org
Hi Timothy,

I hope you’re well. I was wondering if you’ve had the chance to fiddle
with Disarchive since the summer?

I’m thinking there are small steps we could take to move forward:

1. Have a Disarchive package in Guix (and one for guile-quickcheck,
kudos on that one!).

2. Have a Cuirass job running on ci.guix.gnu.org to build and publish
the disarchive-db.

3. Integrate Disarchive in (guix download) to reconstruct tarballs.

WDYT?

Thanks,
Ludo’, who’s still very much excited about these perspectives!
zimoun wrote on 3 Nov 2020 17:37
(name . Ludovic Courtès)(address . ludo@gnu.org)
CAJ3okZ3SgUPYvvWvGyM=sWf1X2ik_GdyJPMxhSrPUpY+v82dbw@mail.gmail.com
Hi,

On Tue, 3 Nov 2020 at 15:26, Ludovic Courtès <ludo@gnu.org> wrote:

> 2. Have a Cuirass job running on ci.guix.gnu.org to build and publish
> the disarchive-db.

One question is: how does the database scale? And only the real world
can show it. ;-)


> Ludo’, who’s still very much excited about these perspectives!

Sounds awesome!

On my side, I asked twice on #swh-devel if it is possible to set up a
higher rate limit for one specific machine. I have in mind one
machine located at my place (Univ. Paris, ex Paris 7 Diderot) because
of proximity and because I want to generate (script) some report about
how much Guix is in SWH. Whatever!
Instead, we could ask for Berlin or for one machine of INRIA Bordeaux,
maybe the machine running guix.gnu.org or the one running
hpc.guix.info. WDYT?

BTW, this is not related to tarballs and I have not worked much on it
(running out of time), but I would like to integrate hg-fetch and
svn-fetch with SWH, first in "guix lint -c archival" and then in
sources.json. The save seems not so hard, but the lookup needs some
experiments with the SWH API.
The big picture is to have all the ingestion of the Guix packages done
by the automatically generated sources.json file and not via
occasional runs of "guix lint -c archival" (which should still be
recommended for custom channels).


All the best,
simon
Timothy Sample wrote on 3 Nov 2020 20:20
(name . Ludovic Courtès)(address . ludo@gnu.org)
87y2jint6k.fsf@ngyro.com
Hi Ludo,

Ludovic Courtès <ludo@gnu.org> writes:

> I hope you’re well. I was wondering if you’ve had the chance to fiddle
> with Disarchive since the summer?

Sort of! I managed to get the entire corpus of tarballs that I started
with to work (about 4000 archives). After that, I started writing some
documentation. The goal there was to be more careful with the serialization
format. Starting to think clearly about the format and how to ensure
long-term compatibility gave me a bit of vertigo, so I took a break. :)

I was kind of hoping the initial excitement at SWH would push the
project along, but that seems to have died down (for now). Going back
to making sure it works for Guix is probably the best way to develop it
until I hear more from SWH.

> I’m thinking there are small steps we could take to move forward:
>
> 1. Have a Disarchive package in Guix (and one for guile-quickcheck,
> kudos on that one!).

This will be easy. The hang-up I had earlier was that I vendored the
pristine-tar Gzip utility (“zgz”). Since then I don’t think it’s such a
big deal.

(I wrote Guile-QuickCheck ages ago! It was rotting away on my disk
because I couldn’t figure out a good way to use it with, say, Gash. It
has exposed several Disarchive bugs already.)

> 2. Have a Cuirass job running on ci.guix.gnu.org to build and publish
> the disarchive-db.

I’m interested in running Cuirass locally for other reasons, so I should
have a good test environment to figure this out. To be honest, I’ve had
trouble figuring out Cuirass in the past, so I was dragging my feet a
bit.

> 3. Integrate Disarchive in (guix download) to reconstruct tarballs.

I had a very simple patch that did this! It was less exciting when it
sounded like SWH was going to use Disarchive directly. However, like I
wrote, making Disarchive work for Guix is probably the best way to make
it work for SWH if they want it in the future.

> WDYT?

This all will have to wait in the queue for a bit longer, but I should
be able to return to it soon. I think the steps listed above are good,
along with some changes I want to make to Disarchive itself.


--Tim
Ludovic Courtès wrote on 4 Nov 2020 17:49
(name . Timothy Sample)(address . samplet@ngyro.com)
87eel983u0.fsf@gnu.org
Hello!

Timothy Sample <samplet@ngyro.com> skribis:

> Ludovic Courtès <ludo@gnu.org> writes:
>
>> I hope you’re well. I was wondering if you’ve had the chance to fiddle
>> with Disarchive since the summer?
>
> Sort of! I managed to get the entire corpus of tarballs that I started
> with to work (about 4000 archives). After that, I started writing some
> documentation. The goal there was to be more careful with the serialization
> format. Starting to think clearly about the format and how to ensure
> long-term compatibility gave me a bit of vertigo, so I took a break. :)
>
> I was kind of hoping the initial excitement at SWH would push the
> project along, but that seems to have died down (for now). Going back
> to making sure it works for Guix is probably the best way to develop it
> until I hear more from SWH.

Yeah, I suppose they have enough on their plate and won’t add it to
their agenda until we have shown that it works for us.

>> I’m thinking there are small steps we could take to move forward:
>>
>> 1. Have a Disarchive package in Guix (and one for guile-quickcheck,
>> kudos on that one!).
>
> This will be easy. The hang-up I had earlier was that I vendored the
> pristine-tar Gzip utility (“zgz”). Since then I don’t think it’s such a
> big deal.

Yeah.

> (I wrote Guile-QuickCheck ages ago! It was rotting away on my disk
> because I couldn’t figure out a good way to use it with, say, Gash. It
> has exposed several Disarchive bugs already.)

Neat! I’m sure many of us would love to use it. :-)

> This all will have to wait in the queue for a bit longer, but I should
> be able to return to it soon. I think the steps listed above are good,
> along with some changes I want to make to Disarchive itself.

Alright! Let us know if you think there are tasks that people should
just pick and work on in the meantime.

Thanks for the prompt reply!

Ludo’.
Maxim Cournoyer wrote on 10 Jan 2021 20:32
Re: bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020
(name . Ludovic Courtès)(address . ludovic.courtes@inria.fr)
87y2h04mhb.fsf@gmail.com
Hello Ludovic,

Ludovic Courtès <ludovic.courtes@inria.fr> writes:

> Hello!
>
> The hosting site gforge.inria.fr will be taken off-line in December
> 2020. This GForge instance hosts source code as tarballs, Subversion
> repos, and Git repos. Users have been invited to migrate to
> gitlab.inria.fr, which is Git only. It seems that Software Heritage
> hasn’t archived (yet) all of gforge.inria.fr. Let’s keep track of the
> situation in this issue.
>
> The following packages have their source on gforge.inria.fr:
>
> scheme@(guile-user)> ,pp packages-on-gforge
> $7 = (#<package r-spams@2.6-2017-03-22 gnu/packages/statistics.scm:3931 7f632401a640>
> #<package ocaml-cudf@0.9 gnu/packages/ocaml.scm:295 7f63235eb3c0>
> #<package ocaml-dose3@5.0.1 gnu/packages/ocaml.scm:357 7f63235eb280>
> #<package mpfi@1.5.4 gnu/packages/multiprecision.scm:158 7f632ee3adc0>
> #<package pt-scotch@6.0.6 gnu/packages/maths.scm:2920 7f632d832640>
> #<package scotch@6.0.6 gnu/packages/maths.scm:2774 7f632d832780>
> #<package pt-scotch32@6.0.6 gnu/packages/maths.scm:2944 7f632d8325a0>
> #<package scotch32@6.0.6 gnu/packages/maths.scm:2873 7f632d8326e0>
> #<package gf2x@1.2 gnu/packages/algebra.scm:103 7f6323ea1280>
> #<package gmp-ecm@7.0.4 gnu/packages/algebra.scm:658 7f6323eb4960>
> #<package cmh@1.0 gnu/packages/algebra.scm:322 7f6323eb4dc0>)
>
>
> ‘isl’ (a dependency of GCC) has its source on gforge.inria.fr but it’s
> also mirrored at gcc.gnu.org apparently.
>
> Of these, the following are available on Software Heritage:
>
> scheme@(guile-user)> ,pp archived-source
> $8 = (#<package ocaml-cudf@0.9 gnu/packages/ocaml.scm:295 7f63235eb3c0>
> #<package ocaml-dose3@5.0.1 gnu/packages/ocaml.scm:357 7f63235eb280>
> #<package pt-scotch@6.0.6 gnu/packages/maths.scm:2920 7f632d832640>
> #<package scotch@6.0.6 gnu/packages/maths.scm:2774 7f632d832780>
> #<package pt-scotch32@6.0.6 gnu/packages/maths.scm:2944 7f632d8325a0>
> #<package scotch32@6.0.6 gnu/packages/maths.scm:2873 7f632d8326e0>
> #<package isl@0.18 gnu/packages/gcc.scm:925 7f632dc82320>
> #<package isl@0.11.1 gnu/packages/gcc.scm:939 7f632dc82280>)

I ran the code you had attached to the original message and got:

,pp packages-on-gforge
$2 = ()
scheme@(guile-user)> ,pp archived-source
$3 = ()

Closing,

Thank you.

Maxim
Closed
Ludovic Courtès wrote on 13 Jan 2021 11:39
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)
87a6tdce94.fsf@inria.fr
Hi Maxim,

Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

>> The following packages have their source on gforge.inria.fr:
>>
>> scheme@(guile-user)> ,pp packages-on-gforge
>> $7 = (#<package r-spams@2.6-2017-03-22 gnu/packages/statistics.scm:3931 7f632401a640>
>> #<package ocaml-cudf@0.9 gnu/packages/ocaml.scm:295 7f63235eb3c0>
>> #<package ocaml-dose3@5.0.1 gnu/packages/ocaml.scm:357 7f63235eb280>

[...]

> I ran the code you had attached to the original message and got:
>
> ,pp packages-on-gforge
> $2 = ()
> scheme@(guile-user)> ,pp archived-source
> $3 = ()

Oh, that’s due to a bug: the wrong ‘origin?’ predicate was picked up, the
one exported by (guix swh). After hiding the “wrong” one:

#:use-module ((guix swh) #:hide (origin?))

I get:

scheme@(guix-user)> ,pp packages-on-gforge
$1 = (#<package r-spams@2.6-2017-03-22 gnu/packages/statistics.scm:3964 7fa8a522b280>
#<package ocaml-cudf@0.9 gnu/packages/ocaml.scm:281 7fa8a4f44dc0>
#<package ocaml-dose3@5.0.1 gnu/packages/ocaml.scm:343 7fa8a4f44c80>
#<package mpfi@1.5.4 gnu/packages/multiprecision.scm:158 7fa8afd8aa00>
#<package scotch@6.1.0 gnu/packages/maths.scm:3083 7fa8a69c8d20>
#<package pt-scotch@6.1.0 gnu/packages/maths.scm:3229 7fa8a69c8be0>
#<package scotch32@6.1.0 gnu/packages/maths.scm:3182 7fa8a69c8c80>
#<package pt-scotch32@6.1.0 gnu/packages/maths.scm:3253 7fa8a69c8b40>
#<package isl@0.22.1 gnu/packages/gcc.scm:932 7fa8a64cbdc0>
#<package isl@0.11.1 gnu/packages/gcc.scm:997 7fa8a64cbc80>
#<package isl@0.18 gnu/packages/gcc.scm:983 7fa8a64cbd20>
#<package gf2x@1.2 gnu/packages/algebra.scm:104 7fa8a4f66500>
#<package gmp-ecm@7.0.4 gnu/packages/algebra.scm:672 7fa8a4f70be0>
#<package cmh@1.0 gnu/packages/algebra.scm:325 7fa8a4f660a0>)
scheme@(guix-user)> ,pp archived-source
$2 = (#<package ocaml-cudf@0.9 gnu/packages/ocaml.scm:281 7fa8a4f44dc0>
#<package ocaml-dose3@5.0.1 gnu/packages/ocaml.scm:343 7fa8a4f44c80>
#<package scotch@6.1.0 gnu/packages/maths.scm:3083 7fa8a69c8d20>
#<package pt-scotch@6.1.0 gnu/packages/maths.scm:3229 7fa8a69c8be0>
#<package scotch32@6.1.0 gnu/packages/maths.scm:3182 7fa8a69c8c80>
#<package pt-scotch32@6.1.0 gnu/packages/maths.scm:3253 7fa8a69c8b40>
#<package isl@0.11.1 gnu/packages/gcc.scm:997 7fa8a64cbc80>
#<package isl@0.18 gnu/packages/gcc.scm:983 7fa8a64cbd20>)

Attaching the fixed script for clarity.
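
As an aside, the effect of ‘#:hide’ can be reproduced without Guix. The
sketch below is only an analogy, not the actual (guix swh) situation: core
Guile and (srfi srfi-19) both provide a ‘current-time’ binding, and hiding
the SRFI-19 one leaves the core procedure in place, much like hiding the
‘origin?’ of (guix swh) leaves the package-level predicate visible to the
script.

```scheme
;; Analogy only: two bindings share the name 'current-time', one in core
;; Guile and one exported by (srfi srfi-19).  Import everything from
;; SRFI-19 except that one name.
(use-modules ((srfi srfi-19) #:hide (current-time)))

;; With the SRFI-19 binding hidden, this call reaches core Guile's
;; 'current-time', which returns the seconds since the Epoch as an integer.
(define now (current-time))
(display (integer? now))
(newline)

;; The other SRFI-19 exports are still imported as usual.
(define today (current-date))
(display (date? today))
(newline)
```

Without the ‘#:hide’ clause, the imported binding would take precedence over
the one the code meant to call, which is the kind of silent mix-up described
above.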

BTW, gforge.inria.fr shutdown has been delayed a bit, but most active
projects have started migrating to gitlab.inria.fr or elsewhere, so
hopefully we should be able to start updating our package recipes
accordingly. It’s likely, though, that tarballs were lost in the
migration.

For example, Scotch is now at https://gitlab.inria.fr/scotch/scotch.
The new site has the 6.1.0 release, but these are auto-generated tarballs
instead of the handcrafted one found on gforge.inria.fr (this one is fine,
though, since its tarball is archived as-is on SWH).

ISL, MPFI, and GMP-ECM haven’t migrated, it seems. CMH is now at
https://gitlab.inria.fr/cmh/cmh but without its tarballs.

Andreas, do you happen to know about the status of these?

We can already change Scotch and CMH to ‘git-fetch’ I think. That
doesn’t solve the problem for earlier Guix revisions though, and I hope
Disarchive will save us!

Thanks,
Ludo’.
(use-modules (guix) (gnu)
             (guix svn-download)
             (guix git-download)
             ((guix swh) #:hide (origin?))
             (ice-9 match)
             (srfi srfi-1)
             (srfi srfi-26))

(define (gforge? package)
  (define (gforge-string? str)
    (string-contains str "gforge.inria.fr"))

  (match (package-source package)
    ((? origin? o)
     (match (origin-uri o)
       ((? string? url)
        (gforge-string? url))
       (((? string? urls) ...)
        (any gforge-string? urls))                ;or 'find'
       ((? git-reference? ref)
        (gforge-string? (git-reference-url ref)))
       ((? svn-reference? ref)
        (gforge-string? (svn-reference-url ref)))
       (_ #f)))
    (_ #f)))

(define packages-on-gforge
  (fold-packages (lambda (package result)
                   (if (gforge? package)
                       (cons package result)
                       result))
                 '()))

(define archived-source
  (filter (lambda (package)
            (let* ((origin (package-source package))
                   (hash  (origin-hash origin)))
              (lookup-content (content-hash-value hash)
                              (symbol->string
                               (content-hash-algorithm hash)))))
          packages-on-gforge))
Closed
Andreas Enge wrote on 13 Jan 2021 13:27
(name . Ludovic Courtès)(address . ludovic.courtes@inria.fr)
X/7nQk1sj4TYaC7t@jurong
Hello,

On Wed, Jan 13, 2021 at 11:39:19AM +0100, Ludovic Courtès wrote:
> ISL, MPFI, and GMP-ECM haven’t migrated, it seems. CMH is now at
> <https://gitlab.inria.fr/cmh/cmh> but without its tarballs.
>
> Andreas, do you happen to know about the status of these?

For CMH, the tarballs are available from its (new) homepage:
I can update the location at the next release, which I should prepare
some time soon (TM).

Concerning MPFI and GMP-ECM, I can ask their respective authors to keep
me updated; I have no doubts they are going to migrate their projects.

For ISL, I do not know.

Andreas
Ludovic Courtès wrote on 13 Jan 2021 15:28
(address . 42162@debbugs.gnu.org)
87v9c0ap22.fsf_-_@gnu.org
help-debbugs@gnu.org (GNU bug Tracking System) writes:

> We can already change Scotch and CMH to ‘git-fetch’ I think.

For Scotch, the ‘v6.1.0’ tag at gitlab.inria.fr provides different
content than the tarball on gforge:
Only in /tmp/scotch_6.1.0/: bin
Only in /tmp/scotch_6.1.0/doc/src/ptscotch: p.ps
Only in /gnu/store/h84nd9h3131l63y4rllvzpnk6q0dsaq2-scotch-6.1.0-checkout: .gitignore
Only in /gnu/store/h84nd9h3131l63y4rllvzpnk6q0dsaq2-scotch-6.1.0-checkout: .gitlab-ci.yml
Only in /tmp/scotch_6.1.0/: include
Only in /tmp/scotch_6.1.0/: lib
diff -ru /tmp/scotch_6.1.0/src/libscotch/library.h /gnu/store/h84nd9h3131l63y4rllvzpnk6q0dsaq2-scotch-6.1.0-checkout/src/libscotch/library.h
--- /tmp/scotch_6.1.0/src/libscotch/library.h 1970-01-01 01:00:01.000000000 +0100
+++ /gnu/store/h84nd9h3131l63y4rllvzpnk6q0dsaq2-scotch-6.1.0-checkout/src/libscotch/library.h 1970-01-01 01:00:01.000000000 +0100
@@ -67,8 +67,6 @@
/*+ Integer type. +*/
-#include <stdint.h>
-
typedef DUMMYIDX SCOTCH_Idx;
typedef DUMMYINT SCOTCH_Num;
diff -ru /tmp/scotch_6.1.0/src/libscotch/Makefile /gnu/store/h84nd9h3131l63y4rllvzpnk6q0dsaq2-scotch-6.1.0-checkout/src/libscotch/Makefile
--- /tmp/scotch_6.1.0/src/libscotch/Makefile 1970-01-01 01:00:01.000000000 +0100
+++ /gnu/store/h84nd9h3131l63y4rllvzpnk6q0dsaq2-scotch-6.1.0-checkout/src/libscotch/Makefile 1970-01-01 01:00:01.000000000 +0100
@@ -2320,28 +2320,6 @@
common.h \
scotch.h
-library_graph_diam$(OBJ) : library_graph_diam.c \
- module.h \
- common.h \
- graph.h \
- scotch.h
-
-library_graph_diam_f$(OBJ) : library_graph_diam.c \
- module.h \
- common.h \
- scotch.h
-
-library_graph_induce$(OBJ) : library_graph_diam.c \
- module.h \
- common.h \
- graph.h \
- scotch.h
-
-library_graph_induce_f$(OBJ) : library_graph_diam.c \
- module.h \
- common.h \
- scotch.h
-
library_graph_io_chac$(OBJ) : library_graph_io_chac.c \
module.h \
common.h \
diff -ru /tmp/scotch_6.1.0/src/libscotchmetis/library_metis.h /gnu/store/h84nd9h3131l63y4rllvzpnk6q0dsaq2-scotch-6.1.0-checkout/src/libscotchmetis/library_metis.h
--- /tmp/scotch_6.1.0/src/libscotchmetis/library_metis.h 1970-01-01 01:00:01.000000000 +0100
+++ /gnu/store/h84nd9h3131l63y4rllvzpnk6q0dsaq2-scotch-6.1.0-checkout/src/libscotchmetis/library_metis.h 1970-01-01 01:00:01.000000000 +0100
@@ -106,7 +106,6 @@
*/
#ifndef SCOTCH_H /* In case "scotch.h" not included before */
-#include <stdint.h>
typedef DUMMYINT SCOTCH_Num;
#endif /* SCOTCH_H */
diff -ru /tmp/scotch_6.1.0/src/libscotchmetis/library_parmetis.h /gnu/store/h84nd9h3131l63y4rllvzpnk6q0dsaq2-scotch-6.1.0-checkout/src/libscotchmetis/library_parmetis.h
--- /tmp/scotch_6.1.0/src/libscotchmetis/library_parmetis.h 1970-01-01 01:00:01.000000000 +0100
+++ /gnu/store/h84nd9h3131l63y4rllvzpnk6q0dsaq2-scotch-6.1.0-checkout/src/libscotchmetis/library_parmetis.h 1970-01-01 01:00:01.000000000 +0100
@@ -106,7 +106,6 @@
*/
#ifndef SCOTCH_H /* In case "scotch.h" not included before */
-#include <stdint.h>
typedef DUMMYINT SCOTCH_Num;
#endif /* SCOTCH_H */
There’s not much we can do if upstream isn’t more cautious though.
Perhaps we can still update to the “new” 6.1.0, maybe labeling it
“6.1.0b”?

Attached a tentative patch.

Thanks,
Ludo’.
diff --git a/gnu/packages/maths.scm b/gnu/packages/maths.scm
index 7866bcc6eb..4f8f79052d 100644
--- a/gnu/packages/maths.scm
+++ b/gnu/packages/maths.scm
@@ -12,7 +12,7 @@
 ;;; Copyright © 2015 Fabian Harfert <fhmgufs@web.de>
 ;;; Copyright © 2016 Roel Janssen <roel@gnu.org>
 ;;; Copyright © 2016, 2018, 2020 Kei Kebreau <kkebreau@posteo.net>
-;;; Copyright © 2016, 2017, 2018, 2019, 2020 Ludovic Courtès <ludo@gnu.org>
+;;; Copyright © 2016, 2017, 2018, 2019, 2020, 2021 Ludovic Courtès <ludo@gnu.org>
 ;;; Copyright © 2016 Leo Famulari <leo@famulari.name>
 ;;; Copyright © 2016, 2017 Thomas Danckaert <post@thomasdanckaert.be>
 ;;; Copyright © 2017, 2018, 2019, 2020 Paul Garlick <pgarlick@tourbillion-technology.com>
@@ -3083,13 +3083,15 @@ implemented in ANSI C, and MPI for communications.")
   (package
     (name "scotch")
     (version "6.1.0")
-    (source
-     (origin
-      (method url-fetch)
-      (uri (string-append "https://gforge.inria.fr/frs/download.php/"
-                          "latestfile/298/scotch_" version ".tar.gz"))
+    (source (origin
+              (method git-fetch)
+              (uri (git-reference
+                    (url "https://gitlab.inria.fr/scotch/scotch")
+                    (commit (string-append "v" version))))
+              (file-name (git-file-name name version))
               (sha256
-       (base32 "1184fcv4wa2df8szb5lan6pjh0raarr45pk8ilpvbz23naikzg53"))
+               (base32
+                "164jqsy75j7zfnwngj10jc4060shhxni3z8ykklhqjykdrinir55"))
               (patches (search-patches "scotch-build-parallelism.patch"
                                        "scotch-integer-declarations.patch"))))
     (build-system gnu-build-system)
Andreas Enge wrote on 13 Jan 2021 16:07
(name . Ludovic Courtès)(address . ludovic.courtes@inria.fr)
X/8MmmIkoVubIZwe@jurong
On Wed, Jan 13, 2021 at 11:39:19AM +0100, Ludovic Courtès wrote:
> ISL, MPFI, and GMP-ECM haven’t migrated, it seems.

gmp-ecm has migrated to gitlab.inria.fr; I just pushed a commit with an
updated URI. Besides the automatically created gitlab releases with git
snapshots, the maintainer also uploads a release tarball. I chose to use
the latter, which requires manually updating the hash together with the
version number at each new release.

Andreas
Maxim Cournoyer wrote on 14 Jan 2021 15:21
(name . Ludovic Courtès)(address . ludovic.courtes@inria.fr)
871renmwem.fsf@gmail.com
Hi Ludovic,

Ludovic Courtès <ludovic.courtes@inria.fr> writes:

[...]

> There’s not much we can do if upstream isn’t more cautious though.
> Perhaps we can still update to the “new” 6.1.0, maybe labeling it
> “6.1.0b”?

I'd prefer to append a '-1' revision rather than change the version
string itself, as that is IMO upstream's business.

Thanks,

Maxim
Ludovic Courtès wrote on 4 Oct 2021 17:59
gforge.inria.fr is off-line
(address . 42162@debbugs.gnu.org)
87wnmsn5lz.fsf_-_@gnu.org
Hi!

Ludovic Courtès <ludovic.courtes@inria.fr> writes:

> help-debbugs@gnu.org (GNU bug Tracking System) writes:
>
>> We can already change Scotch and CMH to ‘git-fetch’ I think.
>
> For Scotch, the ‘v6.1.0’ tag at gitlab.inria.fr provides different
> content than the tarball on gforge:

[...]

> There’s not much we can do if upstream isn’t more cautious though.
> Perhaps we can still update to the “new” 6.1.0, maybe labeling it
> “6.1.0b”?
>
> Attached a tentative patch.

Believe it or not, gforge.inria.fr was finally phased out on
Sept. 30th. And believe it or not, despite all the work and all the
chat :-), we lost the source tarball of Scotch 6.1.1 for a short period
of time (I found a copy and uploaded it to berlin a couple of hours
ago).

Going back to the script at the beginning of this bug report, we get (on
688a4db071736a772e6b5515d7c03fe501c3c15a):

scheme@(guile-user)> ,pp packages-on-gforge
$2 = (#<package r-spams@2.6-2017-03-22 gnu/packages/statistics.scm:4357 7f08823d8630>
#<package ocaml-cudf@0.9 gnu/packages/ocaml.scm:566 7f088675c630>
#<package ocaml-dose3@5.0.1 gnu/packages/ocaml.scm:628 7f088675c4d0>
#<package mpfi@1.5.4 gnu/packages/multiprecision.scm:158 7f0881609160>
#<package scotch-shared@6.1.1 gnu/packages/maths.scm:3732 7f0882964c60>
#<package pt-scotch32@6.1.1 gnu/packages/maths.scm:3814 7f0882964b00>
#<package pt-scotch-shared@6.1.1 gnu/packages/maths.scm:3837 7f0882964a50>
#<package scotch32@6.1.1 gnu/packages/maths.scm:3684 7f0882964d10>
#<package why3@1.3.3 gnu/packages/maths.scm:6904 7f0882357e70>
#<package pt-scotch@6.1.1 gnu/packages/maths.scm:3790 7f0882964bb0>
#<package scotch@6.1.1 gnu/packages/maths.scm:3581 7f0882964dc0>
#<package isl@0.18 gnu/packages/gcc.scm:1103 7f088161cbb0>
#<package isl@0.22.1 gnu/packages/gcc.scm:1052 7f088161cc60>
#<package isl@0.11.1 gnu/packages/gcc.scm:1117 7f088161cb00>
#<package gf2x@1.2 gnu/packages/algebra.scm:107 7f0880397b00>
#<package gappa@1.3.5 gnu/packages/algebra.scm:1273 7f08803a76e0>)
scheme@(guile-user)> ,pp archived-source
$3 = (#<package ocaml-cudf@0.9 gnu/packages/ocaml.scm:566 7f088675c630>
#<package ocaml-dose3@5.0.1 gnu/packages/ocaml.scm:628 7f088675c4d0>
#<package isl@0.18 gnu/packages/gcc.scm:1103 7f088161cbb0>
#<package isl@0.11.1 gnu/packages/gcc.scm:1117 7f088161cb00>)
scheme@(guile-user)> ,pp (lset-difference eq? packages-on-gforge archived-source)
$4 = (#<package r-spams@2.6-2017-03-22 gnu/packages/statistics.scm:4357 7f08823d8630>
#<package mpfi@1.5.4 gnu/packages/multiprecision.scm:158 7f0881609160>
#<package scotch-shared@6.1.1 gnu/packages/maths.scm:3732 7f0882964c60>
#<package pt-scotch32@6.1.1 gnu/packages/maths.scm:3814 7f0882964b00>
#<package pt-scotch-shared@6.1.1 gnu/packages/maths.scm:3837 7f0882964a50>
#<package scotch32@6.1.1 gnu/packages/maths.scm:3684 7f0882964d10>
#<package why3@1.3.3 gnu/packages/maths.scm:6904 7f0882357e70>
#<package pt-scotch@6.1.1 gnu/packages/maths.scm:3790 7f0882964bb0>
#<package scotch@6.1.1 gnu/packages/maths.scm:3581 7f0882964dc0>
#<package isl@0.22.1 gnu/packages/gcc.scm:1052 7f088161cc60>
#<package gf2x@1.2 gnu/packages/algebra.scm:107 7f0880397b00>
#<package gappa@1.3.5 gnu/packages/algebra.scm:1273 7f08803a76e0>)

All this to say that we must really get our act together with Disarchive
:-), and salvage all these tarballs until then.
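
For readers unfamiliar with SRFI-1, the ‘lset-difference’ call in the
session above keeps the elements of its first list that have no ‘eq?’
counterpart in the second one. A minimal sketch, with made-up symbols
standing in for the package objects:

```scheme
(use-modules (srfi srfi-1))

;; Made-up symbols standing in for package objects.
(define on-gforge '(scotch mpfi isl gf2x))
(define archived  '(isl))

;; Elements of ON-GFORGE with no 'eq?' counterpart in ARCHIVED.
(define not-archived
  (lset-difference eq? on-gforge archived))

(write not-archived)
(newline)
```

Comparing with ‘eq?’ is enough in the session above because
‘archived-source’ is obtained by filtering ‘packages-on-gforge’, so both
lists hold the very same objects.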

There are redirects in place for some of these, but probably not all.

Ludo’.
zimoun wrote on 4 Oct 2021 19:50
Re: bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020
(name . Ludovic Courtès)(address . ludovic.courtes@inria.fr)
87bl44vfvg.fsf_-_@gmail.com
Hi Ludo,

On Mon, 04 Oct 2021 at 17:59, Ludovic Courtès <ludovic.courtes@inria.fr> wrote:

> Believe it or not, gforge.inria.fr was finally phased out on
> Sept. 30th. And believe it or not, despite all the work and all the
> chat :-), we lost the source tarball of Scotch 6.1.1 for a short period
> of time (I found a copy and uploaded it to berlin a couple of hours
> ago).

Uh, I do not understand. According to bug#43442 [1] on Wed, 16 Sep 2020,
Scotch was not missing. Nor was it according to [2].

Nah, the hole is the (double) update (from 6.0.6 to 6.1.0, then 6.1.1)
made without manually taking care of this bug report, for instance by
switching from url-fetch to git-fetch. Somehow, it was bound to happen
because we lack automatic tools, despite the fact that they are there.

Indeed, hard to believe. :-)

As I am asking in this thread [3], the Guix project has the resources,
storage-wise, to archive these tarballs while waiting for a robust
long-term automatic system. But we (the Guix project) cannot, because we
duplicate effort by keeping all the build outputs twice. Somehow,
between Berlin and Bordeaux, coherent policies for conservancy are
missing. IMHO.



> All this to say that we must really get our act together with Disarchive
> :-), and salvage all these tarballs until then.

Definitely! We are witnessing missing tarballs here. But many more
could be missing from Berlin or Bordeaux, and their upstream could have
disappeared as well.


Cheers,
simon
Ludovic Courtès wrote on 7 Oct 2021 18:07
(name . zimoun)(address . zimon.toutoune@gmail.com)
87o880byyz.fsf@inria.fr
Hi!

zimoun <zimon.toutoune@gmail.com> writes:

> Uh, I do not understand. According to bug#43442 [1] on Wed, 16 Sep 2020,
> Scotch was not missing. Nor was it according to [2].
>
> Nah, the hole is the (double) update (from 6.0.6 to 6.1.0, then 6.1.1)
> made without manually taking care of this bug report, for instance by
> switching from url-fetch to git-fetch. Somehow, it was bound to happen
> because we lack automatic tools, despite the fact that they are there.
>
> Indeed, hard to believe. :-)

I guess, in our mind, the problem was fixed long ago. :-)

> As I am asking in this thread [3], the Guix project has the resources,
> storage-wise, to archive these tarballs while waiting for a robust
> long-term automatic system. But we (the Guix project) cannot, because we
> duplicate effort by keeping all the build outputs twice. Somehow,
> between Berlin and Bordeaux, coherent policies for conservancy are
> missing. IMHO.

So I think we’re lucky that we can try different solutions at once.

The best solution is the one that won’t rely solely on the Guix project:
SWH + Disarchive. We’re getting there!

The second-best solution is to improve our tooling so we can actually
keep source code in a more controlled way. That’s what I had in mind
with https://ci.guix.gnu.org/jobset/source. We have storage space for
that on berlin, but it’s not infinite.

Another approach is to use ‘git-fetch’ more, at least for non-Autotools
packages (that’s the case for Scotch, for instance.)

So we can do all these things, and we’ll have to push hard to get the
Disarchive option past the finish line because it’s the most promising
long-term.

Thanks,
Ludo’.
zimoun wrote on 11 Oct 2021 10:41
(name . Ludovic Courtès)(address . ludovic.courtes@inria.fr)
CAJ3okZ2WCpzAUgBGZ1JaJmKkEmjjpFfy8hkBD854CD9vLiDHSw@mail.gmail.com
Hi,

On Thu, 7 Oct 2021 at 18:07, Ludovic Courtès <ludovic.courtes@inria.fr> wrote:

> I guess, in our mind, the problem was fixed long ago. :-)

Yes, to me the 2 remaining packages were those from
http://issues.guix.gnu.org/43442#0, but they had already moved to GitLab.
Whatever. :-)


> > As I am asking in this thread [3], the Guix project has the resources,
> > storage-wise, to archive these tarballs while waiting for a robust
> > long-term automatic system. But we (the Guix project) cannot, because we
> > duplicate effort by keeping all the build outputs twice. Somehow,
> > between Berlin and Bordeaux, coherent policies for conservancy are
> > missing. IMHO.
>
> So I think we’re lucky that we can try different solutions at once.

Well, it is not what I am observing. Anyway. :-)


> The best solution is the one that won’t rely solely on the Guix project:
> SWH + Disarchive. We’re getting there!

Yes. Although, it is hard to define "the Guix project". :-)
Well, the remaining question is where to set the Disarchive
database... but hardware could be floating around once it is ready.
;-)


> The second-best solution is to improve our tooling so we can actually
> keep source code in a more controlled way. That’s what I had in mind
> with <https://ci.guix.gnu.org/jobset/source>. We have storage space for
> that on berlin, but it’s not infinite.

If Berlin has space, why are so many derivations missing when running
time-machine?

Well, aside from the fact that ci.guix.gnu.org fetches from the repo
every X minutes, i.e., drops all the commits (and the associated
derivations) pushed in the meantime, and that bordeaux.guix.gnu.org
fetches the commit batch from guix-commits, i.e., builds only one
commit of each batch.


> Another approach is to use ‘git-fetch’ more, at least for non-Autotools
> packages (that’s the case for Scotch, for instance.)

This is what I suggested when opening this thread [1] more than one
year ago. Reading the discussion and keeping in mind the inertia, I
do not think it is a viable path. For instance, you know all the
pitfalls and you updated Scotch without switching to git-fetch -- no
criticism :-), just a realistic matter of fact about getting good coverage.

[1] <https://lists.gnu.org/archive/html/guix-devel/2020-05/msg00224.html>



> So we can do all these things, and we’ll have to push hard to get the
> Disarchive option past the finish line because it’s the most promising
> long-term.

Agreed. I even think it is the only long-term option. :-)


All the best,
simon
Ludovic Courtès wrote on 12 Oct 2021 11:24
(name . zimoun)(address . zimon.toutoune@gmail.com)
87czoay4sq.fsf@inria.fr
Hello!

I sense a lot of impatience in your message :-), and I also see many
questions. It is up to us all to answer them, I’ll just reply
selectively here.

zimoun <zimon.toutoune@gmail.com> writes:

> On Thu, 7 Oct 2021 at 18:07, Ludovic Courtès <ludovic.courtes@inria.fr> wrote:

[...]

>> The second-best solution is to improve our tooling so we can actually
>> keep source code in a more controlled way. That’s what I had in mind
>> with <https://ci.guix.gnu.org/jobset/source>. We have storage space for
>> that on berlin, but it’s not infinite.
>
> If Berlin has space, why are so many derivations missing when running
> time-machine?

That’s not related to the question at hand, but it would be worth
investigating, first by trying to quantify that.

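For the record, the ‘guix publish’ config on berlin is here:

https://git.savannah.gnu.org/cgit/guix/maintenance.git/tree/hydra/modules/sysadmin/services.scm#n485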

If I read that correctly, nars have a TTL of 180 days (this is the time
a nar is retained after the last time it has been requested, so it’s a
lower bound.)
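
To make the TTL concrete, here is a minimal sketch of how such a retention
period might be declared for the ‘guix publish’ service. It is not berlin's
actual configuration (see the link above for that); the port and cache
directory are placeholder assumptions, and only the ‘ttl’ value mirrors the
180 days mentioned here.

```scheme
;; Sketch only; not the configuration deployed on berlin.
(use-modules (gnu services)
             (gnu services base))

(define %publish-service
  (service guix-publish-service-type
           (guix-publish-configuration
            (port 3000)                          ;placeholder port
            (cache "/var/cache/guix/publish")    ;placeholder cache directory
            ;; Time-to-live of published archives, in seconds: 180 days.
            (ttl (* 180 24 3600)))))
```

The value is a lower bound on retention rather than a deadline: a cached nar
only becomes eligible for removal once nobody has requested it for that long.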

>> Another approach is to use ‘git-fetch’ more, at least for non-Autotools
>> packages (that’s the case for Scotch, for instance.)
>
> This is what I suggested when opening this thread [1] more than one
> year ago. Reading the discussion and keeping in mind the inertia, I
> do not think it is a viable path. For instance, you know all the
> pitfalls and you updated Scotch without switching to git-fetch -- no
> criticism :-), just a realistic matter of fact about getting good coverage.
>
> <https://lists.gnu.org/archive/html/guix-devel/2020-05/msg00224.html>

Right, and I agree Scotch is a package that can definitely use
‘git-fetch’ (there are bootstrapping considerations of packages low in
the stack, for instance you wouldn’t want to have Git fetched over
‘git-fetch’, but for packages like this there’s no reason not to use
‘git-fetch’.)

Thanks,
Ludo’.
zimoun wrote on 12 Oct 2021 12:50
(name . Ludovic Courtès)(address . ludovic.courtes@inria.fr)
86h7dmms8c.fsf@gmail.com
Hi Ludo,

On Tue, 12 Oct 2021 at 11:24, Ludovic Courtès <ludovic.courtes@inria.fr> wrote:

> I sense a lot of impatience in your message :-), and I also see many
> questions. It is up to us all to answer them, I’ll just reply
> selectively here.

Impatience? Probably. :-)


> zimoun <zimon.toutoune@gmail.com> writes:
>> On Thu, 7 Oct 2021 at 18:07, Ludovic Courtès <ludovic.courtes@inria.fr> wrote:
>
> [...]
>
>>> The second-best solution is to improve our tooling so we can actually
>>> keep source code in a more controlled way. That’s what I had in mind
>>> with <https://ci.guix.gnu.org/jobset/source>. We have storage space for
>>> that on berlin, but it’s not infinite.
>>
>> If Berlin has space, why are so many derivations missing when running
>> time-machine?
>
> That’s not related to the question at hand, but it would be worth
> investigating, first by trying to quantify that.

The question seems related. :-) Because you are saying “we have storage
space for that on Berlin”…

> For the record, the ‘guix publish’ config on berlin is here:
>
> https://git.savannah.gnu.org/cgit/guix/maintenance.git/tree/hydra/modules/sysadmin/services.scm#n485
>
> If I read that correctly, nars have a TTL of 180 days (this is the time
> a nar is retained after the last time it has been requested, so it’s a
> lower bound.)

…and the NARs are more or less removed after 180 days if no one asked
for them during these 180 days, IIUC. This policy seems to keep the
size of the storage under control, I guess. And I am providing an annoying
example of such a policy. :-)

Anyway, I agree it is not, for now, the core of the question at
hand. :-)


About quantifying, it is clearly not related to the question at
hand. ;-)

Just for the record, a back-of-the-envelope computation. 180 days before
today was April 15th (M-x calendar C-u 180 C-b). That means 6996 commits
(35aaf1fe10 is my current last commit).

git log --format="%cd" --after=2021-04-15 | wc -l
6996

However, these commits are pushed by batch. Roughly, it reads:

git log --format="%cd" --after=2021-04-15 --date=unix \
| awk 'NR == 1{old= $1; next}{print old - $1; old = $1}' \
| sort -n | uniq -c | grep -e "0$" | head
1 -1542620
3388 0
14 10
6 20
5 30
2 40
4 50
1 60
2 70
2 80

(Take the ’awk’ with care, I am not sure of what I am doing. :-) And
it is rough because of timezones, etc.)

In other words, 3388/6996 = ~50% of commits are pushed at the same time,
i.e., missed by both build farms, which use 2 different strategies to
collect the things to build (fetch every 5 minutes or fetch from
guix-commits). It is a quick back-of-the-envelope estimate, so take it
with a grain of salt. :-)
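
The same estimate can be sketched in Guile rather than awk. This is only an
illustration under stated assumptions: the procedure names are made up for
the example, the sample timestamps are invented rather than taken from the
Guix history, and the last call simply reuses the two figures quoted above.

```scheme
(use-modules (srfi srfi-1))

;; TIMESTAMPS is a list of commit times in seconds, in 'git log' order.
;; Return how many commits carry exactly the same timestamp as the commit
;; listed right after them.
(define (same-timestamp-count timestamps)
  (if (null? timestamps)
      0
      (count = timestamps (cdr timestamps))))

;; Share of same-timestamp commits among TOTAL commits, as a percentage.
(define (batch-percentage same total)
  (* 100. (/ same total)))

;; Invented timestamps, not real Guix history.
(define sample '(500 500 440 440 440 100))

(display (same-timestamp-count sample))
(newline)

;; The figures quoted in the message above.
(display (batch-percentage 3388 6996))
(newline)
```

With the quoted figures the share comes out a little above 48%, consistent
with the rough “~50%”.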

With that number, after 180 days (6 months), it is hard to evaluate the
rate of the time-machine queries. And from my experience (no numbers to
back it up), running time-machine on a commit older than these 180 days
implies building derivations. Or it is a lucky day. :-)

Drifting, right? Let's focus on the question at hand. However, this
question of long-term policy asked at:


appears worthwhile to me. :-)

Cheers,
simon
raingloom wrote on 9 Oct 2021 19:29
(address . bug-guix@gnu.org)
20211009192936.64a1ed01@riseup.net
On Thu, 07 Oct 2021 18:07:16 +0200
Ludovic Courtès <ludovic.courtes@inria.fr> wrote:

> Hi!
>
> zimoun <zimon.toutoune@gmail.com> writes:
>
> > Uh, I do not understand. According to bug#43442 [1] on Wed, 16 Sep
> > 2020, Scotch was not missing. Nor was it according to [2].
> >
> > Nah, the hole is the (double) update (from 6.0.6 to 6.1.0, then
> > 6.1.1) made without manually taking care of this bug report, for
> > instance by switching from url-fetch to git-fetch. Somehow, it
> > was bound to happen because we lack automatic tools, despite the
> > fact that they are there.
> >
> > Indeed, hard to believe. :-)
>
> I guess, in our mind, the problem was fixed long ago. :-)
>
> > As I am asking in this thread [3], the Guix project has the
> > resources, storage-wise, to archive these tarballs while waiting
> > for a robust long-term automatic system. But we (the Guix project)
> > cannot, because we duplicate effort by keeping all the build
> > outputs twice. Somehow, between Berlin and Bordeaux, coherent
> > policies for conservancy are missing. IMHO.
>
> So I think we’re lucky that we can try different solutions at once.
>
> The best solution is the one that won’t rely solely on the Guix
> project: SWH + Disarchive. We’re getting there!
>
> The second-best solution is to improve our tooling so we can actually
> keep source code in a more controlled way. That’s what I had in mind
> with <https://ci.guix.gnu.org/jobset/source>. We have storage space
> for that on berlin, but it’s not infinite.
>
> Another approach is to use ‘git-fetch’ more, at least for
> non-Autotools packages (that’s the case for Scotch, for instance.)

Out of curiosity, why only non-autotools?