From: Ludovic Courtès
To: zimoun
Cc: 42162@debbugs.gnu.org, Maurice Brémond
Subject: Re: Recovering source tarballs
Date: Mon, 20 Jul 2020 10:39:06 +0200
Message-ID: <87365mzil1.fsf@gnu.org>
References: <87mu4iv0gc.fsf@inria.fr> <86h7uq8fmk.fsf@gmail.com> <87d05etero.fsf@gnu.org> <87r1tit5j6.fsf_-_@gnu.org>

Hi!

There are many many comments in your message, so I took the liberty to
reply only to the essence of it. :-)

zimoun wrote:

> On Sat, 11 Jul 2020 at 17:50, Ludovic Courtès wrote:
>
>> For now, since 70% of our packages use ‘url-fetch’, we need to be
>> able to fetch or to reconstruct tarballs. There’s no way around it.
>
> Yes, but for example all the packages in gnu/packages/bioconductor.scm
> could be "git-fetch". Today the source is over url-fetch but it could
> be over git-fetch with https://git.bioconductor.org/packages/flowCore or
> git@git.bioconductor.org:packages/flowCore.
>
> Another example is the packages in gnu/packages/emacs-xyz.scm and the
> ones from elpa.gnu.org are "url-fetch" and could be "git-fetch", for
> example using
> http://git.savannah.gnu.org/gitweb/?p=emacs/elpa.git;a=tree;f=packages/ace-window;h=71d3eb7bd2efceade91846a56b9937812f658bae;hb=HEAD
>
> So I would be more reserved about the "no way around it". :-) I mean
> the 70% could be a bit mitigated.

The “no way around it” was about the situation today: it’s a fact that
70% of packages are built from tarballs, so we need to be able to fetch
them or reconstruct them. However, the two examples above are good
ideas as to the way forward: we could start a url-fetch-to-git-fetch
migration in these two cases, and perhaps more.

>> In the short term, we should arrange so that the build farm keeps GC
>> roots on source tarballs for an indefinite amount of time. Cuirass
>> jobset? Mcron job to preserve GC roots? Ideas?
>
> Yes, preserving source tarballs for an indefinite amount of time will
> help. At least all the packages where "lookup-content" returns #f,
> which means they are not in SWH or they are unreachable -- both are
> equivalent from the Guix side.
>
> What about in addition push to IPFS?
> Feasible? Lookup issue?

Lookup issue. :-) The hash in a CID is not just a raw blob hash.
Files are typically chunked beforehand, assembled as a Merkle tree, and
the CID is roughly the hash of the tree root. So it would seem we can’t
use IPFS as-is for tarballs.

>> For the future, we could store nar hashes of unpacked tarballs instead
>> of hashes over tarballs. But that raises two questions:
>>
>>   • If we no longer deal with tarballs but upstreams keep signing
>>     tarballs (not raw directory hashes), how can we authenticate our
>>     code after the fact?
>
> Does Guix automatically authenticate code using signed tarballs?

Not automatically; packagers are supposed to authenticate code when they
add a package (‘guix refresh -u’ does that automatically).

>>   • SWH internally stores Git-tree hashes, not nar hashes, so we still
>>     wouldn’t be able to fetch our unpacked trees from SWH.
>>
>> (Both issues were previously discussed at
>> .)
>>
>> So for the medium term, and perhaps for the future, a possible option
>> would be to preserve tarball metadata so we can reconstruct them:
>>
>>   tarball = metadata + tree
>
> There are different issues at different levels:
>
>  1. how to lookup? what information do we need to keep/store to be able
>     to query SWH?
>  2. how to check the integrity? what information do we need to
>     keep/store to be able to verify that SWH returns what Guix expects?
>  3. how to authenticate? where does the tarball metadata have to be
>     stored if SWH removes it?
>
> Basically, the git-fetch source stores 3 identifiers:
>
>  - upstream url
>  - commit / tag
>  - integrity (sha256)
>
> Fetching from SWH requires the commit only (lookup-revision) or the
> tag+url (lookup-origin-revision) then from the returned revision, the
> integrity of the downloaded data is checked using the sha256, right?

Yes.

> Therefore, one way to fix lookup of the url-fetch source is to add an
> extra field mimicking the commit role.

But today, we store tarball hashes, not directory hashes.

> The easiest is to store a SWHID or an identifier allowing to deduce the
> SWHID.
>
> I have not checked the code, but something like this:
>
>   https://pypi.org/project/swh.model/
>   https://forge.softwareheritage.org/source/swh-model/
>
> and at package time, this identifier is added, similarly to integrity.

I’m skeptical about adding a field that is practically never used.

[...]

>> The code below can “disassemble” and “assemble” a tar. When it
>> disassembles it, it generates metadata like this:
>
> [...]
>
>> The ‘assemble-archive’ procedure consumes that, looks up file contents
>> by hash on SWH, and reconstructs the original tarball…
>
> Where do you plan to store the "disassembled" metadata?
> And where do you plan to "assemble-archive"?

We’d have a repo/database containing metadata indexed by tarball sha256.

> How should this database that maps tarball hashes to metadata be
> maintained? Git push hook? Cron task?

Yes, something like that. :-)

> What about foreign channels? Should they maintain their own map?

Yes, presumably.

> To summarize, it would work like this, right?
>
> at package time:
>  - store an integrity identifier (today sha256-nix-base32)
>  - disassemble the tarball
>  - commit to another repo the metadata using the path (address)
>    sha256/base32/
>  - push to packages-repo *and* metadata-database-repo
>
> at future time: (upstream has disappeared, say!)
>  - use the integrity identifier to query the database repo
>  - lookup the SWHID from the database repo
>  - fetch the data from SWH
>  - or lookup the IPFS identifier from the database repo and fetch the
>    data from IPFS, for another example
>  - re-assemble the tarball using the metadata from the database repo
>  - check integrity, authentication, etc.

That’s the idea.

> The format of metadata (disassemble) that you propose is schemish
> (obviously!
> :-)) but we could propose something more JSON-like.

Sure, if that helps get other people on-board, why not (though sexps
have lived much longer than JSON and XML together :-)).

Thanks,
Ludo’.
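
To make the “tarball = metadata + tree” idea and the recovery workflow
summarized above more concrete, here is a minimal Python sketch. It is
not the Scheme code referred to in this thread; the metadata schema and
the names disassemble, assemble, recover, metadata_db and lookup are
all hypothetical and chosen for illustration. It assumes an
uncompressed ustar archive written by the same tar implementation
(byte-exact reconstruction of arbitrary upstream tarballs, and of their
compression layer, is a harder problem), and it stands in for SWH with
a plain dictionary of file contents indexed by hash.

```python
import hashlib
import io
import tarfile

# Header fields kept per tar member (hypothetical schema, for illustration).
FIELDS = ("name", "mode", "uid", "gid", "uname", "gname",
          "mtime", "type", "linkname", "size")


def sha256_hex(data):
    return hashlib.sha256(data).hexdigest()


def disassemble(tar_bytes):
    """Split an uncompressed tarball into (metadata, blobs).

    metadata is an ordered list of per-member header fields plus the
    content hash; blobs maps content hash -> file contents (the "tree").
    """
    metadata, blobs = [], {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tar:
        for member in tar.getmembers():
            entry = {field: getattr(member, field) for field in FIELDS}
            entry["hash"] = None
            if member.isfile():
                data = tar.extractfile(member).read()
                entry["hash"] = sha256_hex(data)
                blobs[entry["hash"]] = data
            metadata.append(entry)
    return metadata, blobs


def assemble(metadata, lookup):
    """Rebuild a tarball from metadata; lookup(hash) returns file contents."""
    out = io.BytesIO()
    with tarfile.open(fileobj=out, mode="w:",
                      format=tarfile.USTAR_FORMAT) as tar:
        for entry in metadata:
            info = tarfile.TarInfo(entry["name"])
            for field in FIELDS:
                setattr(info, field, entry[field])
            if entry["hash"] is None:
                tar.addfile(info)  # directory, symlink, etc.: header only
            else:
                tar.addfile(info, io.BytesIO(lookup(entry["hash"])))
    return out.getvalue()


def recover(tarball_sha256, metadata_db, lookup):
    """Look up metadata by tarball hash, reassemble, and check integrity."""
    rebuilt = assemble(metadata_db[tarball_sha256], lookup)
    if sha256_hex(rebuilt) != tarball_sha256:
        raise ValueError("reassembled tarball does not match recorded hash")
    return rebuilt


def make_example_tarball():
    """Build a small in-memory tarball to exercise the round trip."""
    out = io.BytesIO()
    files = (("hello-1.0/README", b"Hello!\n"),
             ("hello-1.0/hello.c", b"int main(void) { return 0; }\n"))
    with tarfile.open(fileobj=out, mode="w:",
                      format=tarfile.USTAR_FORMAT) as tar:
        for name, data in files:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            info.mtime = 1595234346
            tar.addfile(info, io.BytesIO(data))
    return out.getvalue()


if __name__ == "__main__":
    original = make_example_tarball()
    metadata, blobs = disassemble(original)
    # "Package time": index the metadata by the tarball's sha256.
    metadata_db = {sha256_hex(original): metadata}
    # "Future time": rebuild from metadata plus content-addressed blobs.
    rebuilt = recover(sha256_hex(original), metadata_db, blobs.__getitem__)
    print(rebuilt == original)
```

The point of the sketch is that the package definition keeps carrying
only the tarball hash: that hash indexes the metadata, the metadata
lists content hashes, and the final integrity check is done against the
original tarball hash, so no extra field is needed in the package.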