substitute server connection timeout

OpenSubmitted by Mathieu Othacehe.
Details
3 participants
  • Ludovic Courtès
  • Christopher Baines
  • Mathieu Othacehe
Owner
unassigned
Severity
important
M
M
Mathieu Othacehe wrote on 16 May 2021 19:57
(address . bug-guix@gnu.org)
87lf8e4l42.fsf@gnu.org
Hello,

We recently have a lot of those errors on Cuirass:

Toggle snippet (5 lines)
guix substitute: warning: while fetching http://141.80.167.131:5557/nar/g7ka09613k5v1vlznh87yg35905ggw51-python2-scipy-1.2.2-guile-builder: server is somewhat slow
guix substitute: warning: try `--no-substitutes' if the problem persists
guix substitute: error: connect*: Connection timed out

which means that the workers are failing to connect to the Cuirass
remote-server publish server on berlin at 141.80.167.131:5557.

Stracing this publish server shows that connection reuse seems to be
broken:

Toggle snippet (23 lines)
accept4(9, {sa_family=AF_INET, sin_port=htons(41742), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
accept4(9, {sa_family=AF_INET, sin_port=htons(41744), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
accept4(9, {sa_family=AF_INET, sin_port=htons(41746), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 25
accept4(9, {sa_family=AF_INET, sin_port=htons(41748), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 24
accept4(9, {sa_family=AF_INET, sin_port=htons(41750), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
accept4(9, {sa_family=AF_INET, sin_port=htons(41752), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
accept4(9, {sa_family=AF_INET, sin_port=htons(41754), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 25
accept4(9, {sa_family=AF_INET, sin_port=htons(41756), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
accept4(9, {sa_family=AF_INET, sin_port=htons(41758), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 26
accept4(9, {sa_family=AF_INET, sin_port=htons(41760), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 24
accept4(9, {sa_family=AF_INET, sin_port=htons(41762), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
accept4(9, {sa_family=AF_INET, sin_port=htons(41764), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
accept4(9, {sa_family=AF_INET, sin_port=htons(41766), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
accept4(9, {sa_family=AF_INET, sin_port=htons(41768), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 22
accept4(9, {sa_family=AF_INET, sin_port=htons(41770), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
accept4(9, {sa_family=AF_INET, sin_port=htons(41772), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
accept4(9, {sa_family=AF_INET, sin_port=htons(41774), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
accept4(9, {sa_family=AF_INET, sin_port=htons(41776), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
accept4(9, {sa_family=AF_INET, sin_port=htons(41778), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
accept4(9, {sa_family=AF_INET, sin_port=htons(41780), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
accept4(9, {sa_family=AF_INET, sin_port=htons(41782), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21

Investigating it, I found that the connection is closed and opened
multiple times in the call-with-cached-connection procedure of the (guix
script substitute) module.

It looks like its because a 'bad-headers exception is raised when trying
to parse an eof object:

Toggle snippet (3 lines)
;;; (error bad-header (read-header-line #<eof>))

I'm not sure where this eof comes from. There is this comment in the
http-multiple-get procedure in (guix http-client):

Toggle snippet (10 lines)
;; Swallow networking errors that could occur due to connection reuse
;; and the like; they will be handled down the road when trying to
;; read responses.
(false-if-networking-error
(begin
(for-each (cut write-request <> buffer) batch)
(put-bytevector p (get))
(force-output p))))

which would suggest that connection reuse could cause networking errors?

What also puzzles me it that the main guix publish server on berlin does
not seem to present this issue. That would indicate that this error is
caused by how the Cuirass remote-server publish server is started or
configured.

Ludo, Chris, any idea?

I will keep searching anyway :)

Thanks,

Mathieu
C
C
Christopher Baines wrote on 16 May 2021 20:26
(name . Mathieu Othacehe)(address . othacehe@gnu.org)
87eee6cz75.fsf@cbaines.net
Mathieu Othacehe <othacehe@gnu.org> writes:

Toggle quote (74 lines)
> Hello,
>
> We recently have a lot of those errors on Cuirass:
>
> --8<---------------cut here---------------start------------->8---
> guix substitute: warning: while fetching http://141.80.167.131:5557/nar/g7ka09613k5v1vlznh87yg35905ggw51-python2-scipy-1.2.2-guile-builder: server is somewhat slow
> guix substitute: warning: try `--no-substitutes' if the problem persists
> guix substitute: error: connect*: Connection timed out
> --8<---------------cut here---------------end--------------->8---
>
> which means that the workers are failing to connect to the Cuirass
> remote-server publish server on berlin at 141.80.167.131:5557.
>
> Stracing this publish server shows that connection reuse seems to be
> broken:
>
> --8<---------------cut here---------------start------------->8---
> accept4(9, {sa_family=AF_INET, sin_port=htons(41742), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41744), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41746), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 25
> accept4(9, {sa_family=AF_INET, sin_port=htons(41748), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 24
> accept4(9, {sa_family=AF_INET, sin_port=htons(41750), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41752), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41754), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 25
> accept4(9, {sa_family=AF_INET, sin_port=htons(41756), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41758), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 26
> accept4(9, {sa_family=AF_INET, sin_port=htons(41760), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 24
> accept4(9, {sa_family=AF_INET, sin_port=htons(41762), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41764), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41766), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41768), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 22
> accept4(9, {sa_family=AF_INET, sin_port=htons(41770), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41772), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41774), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41776), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41778), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41780), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> accept4(9, {sa_family=AF_INET, sin_port=htons(41782), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
> --8<---------------cut here---------------end--------------->8---
>
> Investigating it, I found that the connection is closed and opened
> multiple times in the call-with-cached-connection procedure of the (guix
> script substitute) module.
>
> It looks like its because a 'bad-headers exception is raised when trying
> to parse an eof object:
>
> --8<---------------cut here---------------start------------->8---
> ;;; (error bad-header (read-header-line #<eof>))
> --8<---------------cut here---------------end--------------->8---
>
> I'm not sure where this eof comes from. There is this comment in the
> http-multiple-get procedure in (guix http-client):
>
> --8<---------------cut here---------------start------------->8---
> ;; Swallow networking errors that could occur due to connection reuse
> ;; and the like; they will be handled down the road when trying to
> ;; read responses.
> (false-if-networking-error
> (begin
> (for-each (cut write-request <> buffer) batch)
> (put-bytevector p (get))
> (force-output p))))
> --8<---------------cut here---------------end--------------->8---
>
> which would suggest that connection reuse could cause networking errors?
>
> What also puzzles me it that the main guix publish server on berlin does
> not seem to present this issue. That would indicate that this error is
> caused by how the Cuirass remote-server publish server is started or
> configured.
>
> Ludo, Chris, any idea?

While I've been working in this area, I've actually been trying to pick
apart the connection caching, since the single thread assumption doesn't
hold in the Guix Build Coordinator.

Anyway, I do have a theory. Assuming I'm correct in saying that there's
no nginx between the client and publish server here, I think that's your
configuration difference.

For ci.guix.gnu.org, as well as data.guix.gnu.org, it's NGinx which is
keeping connections alive. I'm not sure the Guile code for the publish
server does similarly, so talking to it directly might be different.

That's on the server side, the actual problem is probably on the client
side, as I guess there are possibly places where closed connections
aren't handled properly. This reminds me I sent some patches relating to
closing connections, this could well be related [1].

-----BEGIN PGP SIGNATURE-----

iQKlBAEBCgCPFiEEPonu50WOcg2XVOCyXiijOwuE9XcFAmChY89fFIAAAAAALgAo
aXNzdWVyLWZwckBub3RhdGlvbnMub3BlbnBncC5maWZ0aGhvcnNlbWFuLm5ldDNF
ODlFRUU3NDU4RTcyMEQ5NzU0RTBCMjVFMjhBMzNCMEI4NEY1NzcRHG1haWxAY2Jh
aW5lcy5uZXQACgkQXiijOwuE9XfG9w/+KvzCHwfnB+eeEd6i0xTvTfc8+oLxb+GS
j1TSbXSGR02ByGMGIbJeAMmNPV20V9Lzi24a49451fskUZbCGeg6aS2rO3c9QpUX
jbF7//NW8zwgmhO/Hh9dvB4v+q01+VwrzE3c6ssAVO6772ZrTJqGA7oCU9UqD1w6
dQYUto8iSxa95UhFcNFMtGcBAevdoN4Ku9OpkqIt8uViCED+YM6a9Ej4d9f9oC8h
vxw6EU0YsCrobJQJ8KW56N9uR4oeI4IqWevou2LbSAmcHEWmGHAF6XDgxu0594Ip
YOFuqso4nuRxU3ffzgEupMbJA07yrwf0EB4lreqsSfU8gtZupFMLgyc8PviyXjw3
XPpo3PAukKhsVZyQlnTc6ldatWWVHIeTPy8vxdr6GvSQnzliaQmzpHj7xVqkfS3f
R1gz5hEXoS0o/CXpVkpnaYMC8m+Y2ujM5AbEUBVo6itR6mmUCCF5D9q/BtosjQvu
8wrLes42iAqwlnNeg8UGKra/cwJ0YGVrOInsOyXX97jz3UVjkxV7jq53+vkw3fGZ
eyHDugAJ8DKw9js8BX88AmXaV/Art4M0qG6D6Zp+MvIUbU9z+oNeTjAXgGewxntw
sfIdxHju+DNs5x5je3pTzyRSDPEfle8Xvd58DbGyAIoP8v7C7LYPmxVID7R0D/cs
ccIQKCj/kqg=
=yVQQ
-----END PGP SIGNATURE-----

M
M
Mathieu Othacehe wrote on 17 May 2021 16:49
(name . Christopher Baines)(address . mail@cbaines.net)
87eee59zzm.fsf@gnu.org
Hey,

Toggle quote (5 lines)
> That's on the server side, the actual problem is probably on the client
> side, as I guess there are possibly places where closed connections
> aren't handled properly. This reminds me I sent some patches relating to
> closing connections, this could well be related [1].

Oh, you're right, the Nginx server probably makes the difference
here. The http-write procedure of the Guile (web server http) module
seems to handle keep-alive connections. However, the wrapping http-write
procedure of (guix scripts publish) may not.

I'll have a closer look, thanks for your help.

Mathieu
M
M
Mathieu Othacehe wrote on 18 May 2021 16:35
(name . Christopher Baines)(address . mail@cbaines.net)
878s4c85zu.fsf@gnu.org
Hey,

Toggle quote (2 lines)
> I'll have a closer look, thanks for your help.

So this snippet in the http-write procedure of the (guix scripts
publish) module:

Toggle snippet (4 lines)
(swallow-zlib-error
(close-port port))

is closing the client port unconditionally, which means that guix
publish cannot keep connections alive, unless sitting behind an Nginx
proxy.

I'm trying to turn the close-port call into a maybe-close-port with the
following procedure:

Toggle snippet (8 lines)
(define (maybe-close-port port)
(cond
((keep-alive? response)
(poll-set-add! (http-poll-set server) port *events*))
(else
(close-port port))))

however this is terribly hacky, as I need to access the private poll-set
from (web server http).

Ludo, do you have a better idea?

Thanks,

Mathieu
M
M
Mathieu Othacehe wrote on 21 May 2021 15:30
(name . Christopher Baines)(address . mail@cbaines.net)
87cztkqkns.fsf@gnu.org
Hey,

I posted a patchset adding keep-alive support to guix publish earlier:

Thanks,

Mathieu
L
L
Ludovic Courtès wrote on 29 May 2021 23:04
control message for bug #48468
(address . control@debbugs.gnu.org)
877djh5k0g.fsf@gnu.org
severity 48468 important
quit
L
L
Ludovic Courtès wrote on 29 May 2021 23:44
Re: bug#48468: substitute server connection timeout
(name . Christopher Baines)(address . mail@cbaines.net)
87v97143lo.fsf@gnu.org
Hi,

Christopher Baines <mail@cbaines.net> skribis:

Toggle quote (2 lines)
> Mathieu Othacehe <othacehe@gnu.org> writes:

[...]

Toggle quote (10 lines)
>> Stracing this publish server shows that connection reuse seems to be
>> broken:
>>
>> --8<---------------cut here---------------start------------->8---
>> accept4(9, {sa_family=AF_INET, sin_port=htons(41742), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
>> accept4(9, {sa_family=AF_INET, sin_port=htons(41744), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21
>> accept4(9, {sa_family=AF_INET, sin_port=htons(41746), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 25
>> accept4(9, {sa_family=AF_INET, sin_port=htons(41748), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 24
>> accept4(9, {sa_family=AF_INET, sin_port=htons(41750), sin_addr=inet_addr("141.80.167.185")}, [112->16], 0) = 21

Ouch.

Toggle quote (27 lines)
>> Investigating it, I found that the connection is closed and opened
>> multiple times in the call-with-cached-connection procedure of the (guix
>> script substitute) module.
>>
>> It looks like its because a 'bad-headers exception is raised when trying
>> to parse an eof object:
>>
>> --8<---------------cut here---------------start------------->8---
>> ;;; (error bad-header (read-header-line #<eof>))
>> --8<---------------cut here---------------end--------------->8---
>>
>> I'm not sure where this eof comes from. There is this comment in the
>> http-multiple-get procedure in (guix http-client):
>>
>> --8<---------------cut here---------------start------------->8---
>> ;; Swallow networking errors that could occur due to connection reuse
>> ;; and the like; they will be handled down the road when trying to
>> ;; read responses.
>> (false-if-networking-error
>> (begin
>> (for-each (cut write-request <> buffer) batch)
>> (put-bytevector p (get))
>> (force-output p))))
>> --8<---------------cut here---------------end--------------->8---
>>
>> which would suggest that connection reuse could cause networking errors?

[...]

Toggle quote (4 lines)
> That's on the server side, the actual problem is probably on the client
> side, as I guess there are possibly places where closed connections
> aren't handled properly.

D’oh. So we should be able to use it by talking to a bare ‘guix
publish’? I tried and failed to reproduce it, but that’s probably not
deterministic.

Toggle quote (5 lines)
> This reminds me I sent some patches relating to closing connections,
> this could well be related [1].
>
> 1: https://issues.guix.gnu.org/47174

This one will properly close connections when the servers asks for it,
but would that make a difference?

Thanks,
Ludo’.
L
L
Ludovic Courtès wrote on 18 Jun 2021 14:33
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 48468@debbugs.gnu.org)
87im2bfj3i.fsf@gnu.org
Hi,

Mathieu Othacehe <othacehe@gnu.org> skribis:

Toggle quote (24 lines)
> Investigating it, I found that the connection is closed and opened
> multiple times in the call-with-cached-connection procedure of the (guix
> script substitute) module.
>
> It looks like its because a 'bad-headers exception is raised when trying
> to parse an eof object:
>
> ;;; (error bad-header (read-header-line #<eof>))
>
>
> I'm not sure where this eof comes from. There is this comment in the
> http-multiple-get procedure in (guix http-client):
>
> ;; Swallow networking errors that could occur due to connection reuse
> ;; and the like; they will be handled down the road when trying to
> ;; read responses.
> (false-if-networking-error
> (begin
> (for-each (cut write-request <> buffer) batch)
> (put-bytevector p (get))
> (force-output p))))
>
> which would suggest that connection reuse could cause networking errors?

Trying to see exactly which bit is at fault here. First,
‘http-multiple-get’ seems to be working as expected when passed a fresh
connection:

Toggle snippet (15 lines)
scheme@(guix http-client)> (http-multiple-get (string->uri "https://ci.guix.gnu.org")
(lambda (request response port result)
(get-bytevector-n port (response-content-length response))
(cons #t result))
'()
(make-list
5000
(build-request (build-uri 'https
#:host "ci.guix.gnu.org"
#:path "/d7gpr41qpsifri54vp2lzjs99zkylscq.narinfo"))))
connecting (5000 requests left)...connecting (4400 requests left)...connecting (3800 requests left)...connecting (3200 requests left)...connecting (2600 requests left)...connecting (2000 requests left)...connecting (1400 requests left)...connecting (800 requests left)...connecting (200 requests left)...$143 = (#t #t …)
scheme@(guix http-client)> (length $143)
$144 = 5000

But maybe I’m not looking at the right thing.

Do you have evidence or a reproducer?

Thanks,
Ludo’.
M
M
Mathieu Othacehe wrote on 29 Jun 2021 18:49
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 48468@debbugs.gnu.org)
87sg108vl5.fsf@gnu.org
Hey,

Toggle quote (4 lines)
> But maybe I’m not looking at the right thing.
>
> Do you have evidence or a reproducer?

Yes, adding the following debug message here:

Toggle snippet (12 lines)
--- a/guix/scripts/substitute.scm
+++ b/guix/scripts/substitute.scm
@@ -416,6 +416,7 @@ server certificates."
;; and retry. We might also get 'bad-response or a similar
;; exception from (web response) later on, once we've sent the
;; request, or a ERROR/INVALID-SESSION from GnuTLS.
+ (pk key args)
(if (or (and (eq? key 'system-error)
(= EPIPE (system-error-errno `(,key ,@args))))
(and (eq? key 'gnutls-error)

and using substitutes from a publish server without keep alive support
(before 0b8fa24), prints something like:

Toggle snippet (14 lines)
downloading from http://192.168.1.51:8080/nar/gzip/lkmiyfzi1bq571yn4gypbcs5vn4fpma7-texlive-cm-51265 ...
texlive-cm-51265 7.8MiB/s 00:00 | 1.7MiB transferred

substituting /gnu/store/l788x07ska5vffayz0gayv4hsx5flxal-module-import-compiled...

;;; (bad-header (read-header-line #<eof>))
downloading from http://192.168.1.51:8080/nar/gzip/l788x07ska5vffayz0gayv4hsx5flxal-module-import-compiled ...
module-import-compiled 7.5MiB/s 00:00 | 85KiB transferred

substituting /gnu/store/1s1lrnxlkjwxshk5q2w97ig3clc6n06f-ruby-2.6.5...

;;; (bad-header (read-header-line #<eof>))

That's because the connection is closed by the publish server each time
a NAR is sent. This particular behaviour is fixed by:

I hoped that it would decrease the load of the Cuirass publish server
that was flooded by worker connections. While the situation is better,
there are still a lot of substitute timeout errors on Cuirass.

I have also observed those timeouts using the publish server behind
ci.guix.gnu.org. This makes me think that there's something else that is
preventing the server from honoring every connection request in less
than 5 seconds (the timeout duration).

Thanks,

Mathieu
?