Bug #16393

keepproxy timeout on lugli

Added by Ward Vandewege 2 months ago. Updated about 2 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Keep
Target version:
Start date:
Due date:
% Done:

100%

Estimated time:
Story points:
-
Release relationship:
Auto

Description

Keepproxy on lugli seems to be suffering from a timeout error if it can't write an entire block's worth to keepstore within 15s:

May 01 18:44:39 ip-10-255-0-201 keepproxy[2367]: time="2020-05-01T18:44:39Z" level=info msg=request RequestID=req-35oyym3mwx6bb8qwm7sk remoteAddr="127.0.0.1:35746" reqBytes=67108864 reqForwardedFor=207.172.223.247 reqHost=keep.lugli.arvadosapi.com reqMethod=PUT reqPath=abc3977c2c709626b57927dc7388a9fc reqQuery=
May 01 18:44:39 ip-10-255-0-201 keepproxy[2367]: time="2020-05-01T18:44:39Z" level=info msg="DEBUG: [req-35oyym3mwx6bb8qwm7sk] Begin upload abc3977c2c709626b57927dc7388a9fc to http://keep0.lugli.arvadosapi.com:25107" 
May 01 18:44:39 ip-10-255-0-201 keepproxy[2367]: time="2020-05-01T18:44:39Z" level=info msg="DEBUG: [req-35oyym3mwx6bb8qwm7sk] Replicas remaining to write: 2 active uploads: 1" 
May 01 18:44:54 ip-10-255-0-201 keepproxy[2367]: time="2020-05-01T18:44:54Z" level=info msg="DEBUG: [req-35oyym3mwx6bb8qwm7sk] Upload failed http://keep0.lugli.arvadosapi.com:25107/abc3977c2c709626b57927dc7388a9fc error: Put http://keep0.lugli.arvadosapi.com:25107/abc3977c2c709626b57927dc7388a9fc: net/http: request canceled (Client.Timeout exceeded while awaiting headers)" 
May 01 18:44:54 ip-10-255-0-201 keepproxy[2367]: time="2020-05-01T18:44:54Z" level=info msg="DEBUG: [req-35oyym3mwx6bb8qwm7sk] Replicas remaining to write: 2 active uploads: 0" 
May 01 18:44:54 ip-10-255-0-201 keepproxy[2367]: time="2020-05-01T18:44:54Z" level=info msg="DEBUG: [req-35oyym3mwx6bb8qwm7sk] Begin upload abc3977c2c709626b57927dc7388a9fc to http://keep0.lugli.arvadosapi.com:25107" 
May 01 18:44:54 ip-10-255-0-201 keepproxy[2367]: time="2020-05-01T18:44:54Z" level=info msg="DEBUG: [req-35oyym3mwx6bb8qwm7sk] Replicas remaining to write: 2 active uploads: 1" 
May 01 18:45:09 ip-10-255-0-201 keepproxy[2367]: time="2020-05-01T18:45:09Z" level=info msg="DEBUG: [req-35oyym3mwx6bb8qwm7sk] Upload failed http://keep0.lugli.arvadosapi.com:25107/abc3977c2c709626b57927dc7388a9fc error: Put http://keep0.lugli.arvadosapi.com:25107/abc3977c2c709626b57927dc7388a9fc: net/http: request canceled (Client.Timeout exceeded while awaiting headers)" 
May 01 18:45:09 ip-10-255-0-201 keepproxy[2367]: time="2020-05-01T18:45:09Z" level=info msg="DEBUG: [req-35oyym3mwx6bb8qwm7sk] Replicas remaining to write: 2 active uploads: 0" 
May 01 18:45:09 ip-10-255-0-201 keepproxy[2367]: time="2020-05-01T18:45:09Z" level=info msg="DEBUG: [req-35oyym3mwx6bb8qwm7sk] Begin upload abc3977c2c709626b57927dc7388a9fc to http://keep0.lugli.arvadosapi.com:25107" 
May 01 18:45:09 ip-10-255-0-201 keepproxy[2367]: time="2020-05-01T18:45:09Z" level=info msg="DEBUG: [req-35oyym3mwx6bb8qwm7sk] Replicas remaining to write: 2 active uploads: 1" 
May 01 18:45:12 ip-10-255-0-201 keepproxy[2367]: time="2020-05-01T18:45:12Z" level=info msg="DEBUG: [req-35oyym3mwx6bb8qwm7sk] Upload http://keep0.lugli.arvadosapi.com:25107/abc3977c2c709626b57927dc7388a9fc success" 
May 01 18:45:12 ip-10-255-0-201 keepproxy[2367]: time="2020-05-01T18:45:12Z" level=info msg="207.172.223.247,127.0.0.1:35746 PUT /abc3977c2c709626b57927dc7388a9fc 200 67108864 2 2 abc3977c2c709626b57927dc7388a9fc+67108864+A5a1c5246763239c3a5d474cdcadba9850dff2b6e@5ebee338 <nil>" 
May 01 18:45:12 ip-10-255-0-201 keepproxy[2367]: time="2020-05-01T18:45:12Z" level=info msg=response RequestID=req-35oyym3mwx6bb8qwm7sk remoteAddr="127.0.0.1:35746" reqBytes=67108864 reqForwardedFor=207.172.223.247 reqHost=keep.lugli.arvadosapi.com reqMethod=PUT reqPath=abc3977c2c709626b57927dc7388a9fc reqQuery= respBytes=92 respStatus=OK respStatusCode=200 timeToStatus=33.241195 timeTotal=33.241230 timeWriteBody=0.000035

The third attempt works, it is much faster than the first 2.

Why does it start out so slow?

Associated revisions

Revision 696df231
Added by Ward Vandewege about 2 months ago

Merge branch '16393-keepproxy-timeout-with-slow-clients'

closes #16393

Arvados-DCO-1.1-Signed-off-by: Ward Vandewege <>

History

#1 Updated by Ward Vandewege 2 months ago

  • Description updated (diff)

What seems to be happening here is that a slow client is uploading data, which keepproxy is streaming through to keepstore. The client doesn't send the whole block within 15s, so keepproxy aborts the connection to keepstore:

May 01 18:44:54 ip-10-255-0-201 keepproxy[2367]: time="2020-05-01T18:44:54Z" level=info msg="DEBUG: [req-35oyym3mwx6bb8qwm7sk] Upload failed http://keep0.lugli.arvadosapi.com:25107/abc3977c2c709626b57927dc7388a9fc error: Put http://keep0.lugli.arvadosapi.com:25107/abc3977c2c709626b57927dc7388a9fc: net/http: request canceled (Client.Timeout exceeded while awaiting headers)" 

That happens once again at the 30s mark. Keepproxy keeps retrying the connection to keepstore, and by the third try the entire upload of the block completes before the 15s timeout window, and the upload to keesptore completes.

The client is none the wiser, the upload appears to complete without problems.

For even slower clients, though, the upload fails because the retry limit for the keepproxy-> keepstore connection is hit before all data can be uploaded to keepproxy.

#2 Updated by Ward Vandewege 2 months ago

  • Target version set to 2020-05-06 Sprint
  • Assigned To set to Ward Vandewege
  • Status changed from New to In Progress

Ready for review in defdea5df1baf318b3a34c60e2fe148afff66a51 on branch 16393-keepproxy-timeout-with-slow-clients

#3 Updated by Ward Vandewege 2 months ago

  • Release set to 33
  • Category set to Keep

#4 Updated by Peter Amstutz 2 months ago

Ward Vandewege wrote:

Ready for review in defdea5df1baf318b3a34c60e2fe148afff66a51 on branch 16393-keepproxy-timeout-with-slow-clients

Looking at the original code, it takes the timeout from the passed-in client struct, which appears to be one created at keepproxy.go:128 main(). So maybe the timeout should be set in main() instead?

#5 Updated by Ward Vandewege 2 months ago

Peter Amstutz wrote:

Ward Vandewege wrote:

Ready for review in defdea5df1baf318b3a34c60e2fe148afff66a51 on branch 16393-keepproxy-timeout-with-slow-clients

Looking at the original code, it takes the timeout from the passed-in client struct, which appears to be one created at keepproxy.go:128 main(). So maybe the timeout should be set in main() instead?

I don't think so; the keepclient.MakeKeepClient call does all sorts of stuff, including magic with KeepClient.foundNonDiskSvc, but all that is a red herring (and that magic flag is not exported... why do we have 2 timeout profiles and no way to tell the client which one to use?), I went down that road first and spent a good bit of time there.

The actual timeout that we were hitting was 15s; that was the default value for cluster.API.KeepServiceRequestTimeout, and the http.Client timeout for the connection to keepstore is set to that value (by way of MakeRESTRouter) on line 621 in keepproxy.go, which is what I changed.

And the fix works, as tested with a binary on lugli.

#6 Updated by Peter Amstutz about 2 months ago

My original comment was wrong.

The place where the timeout needs to be changed is on line 166:

@router = MakeRESTRouter(kc, time.Duration(cluster.API.KeepServiceRequestTimeout), cluster.ManagementToken)@

This sets the value of h.timeout that gets used to initialize the keepclient later on.

Hardcoding the timeout in makeKeepClient may break a couple of tests which call MakeRESTRouter() with a different timeout.

#7 Updated by Ward Vandewege about 2 months ago

Thanks, I made this change in b4603c6c684246942d6d8747d51b441b9ffc42fb. Tests pass, and I've tested the binary on lugli as well, it still works as expected.

#8 Updated by Peter Amstutz about 2 months ago

Ward Vandewege wrote:

Thanks, I made this change in b4603c6c684246942d6d8747d51b441b9ffc42fb. Tests pass, and I've tested the binary on lugli as well, it still works as expected.

Perfect, thank you, LGTM.

#9 Updated by Anonymous about 2 months ago

  • % Done changed from 0 to 100
  • Status changed from In Progress to Resolved

Also available in: Atom PDF