Bug #13959

crunch-dispatch-slurm / Go SDK Dispatcher can block indefinitely on d.Arv.List("containers", params, &list)

Added by Joshua Randall almost 2 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Start date: 08/21/2018
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -
Release:
Release relationship: Auto

Description

In the main loop of the Dispatcher's checkForUpdates function, the API List request that fetches a batch of matching containers sometimes appears to block forever. I'm not sure why this happens, but it is probably due to a network or API server issue. In any case, there should be a client-side timeout that prevents this loop from hanging.

Our current workaround is a cron job that runs `systemctl restart crunch-dispatch-slurm` every hour, so that a hung dispatcher is restarted at the top of the next hour.


Subtasks

Task #14055: Review 13959-timeouts-and-logging (Resolved, Tom Clegg)

Associated revisions

Revision 794cecd8
Added by Tom Clegg almost 2 years ago

Merge branch '13959-timeouts-and-logging'

fixes #13959

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

History

#1 Updated by Tom Morris almost 2 years ago

  • Target version set to To Be Groomed

#2 Updated by Tom Morris almost 2 years ago

  • Target version changed from To Be Groomed to 2018-09-05 Sprint

#3 Updated by Tom Clegg almost 2 years ago

  • Assigned To set to Tom Clegg

#4 Updated by Tom Clegg almost 2 years ago

13959-timeouts-and-logging @ f739d736bbb60a8463f04f5d56c18d09157d820e
  • default 5-minute timeout (instead of no timeout) on API calls in sdk/go/arvadosclient (we already have this in sdk/go/arvados)
  • move crunch-dispatch-slurm, crunch-dispatch-local, and dispatch library logging to logrus, making it easier to add Debugf() for future debugging
  • corrected log message, as reflected in the updated test expectation:
            testWithServerStub(c, apiStubResponses, "echo",
    -               `After echo process termination, container state for Running is "zzzzz-dz642-xxxxxxxxxxxxxx2".  Updating it to "Cancelled"`)
    +               `after "echo" process termination, container state for zzzzz-dz642-xxxxxxxxxxxxxx2 is "Running"; updating it to "Cancelled"`)
    

https://ci.curoverse.com/job/developer-run-tests/858/

#6 Updated by Peter Amstutz almost 2 years ago

nit, from https://github.com/Sirupsen/logrus README:

It's in the past been possible to import Logrus as both upper- and lower-case. Due to the Go package environment, this caused issues in the community and we needed a standard. Some environments experienced problems with the upper-case variant, so the lower-case was decided. Everything using logrus will need to use the lower-case: github.com/sirupsen/logrus. Any package that isn't, should be changed.

Rest LGTM.

#7 Updated by Tom Clegg almost 2 years ago

  • Status changed from New to In Progress

#8 Updated by Tom Clegg almost 2 years ago

  • Status changed from In Progress to Resolved

#9 Updated by Ward Vandewege almost 2 years ago

  • Release set to 13
