Bug #11843
Timeout attempting to create job for component gatk-haplotypecaller-cram-gvcf: Net::ReadTimeout
100%
Description
Lately we get a lot of jobs whose sole log output is:
Timeout attempting to create job for component gatk-haplotypecaller-cram-gvcf: Net::ReadTimeout
History
#1
Updated by Joshua Randall over 2 years ago
The logs table entries for a pipeline instance that had this issue are attached (logs.txt). They were retrieved with:
select * from logs where object_uuid = 'z8ta6-d1hrv-lgof5ansiks2owh' order by created_at;
#2
Updated by Tom Clegg over 2 years ago
It's mysterious that the error text doesn't match the most obvious bit of code that would produce such an error, i.e., source:sdk/cli/bin/arv-run-pipeline-instance -- even old versions.
debuglog "create job: #{j[:errors] rescue nil} with attributes #{body}", 0
msg = ""
j[:errors].each do |err|
msg += "Error creating job for component #{component}: #{err}\n"
end
msg += "Job submission was: #{body.to_json}"
pipeline.log_stderr(msg)
Aside from that mystery, it seems like the fix for this is to have arv-run-pipeline-instance treat a timeout or 5xx response from the API as a transient error: follow the "can't make progress right now" code path instead of "fail pipeline".
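A minimal sketch of that classification, for illustration only. It is not the actual a-r-p-i code: `transient_failure?` and `classify_create_job` are hypothetical names, and the `{status:, errors:}` hash returned by the block is an assumed shape rather than the real API client response. It relies only on the fact that `Net::ReadTimeout` and `Net::OpenTimeout` are subclasses of `Timeout::Error`.

```ruby
require 'net/http' # defines Net::ReadTimeout and Net::OpenTimeout

# Hypothetical helper: is this failure worth retrying later?
# `error` is a rescued exception (or nil); `status` is an HTTP status (or nil).
def transient_failure?(error: nil, status: nil)
  return true if error.is_a?(Timeout::Error)       # covers Net::ReadTimeout / Net::OpenTimeout
  return true if error.is_a?(Errno::ECONNREFUSED)  # API server not accepting connections
  return true if status && (500..599).cover?(status)
  false
end

# Hypothetical wrapper around a "create job" attempt.
# The block is assumed to return a hash like {status: 200, errors: nil}.
# Returns :created, :retry_later ("can't make progress right now"),
# or :fail_pipeline (the API explicitly rejected the job).
def classify_create_job
  result = yield
  return :retry_later if transient_failure?(status: result[:status])
  return :fail_pipeline if result[:errors] && !result[:errors].empty?
  :created
rescue StandardError => e
  # Non-transient exceptions are re-raised unchanged.
  raise unless transient_failure?(error: e)
  :retry_later
end
```

With this split, only an explicit rejection from the API reaches the "fail pipeline" path; a timeout or 5xx leaves the pipeline untouched so a later run can try again.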
#3
Updated by Tom Clegg over 2 years ago
- Status changed from New to In Progress
- Assigned To set to Tom Clegg
- Target version set to 2017-07-19 sprint
- Story points set to 0.5
11843-arpi-transient-error
#4
Updated by Lucas Di Pentima over 2 years ago
- Is it worth writing a test for this issue?
- Taking into account #3005, would it be convenient to only retry a defined number of times? Retrying on any kind of exception seems too broad.
#5
Updated by Tom Clegg over 2 years ago
Lucas Di Pentima wrote:
- Is it worth writing a test for this issue?
I wish, but given that crunch1's days are numbered and a-r-p-i doesn't have a good test setup, I'd say no.
- Taking into account #3005, would it be convenient to only retry a defined number of times? Retrying on any kind of exception seems too broad.
I don't think it would be particularly convenient and I don't think it would be right either. For example, if the api server is down (even for a long time) I think the best behavior would be to keep trying until it comes up.
Perhaps worth noting that in general when a-r-p-i can't do anything it just exits without doing anything. The #3005 bug was that there are some error cases that indicate a pipeline is effectively deadlocked and should be cancelled. The problem here is just that the #3005 bugfix went too far and interpreted a transient network error as a sign of pipeline deadlock.
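To illustrate the "keep trying until it comes up" behavior, here is a rough sketch of an unbounded retry with capped exponential backoff. It is an assumption-laden illustration, not how a-r-p-i works (as noted above, a-r-p-i simply exits when it can't make progress). `retry_until_available`, its parameters, and the injectable `sleeper` are all hypothetical names.

```ruby
require 'timeout' # Timeout::Error; Net::ReadTimeout is a subclass of it

# Hypothetical helper: call the block until it succeeds.
# Only Timeout::Error and Errno::ECONNREFUSED are retried (with no limit
# on attempts); any other exception propagates to the caller immediately.
# The delay doubles after each failure, capped at max_delay seconds.
def retry_until_available(base_delay: 1, max_delay: 60, sleeper: Kernel.method(:sleep))
  delay = base_delay
  begin
    yield
  rescue Timeout::Error, Errno::ECONNREFUSED
    sleeper.call(delay)
    delay = [delay * 2, max_delay].min
    retry # re-run the begin block, i.e. call the caller's block again
  end
end
```

Restricting the rescue clause to specific transient error classes addresses the "any kind of exception seems too broad" concern without putting a cap on the number of attempts.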
#6
Updated by Lucas Di Pentima over 2 years ago
Thanks for the explanation, so this LGTM. Thanks.
#7
Updated by Tom Clegg over 2 years ago
- Status changed from In Progress to Feedback
#8
Updated by Tom Clegg over 2 years ago
- Target version changed from 2017-07-19 sprint to Arvados Future Sprints
- Story points changed from 0.5 to 0.0
#9
Updated by Tom Clegg over 2 years ago
- Status changed from Feedback to In Progress
Need to update source:services/api/Gemfile.lock
#10
Updated by Tom Clegg over 1 year ago
- Target version changed from Arvados Future Sprints to 2018-03-14 Sprint
- Story points deleted (0.0)
#11
Updated by Tom Morris over 1 year ago
- Release set to 17
#12
Updated by Tom Morris 9 months ago
- Status changed from In Progress to Resolved
I'm assuming this either got done or is now irrelevant.
Merge branch '11843-arpi-transient-error'
refs #11843
Arvados-DCO-1.1-Signed-off-by: Tom Clegg <tom@curoverse.com>