Bug #11843
Closed
Timeout attempting to create job for component gatk-haplotypecaller-cram-gvcf: Net::ReadTimeout
Description
Lately we get a lot of jobs whose sole log output is:
Timeout attempting to create job for component gatk-haplotypecaller-cram-gvcf: Net::ReadTimeout
Updated by Joshua Randall over 7 years ago
logs table entries for a pipeline instance that had this issue are attached (logs.txt).
select * from logs where object_uuid = 'z8ta6-d1hrv-lgof5ansiks2owh' order by created_at;
Updated by Tom Clegg over 7 years ago
It's mysterious that the error text doesn't match the most obvious bit of code that would produce such an error, i.e., source:sdk/cli/bin/arv-run-pipeline-instance -- even old versions.
debuglog "create job: #{j[:errors] rescue nil} with attributes #{body}", 0
msg = ""
j[:errors].each do |err|
  msg += "Error creating job for component #{component}: #{err}\n"
end
msg += "Job submission was: #{body.to_json}"
pipeline.log_stderr(msg)
Aside from that mystery, it seems like the fix for this is to have arv-run-pipeline-instance treat a timeout or 5xx response from the API as a transient error: follow the "can't make progress right now" code path instead of "fail pipeline".
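The proposed behavior could be sketched roughly as follows. This is only an illustration and not the actual arv-run-pipeline-instance code: the names `TRANSIENT_EXCEPTIONS`, `transient_failure?`, and `attempt_job_create`, as well as the `:retry_later` and `:fail_pipeline` outcomes, are hypothetical stand-ins for the "can't make progress right now" and "fail pipeline" code paths.

```ruby
require 'net/http'

# Hypothetical list of exceptions treated as transient; the set that the real
# fix handles may differ.
TRANSIENT_EXCEPTIONS = [
  Net::ReadTimeout,
  Net::OpenTimeout,
  Errno::ECONNREFUSED,
  Errno::ECONNRESET,
].freeze

# Returns true when a failed API call looks like a temporary condition
# (network timeout or 5xx response) rather than a rejected request.
def transient_failure?(error: nil, http_status: nil)
  return true if error && TRANSIENT_EXCEPTIONS.any? { |klass| error.is_a?(klass) }
  return true if http_status && (500..599).cover?(http_status)
  false
end

# Hypothetical wrapper: runs the block that performs the job-creation request
# and maps the outcome to one of three symbols.
def attempt_job_create
  yield
  :created
rescue StandardError => e
  transient_failure?(error: e) ? :retry_later : :fail_pipeline
end
```

With this shape, a caller would leave the pipeline untouched on `:retry_later` and try again on the next pass, and would only log the error and fail the pipeline on `:fail_pipeline`.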
Updated by Tom Clegg over 7 years ago
- Status changed from New to In Progress
- Assigned To set to Tom Clegg
- Target version set to 2017-07-19 sprint
- Story points set to 0.5
11843-arpi-transient-error
Updated by Lucas Di Pentima over 7 years ago
- Is it worth writing a test for this issue?
- Taking into account #3005, would it be convenient to only retry a defined number of times? Retrying on any kind of exception seems too broad.
Updated by Tom Clegg over 7 years ago
Lucas Di Pentima wrote:
- Is it worth writing a test for this issue?
I wish, but given that crunch1's days are numbered and a-r-p-i doesn't have a good test setup, I'd say no.
- Taking into account #3005, would it be convenient to only retry a defined number of times? Retrying on any kind of exception seems too broad.
I don't think it would be particularly convenient and I don't think it would be right either. For example, if the api server is down (even for a long time) I think the best behavior would be to keep trying until it comes up.
Perhaps worth noting that in general when a-r-p-i can't do anything it just exits without doing anything. The #3005 bug was that there are some error cases that indicate a pipeline is effectively deadlocked and should be cancelled. The problem here is just that the #3005 bugfix went too far and interpreted a transient network error as a sign of pipeline deadlock.
Updated by Lucas Di Pentima over 7 years ago
Thanks for the explanation, so this LGTM. Thanks.
Updated by Tom Clegg over 7 years ago
- Status changed from In Progress to Feedback
Updated by Tom Clegg over 7 years ago
- Target version changed from 2017-07-19 sprint to Arvados Future Sprints
- Story points changed from 0.5 to 0.0
Updated by Tom Clegg over 7 years ago
- Status changed from Feedback to In Progress
Need to update source:services/api/Gemfile.lock
Updated by Tom Clegg almost 7 years ago
- Target version changed from Arvados Future Sprints to 2018-03-14 Sprint
- Story points deleted (0.0)
Updated by Tom Morris almost 6 years ago
- Status changed from In Progress to Resolved
I'm assuming this either got done or is now irrelevant.