Bug #11843

Timeout attempting to create job for component gatk-haplotypecaller-cram-gvcf: Net::ReadTimeout

Added by Joshua Randall 11 months ago. Updated about 1 month ago.

Status: In Progress
Priority: Normal
Assigned To:
Category: -
Target version:
Start date: 06/12/2017
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -

Description

Lately we get a lot of jobs whose sole log output is:

Timeout attempting to create job for component gatk-haplotypecaller-cram-gvcf: Net::ReadTimeout
logs.txt (21 KB) Joshua Randall, 06/15/2017 03:41 PM

Subtasks

Task #11932: Review 11843-arpi-transient-error (Resolved, Tom Clegg)


Related issues

Related to Arvados - Bug #3005: arv-run-pipeline-instance should not keep trying to create jobs when the API server returns an error on job creation (Resolved)

Associated revisions

Revision e2b3986e
Added by Tom Clegg 10 months ago

Merge branch '11843-arpi-transient-error'

refs #11843

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

History

#1 Updated by Joshua Randall 11 months ago

logs table entries for a pipeline instance that had this issue are attached (logs.txt).

select * from logs where object_uuid = 'z8ta6-d1hrv-lgof5ansiks2owh' order by created_at;

#2 Updated by Tom Clegg 10 months ago

It's mysterious that the error text doesn't match the most obvious bit of code that would produce such an error, i.e., source:sdk/cli/bin/arv-run-pipeline-instance -- even old versions.

      debuglog "create job: #{j[:errors] rescue nil} with attributes #{body}", 0

      msg = ""
      j[:errors].each do |err|
        msg += "Error creating job for component #{component}: #{err}\n"
      end
      msg += "Job submission was: #{body.to_json}"

      pipeline.log_stderr(msg)

Aside from that mystery, it seems like the fix for this is to have arv-run-pipeline-instance treat a timeout or 5xx response from the API as a transient error: follow the "can't make progress right now" code path instead of "fail pipeline".
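The distinction proposed above could be sketched roughly as follows. This is an illustration only, not the change merged in 11843-arpi-transient-error: the helper name `classify_job_create_result`, the `TRANSIENT_EXCEPTIONS` list, and the `:ok` / `:transient` / `:fatal` symbols are all hypothetical, and treating every non-timeout exception and every 4xx response as fatal is an assumption made for the example.

```ruby
require 'net/http'

# Hypothetical list of exceptions that suggest the API server is only
# temporarily unreachable (assumption: not taken from arv-run-pipeline-instance).
TRANSIENT_EXCEPTIONS = [
  Net::ReadTimeout,
  Net::OpenTimeout,
  Errno::ECONNREFUSED
].freeze

# Classify the outcome of a job-creation attempt.
#   :ok        -> the job was created
#   :transient -> "can't make progress right now"; try again later
#   :fatal     -> the request itself was rejected; fail the pipeline
def classify_job_create_result(http_status: nil, error: nil)
  if error
    # Timeouts and refused connections say nothing about the pipeline itself.
    return :transient if TRANSIENT_EXCEPTIONS.any? { |klass| error.is_a?(klass) }
    return :fatal
  end
  # 5xx means the server had a problem, so the same request may succeed later.
  return :transient if http_status && (500..599).cover?(http_status)
  # Other error statuses (e.g. 422) mean the server rejected this request.
  return :fatal if http_status && http_status >= 400
  :ok
end
```

With a classifier like this, a `Net::ReadTimeout` or a 503 response would leave the pipeline untouched for a later attempt, while a 422 would still take the "fail pipeline" path that #3005 introduced.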

#3 Updated by Tom Clegg 10 months ago

  • Status changed from New to In Progress
  • Assigned To set to Tom Clegg
  • Target version set to 2017-07-19 sprint
  • Story points set to 0.5

11843-arpi-transient-error

#4 Updated by Lucas Di Pentima 10 months ago

  • Is it worth writing a test for this issue?
  • Taking into account #3005, would it be convenient to only retry a defined number of times? Retrying on any kind of exception seems too broad

#5 Updated by Tom Clegg 10 months ago

Lucas Di Pentima wrote:

  • Is it worth writing a test for this issue?

I wish, but given that crunch1's days are numbered and a-r-p-i doesn't have a good test setup, I'd say no.

  • Taking into account #3005, would it be convenient to only retry a defined number of times? Retrying on any kind of exception seems too broad

I don't think it would be particularly convenient and I don't think it would be right either. For example, if the API server is down (even for a long time) I think the best behavior would be to keep trying until it comes up.

Perhaps worth noting that in general when a-r-p-i can't do anything it just exits without doing anything. The #3005 bug was that there are some error cases that indicate a pipeline is effectively deadlocked and should be cancelled. The problem here is just that the #3005 bugfix went too far and interpreted a transient network error as a sign of pipeline deadlock.

#6 Updated by Lucas Di Pentima 10 months ago

Thanks for the explanation, so this LGTM. Thanks.

#7 Updated by Tom Clegg 10 months ago

  • Status changed from In Progress to Feedback

#8 Updated by Tom Clegg 9 months ago

  • Target version changed from 2017-07-19 sprint to Arvados Future Sprints
  • Story points changed from 0.5 to 0.0

#9 Updated by Tom Clegg 8 months ago

  • Status changed from Feedback to In Progress

#10 Updated by Tom Clegg about 1 month ago

  • Target version changed from Arvados Future Sprints to 2018-03-14 Sprint
  • Story points deleted (0.0)
