Bug #11843

Timeout attempting to create job for component gatk-haplotypecaller-cram-gvcf: Net::ReadTimeout

Added by Joshua Randall 5 months ago. Updated 3 months ago.

Status: In Progress
Start date: 06/12/2017
Priority: Normal
Due date:
Assignee: Tom Clegg
% Done: 100%
Category: -
Target version: Arvados Future Sprints
Story points: S
Remaining (hours): 0.00 hour
Velocity based estimate: -

Description

Lately we get a lot of jobs whose sole log output is:

Timeout attempting to create job for component gatk-haplotypecaller-cram-gvcf: Net::ReadTimeout

logs.txt (21 KB) Joshua Randall, 06/15/2017 03:41 pm


Subtasks

Task #11932: Review 11843-arpi-transient-error Resolved Tom Clegg


Related issues

Related to Arvados - Bug #3005: arv-run-pipeline-instance should not keep trying to creat... Resolved

Associated revisions

Revision e2b3986e
Added by Tom Clegg 5 months ago

Merge branch '11843-arpi-transient-error'

refs #11843

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

History

#1 Updated by Joshua Randall 5 months ago

logs table entries for a pipeline instance that had this issue are attached (logs.txt).

select * from logs where object_uuid = 'z8ta6-d1hrv-lgof5ansiks2owh' order by created_at;

#2 Updated by Tom Clegg 5 months ago

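It's mysterious that the error text doesn't match the most obvious bit of code that would produce such an error, i.e., source:sdk/cli/bin/arv-run-pipeline-instance, even in old versions. That code logs "Error creating job for component ...", not "Timeout attempting to create job for component ...":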

      debuglog "create job: #{j[:errors] rescue nil} with attributes #{body}", 0

      msg = "" 
      j[:errors].each do |err|
        msg += "Error creating job for component #{component}: #{err}\n" 
      end
      msg += "Job submission was: #{body.to_json}" 

      pipeline.log_stderr(msg)

Aside from that mystery, it seems like the fix for this is to have arv-run-pipeline-instance treat a timeout or 5xx response from the API as a transient error: follow the "can't make progress right now" code path instead of "fail pipeline".
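
A minimal sketch of that classification, for illustration only. It is not taken from arv-run-pipeline-instance: the helper name `classify_job_create`, its keyword arguments, and the `:ok` / `:transient` / `:fatal` symbols are all hypothetical. It assumes the caller has either an HTTP status code or the exception raised by the request.

```ruby
require 'net/http' # defines Net::ReadTimeout and Net::OpenTimeout

# Exceptions treated as "can't make progress right now".
# This list is an assumption; the real client may raise others.
TRANSIENT_ERRORS = [Net::ReadTimeout, Net::OpenTimeout].freeze

# Hypothetical helper: classify one attempt to create a job.
#   status - HTTP status code as an Integer, or nil if the request raised
#   error  - the exception raised by the request, or nil
# Returns :ok, :transient, or :fatal.
def classify_job_create(status: nil, error: nil)
  # A timeout says nothing about the pipeline itself, so retry later.
  return :transient if error && TRANSIENT_ERRORS.any? { |klass| error.is_a?(klass) }
  # Any other exception is treated as a real failure in this sketch.
  return :fatal if error
  # 5xx means the API server had a problem, not the request.
  return :transient if status && (500..599).cover?(status)
  return :ok if status && (200..299).cover?(status)
  # Everything else (for example a 4xx response) fails the pipeline.
  :fatal
end
```

With a helper like this, the caller would follow the "fail pipeline" path only on `:fatal` and leave the pipeline untouched on `:transient`, so a later run can try again.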

#3 Updated by Tom Clegg 5 months ago

  • Status changed from New to In Progress
  • Assignee set to Tom Clegg
  • Target version set to 2017-07-19 sprint
  • Story points set to 0.5

11843-arpi-transient-error

#4 Updated by Lucas Di Pentima 5 months ago

  • Is it worth writing a test for this issue?
  • Taking into account #3005, would it be convenient to only retry a defined number of times? Retrying on any kind of exception seems too broad.

#5 Updated by Tom Clegg 5 months ago

Lucas Di Pentima wrote:

  • Is it worth writing a test for this issue?

I wish, but given that crunch1's days are numbered and a-r-p-i doesn't have a good test setup, I'd say no.

  • Taking into account #3005, would it be convenient to only retry a defined number of times? Retrying on any kind of exception seems too broad

I don't think it would be particularly convenient and I don't think it would be right either. For example, if the api server is down (even for a long time) I think the best behavior would be to keep trying until it comes up.

Perhaps worth noting that in general when a-r-p-i can't do anything it just exits without doing anything. The #3005 bug was that there are some error cases that indicate a pipeline is effectively deadlocked and should be cancelled. The problem here is just that the #3005 bugfix went too far and interpreted a transient network error as a sign of pipeline deadlock.

#6 Updated by Lucas Di Pentima 5 months ago

Thanks for the explanation. This LGTM.

#7 Updated by Tom Clegg 5 months ago

  • Status changed from In Progress to Feedback

#8 Updated by Tom Clegg 4 months ago

  • Target version changed from 2017-07-19 sprint to Arvados Future Sprints
  • Story points changed from 0.5 to 0.0

#9 Updated by Tom Clegg 3 months ago

  • Status changed from Feedback to In Progress
