Bug #11843

closed

Timeout attempting to create job for component gatk-haplotypecaller-cram-gvcf: Net::ReadTimeout

Added by Joshua Randall over 7 years ago. Updated over 5 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Story points: -
Release relationship: Auto

Description

Lately we get a lot of jobs whose sole log output is:

Timeout attempting to create job for component gatk-haplotypecaller-cram-gvcf: Net::ReadTimeout

Files

logs.txt (21 KB), Joshua Randall, 06/15/2017 03:41 PM

Subtasks: 1 (0 open, 1 closed)

Task #11932: Review 11843-arpi-transient-error (Resolved, Tom Clegg, 06/12/2017)

Related issues

Related to Arvados - Bug #3005: arv-run-pipeline-instance should not keep trying to create jobs when the API server returns an error on job creation (Resolved)
Actions #1

Updated by Joshua Randall over 7 years ago

logs table entries for a pipeline instance that had this issue are attached (logs.txt).

select * from logs where object_uuid = 'z8ta6-d1hrv-lgof5ansiks2owh' order by created_at;
Actions #2

Updated by Tom Clegg over 7 years ago

It's mysterious that the error text doesn't match the most obvious bit of code that would produce such an error, i.e., source:sdk/cli/bin/arv-run-pipeline-instance -- even old versions.

      debuglog "create job: #{j[:errors] rescue nil} with attributes #{body}", 0

      msg = "" 
      j[:errors].each do |err|
        msg += "Error creating job for component #{component}: #{err}\n" 
      end
      msg += "Job submission was: #{body.to_json}" 

      pipeline.log_stderr(msg)

Aside from that mystery, it seems like the fix for this is to have arv-run-pipeline-instance treat a timeout or 5xx response from the API as a transient error: follow the "can't make progress right now" code path instead of "fail pipeline".
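
To make the proposed behavior concrete, here is a minimal sketch of classifying a job-creation attempt as transient or fatal. It is not the code from the 11843-arpi-transient-error branch: the helper name classify_job_creation, the :status key, and the three return symbols are assumptions made up for illustration. Only the :errors key and Net::ReadTimeout come from the snippets and log output above.

```ruby
require 'net/http'  # loads Net::ReadTimeout and Net::OpenTimeout

# Hypothetical helper, not part of arv-run-pipeline-instance.
# Runs the block that submits the job and classifies the outcome:
#   :ok        - job created
#   :transient - timeout or 5xx; can't make progress right now, try later
#   :fatal     - the API server rejected the job; fail the pipeline
# The response is assumed to be a Hash with :status and :errors keys
# (the :status key is an assumption for this sketch).
def classify_job_creation
  response = yield
  status = response[:status].to_i
  if status >= 500
    :transient
  elsif response[:errors] && !response[:errors].empty?
    :fatal
  else
    :ok
  end
rescue Net::ReadTimeout, Net::OpenTimeout
  # A network timeout says nothing about whether the job itself is valid.
  :transient
end
```

With something like this, the caller would leave the component unstarted on :transient and pick it up again on a later pass, and would only log the "Error creating job for component" message and fail the pipeline on :fatal.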

Actions #3

Updated by Tom Clegg about 7 years ago

  • Status changed from New to In Progress
  • Assigned To set to Tom Clegg
  • Target version set to 2017-07-19 sprint
  • Story points set to 0.5

11843-arpi-transient-error

Actions #4

Updated by Lucas Di Pentima about 7 years ago

  • Is it worth writing a test for this issue?
  • Taking into account #3005, would it be convenient to only retry a defined number of times? Retrying on any kind of exception seems too broad
Actions #5

Updated by Tom Clegg about 7 years ago

Lucas Di Pentima wrote:

  • Is it worth writing a test for this issue?

I wish, but given that crunch1's days are numbered and a-r-p-i doesn't have a good test setup, I'd say no.

  • Taking into account #3005, would it be convenient to only retry a defined number of times? Retrying on any kind of exception seems too broad

I don't think it would be particularly convenient and I don't think it would be right either. For example, if the api server is down (even for a long time) I think the best behavior would be to keep trying until it comes up.

Perhaps worth noting that in general when a-r-p-i can't do anything it just exits without doing anything. The #3005 bug was that there are some error cases that indicate a pipeline is effectively deadlocked and should be cancelled. The problem here is just that the #3005 bugfix went too far and interpreted a transient network error as a sign of pipeline deadlock.

Actions #6

Updated by Lucas Di Pentima about 7 years ago

Thanks for the explanation, so this LGTM. Thanks.

Actions #7

Updated by Tom Clegg about 7 years ago

  • Status changed from In Progress to Feedback
Actions #8

Updated by Tom Clegg about 7 years ago

  • Target version changed from 2017-07-19 sprint to Arvados Future Sprints
  • Story points changed from 0.5 to 0.0
Actions #9

Updated by Tom Clegg about 7 years ago

  • Status changed from Feedback to In Progress
Actions #10

Updated by Tom Clegg over 6 years ago

  • Target version changed from Arvados Future Sprints to 2018-03-14 Sprint
  • Story points deleted (0.0)
Actions #11

Updated by Tom Morris about 6 years ago

  • Release set to 17
Actions #12

Updated by Tom Morris over 5 years ago

  • Status changed from In Progress to Resolved

I'm assuming this either got done or is now irrelevant.
