Bug #9002


cwl-runner and Workbench disagree about job status

Added by Jiayong Li about 8 years ago. Updated almost 8 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
-
Category:
-
Target version:
-
Story points:
-

Files

snap_freebayes_HG01953.log_03 (6.9 KB) snap_freebayes_HG01953.log_03 su92l-d1hrv-8n97u75uovw7ayi Jiayong Li, 04/15/2016 03:57 PM
snap_freebayes_HG01953.log_02 (7.46 KB) snap_freebayes_HG01953.log_02 su92l-d1hrv-r5fuymr5gqt6ut8 Jiayong Li, 04/15/2016 03:57 PM
Actions #1

Updated by Jiayong Li about 8 years ago

I made two consecutive snap_freebayes runs: su92l-d1hrv-r5fuymr5gqt6ut8 (log snap_freebayes_HG01953.log_02) and su92l-d1hrv-8n97u75uovw7ayi (log snap_freebayes_HG01953.log_03).

I cancelled process_alignment_2 in the 02 run. In the 03 run, the log shows process_alignment_2 as Queued, but it never finishes.

$ grep process_alignment_2 snap_freebayes_HG01953.log_03
2016-04-14 18:41:15 arvados.cwl-runner[56945] INFO: Job process_alignment_2 (su92l-8i9sb-sh387d64wdv7umi) is Queued
INFO:arvados.cwl-runner:Job process_alignment_2 (su92l-8i9sb-sh387d64wdv7umi) is Queued

However, on Workbench, process_alignment_2 is marked as Complete, with a run time of 3h26m, significantly shorter than a usual snap alignment job (about 7h on average). I suspect the job never actually completed.

Actions #2

Updated by Brett Smith almost 8 years ago

  • Subject changed from [Workbench] cwl jobs incorrectly marked as Complete on workbench to cwl-runner and Workbench disagree about job status

Jiayong,

In both cases, I believe Workbench has the correct status. The problem is that cwl-runner's view got out of date: it couldn't fetch updated information about those jobs because the cluster was overloaded at the time. Both logs you attached repeat lines like:

WARNING:root:Retry #1 for request: GET https://su92l.arvadosapi.com/arvados/v1/logs?alt=json&limit=1&order=id+desc&filters=%5B%5B%22object_uuid%22%2C+%22is_a%22%2C+%22arvados%23job%22%5D%5D, following status: 504

This is the API request cwl-runner uses to monitor the progress of its component jobs. In both runs, however, the request never actually succeeded. Eventually it should have, at which point things would have proceeded normally, but we'll never know for sure since the process was interrupted.

The good news is that the requests were in fact retried correctly: they keep retrying all the way until the master process is interrupted.
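The retry behavior described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual cwl-runner code; the names (poll_with_retry, TransientError) are made up for the example:

```python
import time


class TransientError(Exception):
    """Stand-in for a transient HTTP 5xx failure, like the 504s in the logs."""


def poll_with_retry(fetch, max_retries=5, base_delay=0.01):
    """Call fetch(), retrying with exponential backoff on transient errors.

    fetch returns the response body on success or raises TransientError
    (e.g. when the API server responds 504 Gateway Timeout under load).
    Gives up and raises RuntimeError after max_retries attempts.
    """
    for attempt in range(max_retries):
        try:
            return fetch()
        except TransientError:
            # Server overloaded; back off before the next attempt.
            time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("gave up after %d retries" % max_retries)
```

In the runs attached here, the equivalent of fetch() never succeeded before the master process was interrupted, so the loop kept retrying, which matches the repeated "Retry #N" warnings in the logs.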

Given all this, I don't see an actionable bug here. There are certainly performance improvements we could make to behave better under load, but it will always be possible for cwl-runner to be temporarily unable to get information from the API server. Since that's the situation here, it looks like it behaved correctly.

Let me know whether you agree, or if you have any follow-up questions.

Actions #3

Updated by Jiayong Li almost 8 years ago

Thanks for following up! I'm convinced that it was the correct behavior.

Actions #4

Updated by Brett Smith almost 8 years ago

  • Status changed from New to Resolved