Project

General

Profile

Actions

Bug #4303

closed

[Crunch] Output of job gets stuck at collate step

Added by Bryan Cosca over 9 years ago. Updated over 9 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Crunch
Target version:
Story points:
0.5

Description

This job (qr1hi-8i9sb-56vgstlp2wk56vn) ran for ~6h and while watching the logs it seemed to have stalled towards its end. After ~30 minutes of waiting (logs did not update in 30 minutes) I cancelled it in order to start over. This may be due to the pipeline instance being failed but the job still running.


Subtasks 1 (0 open1 closed)

Task #4385: Review 4303-crunch-retry-logging-wipResolvedPeter Amstutz10/31/2014Actions

Related issues

Has duplicate Arvados - Bug #4567: [Crunch] job dies but is left in state "locked"Closed11/17/2014Actions
Actions #1

Updated by Radhika Chippada over 9 years ago

  • Target version set to Bug Triage
Actions #2

Updated by Ward Vandewege over 9 years ago

  • Subject changed from Output of job gets stuck at collate step to [Crunch] Output of job gets stuck at collate step
  • Story points set to 0.5
Actions #3

Updated by Brett Smith over 9 years ago

Two things happened here:

  • The Docker image used has an older Python SDK that does not record Keep's authorization signatures. Fixing that is as easy as preparing an updated image with a newer Python SDK.
  • Crunch's retry logic kicked in. The retry logic in the Python SDK is smarter about distinguishing temporary and permanent failures, so when we port crunch-job to Python, this issue will be resolved more or less automatically. In the meantime, since crunch-job is willing to retry for quite a while, we should probably log when this happens so the user at least gets a hint about what's going on.
Actions #4

Updated by Brett Smith over 9 years ago

  • Category set to Crunch
  • Status changed from New to In Progress
  • Assigned To set to Brett Smith
  • Target version changed from Bug Triage to 2014-11-19 sprint

4303-crunch-retry-logging-wip is up for review, to handle the quick fix suggested above, until we port to Python.

Actions #5

Updated by Peter Amstutz over 9 years ago

The +x bits got removed:

old mode 100755
new mode 100644

After fixing the mode, I get this error:

&main::retry_callback called at /home/peter/work/arvados/sdk/cli/bin/crunch-job line 1671
Actions #6

Updated by Brett Smith over 9 years ago

Peter Amstutz wrote:

After fixing the mode, I get this error:

Both fixed in 625ac982 and ready for another look. Thanks.

Actions #7

Updated by Peter Amstutz over 9 years ago

As it happens, I was able (by accident) to get an error that triggered the retry, the way it gets printed out is a little funny:

4n8aq-8i9sb-6l22ggrmzcj2f8t !  10209  API method users/current failed: API call /users/current failed: 401 Unauthorized
4n8aq-8i9sb-6l22ggrmzcj2f8t !  10209  {"errors":["Not logged in"],"error_token":"1415124192+b4338adf"} at /home/peter/work/arvados/sdk/cli/bin/crunch-job line 1697.
4n8aq-8i9sb-6l22ggrmzcj2f8t !  10209  . Retrying.

I think it might be better to print something each time it retries, along with a message of how long it is going to wait until the next try. Telling the user that it is starting to retry is better than nothing at all, but if it spends 5 minutes doing retries with no further messages, the user has no idea if crunch-job is still making an effort or is completely deadlocked.

Actions #8

Updated by Brett Smith over 9 years ago

d5e7371a strips out the Perl line information, adds retry time information, and normalizes the whitespace. Ready for another look.

Actions #9

Updated by Peter Amstutz over 9 years ago

$retry_msg should be $next_retry_at here:

my $next_try_fmt = strftime("%Y-%m-%d %H:%M:%S", $retry_msg);

Personally I prefer "retrying in N minutes/seconds" rather than specifying an exact time (log lines are timestamped already so the user can do the math) but that's not such a big deal.

The rest of it LGTM.

Actions #10

Updated by Brett Smith over 9 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:f91dff0f2f49a83047e31b1a56dbddd85be6926a.

Actions

Also available in: Atom PDF