Bug #4303
closed
[Crunch] Output of job gets stuck at collate step
Added by Bryan Cosca about 10 years ago.
Updated about 10 years ago.
Description
This job (qr1hi-8i9sb-56vgstlp2wk56vn) ran for ~6h and while watching the logs it seemed to have stalled towards its end. After ~30 minutes of waiting (logs did not update in 30 minutes) I cancelled it in order to start over. This may be due to the pipeline instance being failed but the job still running.
- Target version set to Bug Triage
- Subject changed from Output of job gets stuck at collate step to [Crunch] Output of job gets stuck at collate step
- Story points set to 0.5
Two things happened here:
- The Docker image used has an older Python SDK that does not record Keep's authorization signatures. Fixing that is as easy as preparing an updated image with a newer Python SDK.
- Crunch's retry logic kicked in. The retry logic in the Python SDK is smarter about distinguishing temporary and permanent failures, so when we port crunch-job to Python, this issue will be resolved more or less automatically. In the meantime, since crunch-job is willing to retry for quite a while, we should probably log when this happens so the user at least gets a hint about what's going on.
- Category set to Crunch
- Status changed from New to In Progress
- Assigned To set to Brett Smith
- Target version changed from Bug Triage to 2014-11-19 sprint
4303-crunch-retry-logging-wip is up for review, to handle the quick fix suggested above, until we port to Python.
The +x bits got removed:
old mode 100755
new mode 100644
After fixing the mode, I get this error:
&main::retry_callback called at /home/peter/work/arvados/sdk/cli/bin/crunch-job line 1671
Peter Amstutz wrote:
After fixing the mode, I get this error:
Both fixed in 625ac982 and ready for another look. Thanks.
As it happens, I was able (by accident) to get an error that triggered the retry, the way it gets printed out is a little funny:
4n8aq-8i9sb-6l22ggrmzcj2f8t ! 10209 API method users/current failed: API call /users/current failed: 401 Unauthorized
4n8aq-8i9sb-6l22ggrmzcj2f8t ! 10209 {"errors":["Not logged in"],"error_token":"1415124192+b4338adf"} at /home/peter/work/arvados/sdk/cli/bin/crunch-job line 1697.
4n8aq-8i9sb-6l22ggrmzcj2f8t ! 10209 . Retrying.
I think it might be better to print something each time it retries, along with a message of how long it is going to wait until the next try. Telling the user that it is starting to retry is better than nothing at all, but if it spends 5 minutes doing retries with no further messages, the user has no idea if crunch-job is still making an effort or is completely deadlocked.
d5e7371a strips out the Perl line information, adds retry time information, and normalizes the whitespace. Ready for another look.
$retry_msg
should be $next_retry_at
here:
my $next_try_fmt = strftime("%Y-%m-%d %H:%M:%S", $retry_msg);
Personally I prefer "retrying in N minutes/seconds" rather than specifying an exact time (log lines are timestamped already so the user can do the math) but that's not such a big deal.
The rest of it LGTM.
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:f91dff0f2f49a83047e31b1a56dbddd85be6926a.
Also available in: Atom
PDF