Bug #5500

[Crunch] Detect temporary error conditions

Added by Peter Amstutz over 5 years ago. Updated over 5 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Start date: 03/18/2015
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -

Description

tb05z-8i9sb-2e16dypy0eg7m59

2015-03-17_21:29:22 tb05z-8i9sb-2e16dypy0eg7m59 15615 50 stderr You are using pip version 6.0.6, however version 6.0.8 is available.
2015-03-17_21:29:22 tb05z-8i9sb-2e16dypy0eg7m59 15615 50 stderr You should consider upgrading via the 'pip install --upgrade pip' command.
2015-03-17_21:29:22 tb05z-8i9sb-2e16dypy0eg7m59 15615 50 stderr   Hash of the package https://pypi.python.org/packages/source/h/httplib2/httplib2-0.9.tar.gz#md5=09d8e8016911fc40e2e4c58f1aa3ec24 (from https://pypi.python.org/simple/httplib2/) (db87123118b60fecc4b91288e9f988c0) doesn't match the expected hash 09d8e8016911fc40e2e4c58f1aa3ec24!
2015-03-17_21:29:22 tb05z-8i9sb-2e16dypy0eg7m59 15615 50 stderr   Bad md5 hash for package https://pypi.python.org/packages/source/h/httplib2/httplib2-0.9.tar.gz#md5=09d8e8016911fc40e2e4c58f1aa3ec24 (from https://pypi.python.org/simple/httplib2/)
2015-03-17_21:29:22 tb05z-8i9sb-2e16dypy0eg7m59 15615 50 stderr /tmp/crunch-job-work/.arvados.venv/bin/pip --quiet install -I /tmp/crunch-job/opt/python failed (): exit 1 signal 0 at - line 198.
2015-03-17_21:29:23 tb05z-8i9sb-2e16dypy0eg7m59 15615 50 stderr srun: error: compute1: task 0: Exited with exit code 29
2015-03-17_21:29:23 tb05z-8i9sb-2e16dypy0eg7m59 15615 50 child 11234 on compute1.1 exit 29 success=
2015-03-17_21:29:23 tb05z-8i9sb-2e16dypy0eg7m59 15615 50 failure (#1, permanent) after 44 seconds
2015-03-17_23:59:15 tb05z-8i9sb-2e16dypy0eg7m59 15615 105 stderr srun: error: Task launch for 381.108 failed on node compute1: Communication connection failure
2015-03-17_23:59:15 tb05z-8i9sb-2e16dypy0eg7m59 15615 105 stderr srun: error: Application launch failed: Communication connection failure
2015-03-17_23:59:15 tb05z-8i9sb-2e16dypy0eg7m59 15615 105 stderr srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
2015-03-17_23:59:17 tb05z-8i9sb-2e16dypy0eg7m59 15615 105 stderr srun: error: Timed out waiting for job step to complete
2015-03-17_23:59:17 tb05z-8i9sb-2e16dypy0eg7m59 15615 105 child 13887 on compute1.16 exit 1 success=
2015-03-17_23:59:17 tb05z-8i9sb-2e16dypy0eg7m59 15615 105 failure (#1, permanent) after 7 seconds
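
Both failures in the log above are reported as permanent, although a package hash mismatch and an srun connection failure look like conditions that could clear on a retry. A minimal sketch of pattern-based detection follows, assuming the two patterns below (copied from the quoted stderr lines) are the ones worth treating as temporary. The function name and pattern list are hypothetical and are not taken from the merged change.

```python
import re

# Hypothetical patterns, copied from the stderr lines quoted in this ticket.
# Whether each one really indicates a temporary condition is an assumption.
TEMPORARY_PATTERNS = [
    re.compile(r"srun: error: .*Communication connection failure"),
    re.compile(r"Bad md5 hash for package"),
]


def is_temporary_failure(stderr_lines):
    """Return True if any stderr line matches one of the assumed patterns."""
    return any(
        pattern.search(line)
        for line in stderr_lines
        for pattern in TEMPORARY_PATTERNS
    )
```

A task whose stderr matches could then be retried instead of being counted as a permanent failure; an ordinary non-zero exit with no matching line would still be treated as permanent.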

Subtasks

Task #5501: Review 5500-crunch-temporary-failure (Resolved, Peter Amstutz)

Associated revisions

Revision 1ed0df4d
Added by Peter Amstutz over 5 years ago

Merge branch '5500-crunch-temporary-failure' refs #5500

History

#1 Updated by Peter Amstutz over 5 years ago

  • Description updated (diff)
  • Category set to Crunch
  • Status changed from New to In Progress
  • Assigned To set to Peter Amstutz

#2 Updated by Peter Amstutz over 5 years ago

  • Target version changed from Bug Triage to 2015-04-01 sprint

#3 Updated by Tom Clegg over 5 years ago

At 3365d47
  • $tempfail should probably be called $exitcode: it doesn't actually signify tempfail unless it happens to be 111. If your intent is to make it more clear what 111 means, use constant TEMPFAIL => 111; might be better?
  • I'm not sure $code >> 8 is necessarily non-zero if the child is killed by a signal. Perhaps we should exit (($code >> 8) || 1) to make sure we never accidentally exit 0 here?
  • The "complain but don't exit" version of die is warn, not print STDERR: we should either use warn (which includes the line number in the install script, fwiw) or add a \n to the end of the message.

The slurm part lgtm. :)
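
The first two review points can be illustrated with a short sketch. The script under review is Perl; Python is used here only to show the arithmetic, and the function names are hypothetical. It assumes the traditional 16-bit wait status layout that Perl exposes in $?: exit code in the high byte, terminating signal in the low 7 bits.

```python
# Named constant for the temporary-failure exit code, as the review suggests.
TEMPFAIL = 111


def exit_code_for_failed_child(wait_status):
    """Derive a never-zero exit code from a failed child's 16-bit wait status.

    If the child was killed by a signal, the high byte is 0, so
    (wait_status >> 8) alone would yield 0; fall back to 1 in that case.
    """
    return (wait_status >> 8) or 1


def terminating_signal(wait_status):
    """Return the signal number that killed the child, or 0 if it exited."""
    return wait_status & 0x7F


def is_tempfail(exit_code):
    """Only exit code 111 signifies a temporary failure."""
    return exit_code == TEMPFAIL
```

For example, a child killed by signal 9 has wait status 9: shifting right by 8 gives 0, so the fallback makes the parent exit 1 rather than accidentally reporting success.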

#5 Updated by Tom Clegg over 5 years ago

a239e2d LGTM, thanks

#6 Updated by Peter Amstutz over 5 years ago

  • Status changed from In Progress to Resolved
