Bug #5500

closed

[Crunch] Detect temporary error conditions

Added by Peter Amstutz over 9 years ago. Updated over 9 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Story points: -

Description

tb05z-8i9sb-2e16dypy0eg7m59

2015-03-17_21:29:22 tb05z-8i9sb-2e16dypy0eg7m59 15615 50 stderr You are using pip version 6.0.6, however version 6.0.8 is available.
2015-03-17_21:29:22 tb05z-8i9sb-2e16dypy0eg7m59 15615 50 stderr You should consider upgrading via the 'pip install --upgrade pip' command.
2015-03-17_21:29:22 tb05z-8i9sb-2e16dypy0eg7m59 15615 50 stderr   Hash of the package https://pypi.python.org/packages/source/h/httplib2/httplib2-0.9.tar.gz#md5=09d8e8016911fc40e2e4c58f1aa3ec24 (from https://pypi.python.org/simple/httplib2/) (db87123118b60fecc4b91288e9f988c0) doesn't match the expected hash 09d8e8016911fc40e2e4c58f1aa3ec24!
2015-03-17_21:29:22 tb05z-8i9sb-2e16dypy0eg7m59 15615 50 stderr   Bad md5 hash for package https://pypi.python.org/packages/source/h/httplib2/httplib2-0.9.tar.gz#md5=09d8e8016911fc40e2e4c58f1aa3ec24 (from https://pypi.python.org/simple/httplib2/)
2015-03-17_21:29:22 tb05z-8i9sb-2e16dypy0eg7m59 15615 50 stderr /tmp/crunch-job-work/.arvados.venv/bin/pip --quiet install -I /tmp/crunch-job/opt/python failed (): exit 1 signal 0 at - line 198.
2015-03-17_21:29:23 tb05z-8i9sb-2e16dypy0eg7m59 15615 50 stderr srun: error: compute1: task 0: Exited with exit code 29
2015-03-17_21:29:23 tb05z-8i9sb-2e16dypy0eg7m59 15615 50 child 11234 on compute1.1 exit 29 success=
2015-03-17_21:29:23 tb05z-8i9sb-2e16dypy0eg7m59 15615 50 failure (#1, permanent) after 44 seconds
2015-03-17_23:59:15 tb05z-8i9sb-2e16dypy0eg7m59 15615 105 stderr srun: error: Task launch for 381.108 failed on node compute1: Communication connection failure
2015-03-17_23:59:15 tb05z-8i9sb-2e16dypy0eg7m59 15615 105 stderr srun: error: Application launch failed: Communication connection failure
2015-03-17_23:59:15 tb05z-8i9sb-2e16dypy0eg7m59 15615 105 stderr srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
2015-03-17_23:59:17 tb05z-8i9sb-2e16dypy0eg7m59 15615 105 stderr srun: error: Timed out waiting for job step to complete
2015-03-17_23:59:17 tb05z-8i9sb-2e16dypy0eg7m59 15615 105 child 13887 on compute1.16 exit 1 success=
2015-03-17_23:59:17 tb05z-8i9sb-2e16dypy0eg7m59 15615 105 failure (#1, permanent) after 7 seconds

Subtasks: 1 (0 open, 1 closed)

Task #5501: Review 5500-crunch-temporary-failure (Resolved, Peter Amstutz, 03/18/2015)

#1

Updated by Peter Amstutz over 9 years ago

  • Description updated (diff)
  • Category set to Crunch
  • Status changed from New to In Progress
  • Assigned To set to Peter Amstutz

#2

Updated by Peter Amstutz over 9 years ago

  • Target version changed from Bug Triage to 2015-04-01 sprint

#3

Updated by Tom Clegg over 9 years ago

At 3365d47
  • $tempfail should probably be called $exitcode: it doesn't actually signify tempfail unless it happens to be 111. If your intent is to make it more clear what 111 means, use constant TEMPFAIL => 111; might be better?
  • I'm not sure $code >> 8 is necessarily non-zero if the child is killed by a signal. Perhaps we should exit (($code >> 8) || 1) to make sure we never accidentally exit 0 here?
  • The "complain but don't exit" version of die is warn, not print STDERR: we should either use warn (which includes the line number in the install script, fwiw) or add a \n to the end of the message.

The slurm part lgtm. :)
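
The three review points above can be illustrated with a small Perl sketch. This is not the code from 3365d47 or a239e2d. It is a minimal standalone example, and the helper names `exit_code_for` and `is_tempfail` are hypothetical. It assumes, as the review does, that exit code 111 signals a temporary failure, and it uses the `TEMPFAIL` constant the review suggests.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Named constant as suggested in the review; 111 is the exit code the
# review treats as "temporary failure".
use constant TEMPFAIL => 111;

# Hypothetical helper: turn a wait status (the value of $? after system())
# into an exit code that is never 0 when the child failed.
sub exit_code_for {
    my ($status) = @_;
    return 0 if $status == 0;   # child succeeded
    return 1 if $status < 0;    # system() could not start the child
    # The high byte holds the child's exit code. It is 0 when the child was
    # killed by a signal, so fall back to 1 in that case.
    return (($status >> 8) || 1);
}

# Hypothetical helper: true only when the child itself exited with TEMPFAIL.
sub is_tempfail {
    my ($status) = @_;
    return 0 if $status < 0;
    return (($status >> 8) == TEMPFAIL) ? 1 : 0;
}

# Demo: run a child Perl process that exits with the tempfail code.
my $status = system($^X, '-e', 'exit 111');

# warn complains without exiting; with no trailing newline in the message,
# Perl appends the script name and line number.
warn "child failed (wait status $status)" if $status != 0;

printf "exit code %d, tempfail=%d\n",
    exit_code_for($status), is_tempfail($status);
```

A caller that wants to propagate the failure would end with `exit exit_code_for($status);`, which cannot return 0 for a child that was killed by a signal.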

#5

Updated by Tom Clegg over 9 years ago

a239e2d LGTM, thanks

#6

Updated by Peter Amstutz over 9 years ago

  • Status changed from In Progress to Resolved