Bug #4494

[Crunch] Do more error-checking and show more diagnostic info when installing a Docker image

Added by Tim Pierce over 4 years ago. Updated over 4 years ago.

Status: New
Priority: Normal
Assigned To: -
Category: Crunch
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Story points: 1.0

Description

Job qr1hi-8i9sb-vx84guvzp3xvwgz failed with this diagnostic output:

11/10/2014 5:05:18 PM    crunch    Version 83a9390a05bbffc2e4ea95dd693af3ab3547fa12 is commit 83a9390a05bbffc2e4ea95dd693af3ab3547fa12
11/10/2014 5:05:18 PM    crunch    Run install script on all workers
11/10/2014 5:05:18 PM    crunch    Install script exited 1
11/10/2014 5:05:18 PM    crunch    Installing Docker image from 0b1b526683d86c41696eea9353ab5807+4242 exited 1 at /usr/local/arvados/src/sdk/cli/bin/crunch-job line 603

Related issues

Related to Arvados - Bug #4482: [Crunch] Diagnostics failure "user not found on host" (Resolved)

History

#1 Updated by Tim Pierce over 4 years ago

  • Description updated
  • Category set to Crunch

#2 Updated by Bryan Cosca over 4 years ago

More examples: qr1hi-8i9sb-kryvvban6b7hj74 qr1hi-8i9sb-wdv358fgjhh8fsa

This is starting to be a small annoyance with Arvados :( It's not a super big deal because I can just re-run the job, but I feel that new users would get discouraged by this bug.

#3 Updated by Tom Clegg over 4 years ago

  • Subject changed from [Crunch] job fails to install Docker image to [Crunch] Do more error-checking and show more diagnostic info when installing a Docker image
  • Story points set to 1.0

#4 Updated by Ward Vandewege over 4 years ago

  • Target version changed from Bug Triage to Arvados Future Sprints

#5 Updated by Bryan Cosca over 4 years ago

This is starting to be bothersome. I've been trying to rerun this same job and I keep getting the error:

2ecb1a2a2b3574fb5c7fc0b1262c6c8c+83/qr1hi-8i9sb-lhf4q05ykxx6yzi.log.txt
2c7ad84b2506214b58f19e98f48d0731+83/qr1hi-8i9sb-fvcnteaq0z7xngb.log.txt
7163ba9c982dedb701a8b8c269bcd7ff+83/qr1hi-8i9sb-uqh0plj8644g9mp.log.txt

#6 Updated by Bryan Cosca over 4 years ago

The jobs in my previous update ran successfully the next day. Today this one is bothering me: I've run it twice and ran into this issue:

qr1hi-8i9sb-vfrtoez7tgmybv0#Log
qr1hi-8i9sb-wl1oh0bbw072acd#Log

#7 Updated by Tim Pierce over 4 years ago

  • Target version changed from Arvados Future Sprints to 2014-11-19 sprint

#8 Updated by Tim Pierce over 4 years ago

The logs suggest that this is related to #4482 (though it may be a red herring):

2014-11-17_16:25:59.30230 qr1hi-8i9sb-vfrtoez7tgmybv0 23751  Install script exited 1
2014-11-17_16:25:59.30237 qr1hi-8i9sb-vfrtoez7tgmybv0 ! srun: error: Task launch for 8476.2 failed on node compute9: User not found on host
2014-11-17_16:25:59.30246 qr1hi-8i9sb-vfrtoez7tgmybv0 ! srun: error: Application launch failed: User not found on host
2014-11-17_16:25:59.30253 qr1hi-8i9sb-vfrtoez7tgmybv0 ! srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
2014-11-17_16:25:59.30261 qr1hi-8i9sb-vfrtoez7tgmybv0 ! slurmd[compute20]: error: *** STEP 8476.2 KILLED AT 2014-11-17T16:25:59 WITH SIGNAL 9 ***
2014-11-17_16:25:59.30270 qr1hi-8i9sb-vfrtoez7tgmybv0 ! slurmd[compute29]: error: *** STEP 8476.2 KILLED AT 2014-11-17T16:25:59 WITH SIGNAL 9 ***
2014-11-17_16:25:59.30277 qr1hi-8i9sb-vfrtoez7tgmybv0 ! slurmd[compute18]: error: *** STEP 8476.2 KILLED AT 2014-11-17T16:25:59 WITH SIGNAL 9 ***
2014-11-17_16:25:59.30288 qr1hi-8i9sb-vfrtoez7tgmybv0 ! slurmd[compute6]: error: *** STEP 8476.2 KILLED AT 2014-11-17T16:25:59 WITH SIGNAL 9 ***
2014-11-17_16:25:59.30300 qr1hi-8i9sb-vfrtoez7tgmybv0 ! srun: error: compute29: task 4: Killed
2014-11-17_16:25:59.30307 qr1hi-8i9sb-vfrtoez7tgmybv0 ! srun: error: compute20: task 3: Killed
2014-11-17_16:25:59.30315 qr1hi-8i9sb-vfrtoez7tgmybv0 ! srun: error: compute6: task 0: Killed
2014-11-17_16:25:59.30323 qr1hi-8i9sb-vfrtoez7tgmybv0 ! srun: error: compute18: task 2: Killed
2014-11-17_16:26:01.00771 qr1hi-8i9sb-vfrtoez7tgmybv0 ! srun: error: Timed out waiting for job step to complete
2014-11-17_16:26:01.33365 qr1hi-8i9sb-vfrtoez7tgmybv0 23751  Installing Docker image from 0b1b526683d86c41696eea9353ab5807+4242 exited 1 at /usr/local/arvados/src/sdk/cli/bin/crunch-job line 603

#9 Updated by Bryan Cosca over 4 years ago

If it helps, I've found more jobs:

qr1hi-8i9sb-eipb3kuyac7absp#Log
qr1hi-8i9sb-5cb9i2keadpevcq#Log

#10 Updated by Tim Pierce over 4 years ago

These jobs are failing for a variety of reasons, some of which look like outright compute node failures and some of which may not be:

2014-11-17_18:25:22.70437 qr1hi-8i9sb-5cb9i2keadpevcq ! srun: error: Unable to resolve "compute28": Unknown host
2014-11-17_18:25:22.77063 qr1hi-8i9sb-eipb3kuyac7absp ! srun: error: Unable to create job step: Memory required by task is not available
2014-11-17_16:15:55.51721 qr1hi-8i9sb-wl1oh0bbw072acd ! srun: error: Task launch for 8473.0 failed on node compute9: User not found on host
2014-11-17_16:25:56.09087 qr1hi-8i9sb-vfrtoez7tgmybv0 ! srun: error: Task launch for 8476.0 failed on node compute9: User not found on host
2014-11-12_21:00:42.45803 qr1hi-8i9sb-kryvvban6b7hj74 ! srun: error: Task launch for 8344.0 failed on node compute3: User not found on host
2014-11-12_21:00:42.84486 qr1hi-8i9sb-wdv358fgjhh8fsa ! srun: error: Task launch for 8348.0 failed on node compute55: User not found on host
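
To make the spread of causes easier to see, the srun errors quoted above can be grouped by reason. The following is only a rough sketch, not part of crunch-job: the category labels, the `classify` and `group_failures` helpers, and the regular expression are all assumptions made for illustration, based solely on the six log lines in this comment.

```python
import re
from collections import defaultdict

# Assumed layout of the log lines quoted above:
# "<timestamp> <job uuid> ! srun: error: <message>"
LINE_RE = re.compile(r"^(\S+)\s+(\S+)\s+!\s+srun: error: (.*)$")

# Hypothetical category labels, keyed by a substring of the srun message.
CATEGORIES = [
    ("User not found on host", "user-not-found"),
    ("Unable to resolve", "unknown-host"),
    ("Memory required by task is not available", "insufficient-memory"),
]


def classify(message):
    """Return the first matching category label, or "other"."""
    for needle, label in CATEGORIES:
        if needle in message:
            return label
    return "other"


def group_failures(lines):
    """Map each category label to the job UUIDs that reported it."""
    groups = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not an srun error line
        _timestamp, job_uuid, message = match.groups()
        groups[classify(message)].append(job_uuid)
    return dict(groups)


# The six lines from the comment above, copied verbatim.
SAMPLE = [
    '2014-11-17_18:25:22.70437 qr1hi-8i9sb-5cb9i2keadpevcq ! srun: error: Unable to resolve "compute28": Unknown host',
    '2014-11-17_18:25:22.77063 qr1hi-8i9sb-eipb3kuyac7absp ! srun: error: Unable to create job step: Memory required by task is not available',
    '2014-11-17_16:15:55.51721 qr1hi-8i9sb-wl1oh0bbw072acd ! srun: error: Task launch for 8473.0 failed on node compute9: User not found on host',
    '2014-11-17_16:25:56.09087 qr1hi-8i9sb-vfrtoez7tgmybv0 ! srun: error: Task launch for 8476.0 failed on node compute9: User not found on host',
    '2014-11-12_21:00:42.45803 qr1hi-8i9sb-kryvvban6b7hj74 ! srun: error: Task launch for 8344.0 failed on node compute3: User not found on host',
    '2014-11-12_21:00:42.84486 qr1hi-8i9sb-wdv358fgjhh8fsa ! srun: error: Task launch for 8348.0 failed on node compute55: User not found on host',
]

if __name__ == "__main__":
    for label, jobs in sorted(group_failures(SAMPLE).items()):
        print(f"{label}: {len(jobs)} job(s): {', '.join(jobs)}")
```

On the lines above, four of the six jobs land in the "User not found on host" group, which is the failure tracked in #4482; the other two are a host resolution failure and a memory allocation failure.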

#11 Updated by Tom Clegg over 4 years ago

  • Target version changed from 2014-11-19 sprint to Arvados Future Sprints
