Bug #4494
closed
[Crunch] Do more error-checking and show more diagnostic info when installing a Docker image
Description
Job qr1hi-8i9sb-vx84guvzp3xvwgz failed with this diagnostic output:
11/10/2014 5:05:18 PM crunch Version 83a9390a05bbffc2e4ea95dd693af3ab3547fa12 is commit 83a9390a05bbffc2e4ea95dd693af3ab3547fa12
11/10/2014 5:05:18 PM crunch Run install script on all workers
11/10/2014 5:05:18 PM crunch Install script exited 1
11/10/2014 5:05:18 PM crunch Installing Docker image from 0b1b526683d86c41696eea9353ab5807+4242 exited 1 at /usr/local/arvados/src/sdk/cli/bin/crunch-job line 603
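The output above reports only "exited 1", with no indication of why the install step failed. As a rough illustration of the kind of reporting this ticket asks for, here is a minimal Python sketch that runs a step, captures its stderr, and includes the last few lines in the error. This is not the crunch-job code; `run_with_diagnostics`, `StepFailed`, and the `tail_lines` parameter are hypothetical names invented for this example.

```python
import subprocess


class StepFailed(RuntimeError):
    """Hypothetical error type carrying the exit code and the tail of stderr."""

    def __init__(self, description, returncode, stderr_tail):
        self.returncode = returncode
        self.stderr_tail = stderr_tail
        detail = "\n".join(stderr_tail) or "(no stderr output)"
        super().__init__(f"{description} exited {returncode}:\n{detail}")


def run_with_diagnostics(cmd, description, tail_lines=10):
    """Run cmd; on a non-zero exit, raise StepFailed with the last stderr lines."""
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output as text
    )
    if proc.returncode != 0:
        # Keep only the tail so a noisy step does not flood the job log.
        tail = proc.stderr.splitlines()[-tail_lines:]
        raise StepFailed(description, proc.returncode, tail)
    return proc.stdout
```

A caller would wrap each setup step (for example, the image install) in `run_with_diagnostics` and log the resulting message, so the job log shows the underlying error text next to the exit code.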
Updated by Tim Pierce over 10 years ago
- Description updated (diff)
- Category set to Crunch
Updated by Bryan Cosca over 10 years ago
More examples: qr1hi-8i9sb-kryvvban6b7hj74 qr1hi-8i9sb-wdv358fgjhh8fsa
This is starting to be a small annoyance with Arvados :( It's not a big deal because I can just re-run the job, but I feel that new users would get discouraged by this bug.
Updated by Tom Clegg over 10 years ago
- Subject changed from [Crunch] job fails to install Docker image to [Crunch] Do more error-checking and show more diagnostic info when installing a Docker image
- Story points set to 1.0
Updated by Ward Vandewege over 10 years ago
- Target version changed from Bug Triage to Arvados Future Sprints
Updated by Bryan Cosca over 10 years ago
This is starting to be bothersome. I've been trying to rerun this same job and I keep getting the error:
2ecb1a2a2b3574fb5c7fc0b1262c6c8c+83/qr1hi-8i9sb-lhf4q05ykxx6yzi.log.txt
2c7ad84b2506214b58f19e98f48d0731+83/qr1hi-8i9sb-fvcnteaq0z7xngb.log.txt
7163ba9c982dedb701a8b8c269bcd7ff+83/qr1hi-8i9sb-uqh0plj8644g9mp.log.txt
Updated by Bryan Cosca over 10 years ago
The jobs in my previous update ran successfully the next day. Today, this one is bothering me: I've run it twice and ran into this issue:
qr1hi-8i9sb-vfrtoez7tgmybv0#Log
qr1hi-8i9sb-wl1oh0bbw072acd#Log
Updated by Tim Pierce over 10 years ago
- Target version changed from Arvados Future Sprints to 2014-11-19 sprint
Updated by Tim Pierce over 10 years ago
The logs suggest that this is related to #4482 (though it may be a red herring):
2014-11-17_16:25:59.30230 qr1hi-8i9sb-vfrtoez7tgmybv0 23751 Install script exited 1
2014-11-17_16:25:59.30237 qr1hi-8i9sb-vfrtoez7tgmybv0 ! srun: error: Task launch for 8476.2 failed on node compute9: User not found on host
2014-11-17_16:25:59.30246 qr1hi-8i9sb-vfrtoez7tgmybv0 ! srun: error: Application launch failed: User not found on host
2014-11-17_16:25:59.30253 qr1hi-8i9sb-vfrtoez7tgmybv0 ! srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
2014-11-17_16:25:59.30261 qr1hi-8i9sb-vfrtoez7tgmybv0 ! slurmd[compute20]: error: *** STEP 8476.2 KILLED AT 2014-11-17T16:25:59 WITH SIGNAL 9 ***
2014-11-17_16:25:59.30270 qr1hi-8i9sb-vfrtoez7tgmybv0 ! slurmd[compute29]: error: *** STEP 8476.2 KILLED AT 2014-11-17T16:25:59 WITH SIGNAL 9 ***
2014-11-17_16:25:59.30277 qr1hi-8i9sb-vfrtoez7tgmybv0 ! slurmd[compute18]: error: *** STEP 8476.2 KILLED AT 2014-11-17T16:25:59 WITH SIGNAL 9 ***
2014-11-17_16:25:59.30288 qr1hi-8i9sb-vfrtoez7tgmybv0 ! slurmd[compute6]: error: *** STEP 8476.2 KILLED AT 2014-11-17T16:25:59 WITH SIGNAL 9 ***
2014-11-17_16:25:59.30300 qr1hi-8i9sb-vfrtoez7tgmybv0 ! srun: error: compute29: task 4: Killed
2014-11-17_16:25:59.30307 qr1hi-8i9sb-vfrtoez7tgmybv0 ! srun: error: compute20: task 3: Killed
2014-11-17_16:25:59.30315 qr1hi-8i9sb-vfrtoez7tgmybv0 ! srun: error: compute6: task 0: Killed
2014-11-17_16:25:59.30323 qr1hi-8i9sb-vfrtoez7tgmybv0 ! srun: error: compute18: task 2: Killed
2014-11-17_16:26:01.00771 qr1hi-8i9sb-vfrtoez7tgmybv0 ! srun: error: Timed out waiting for job step to complete
2014-11-17_16:26:01.33365 qr1hi-8i9sb-vfrtoez7tgmybv0 23751 Installing Docker image from 0b1b526683d86c41696eea9353ab5807+4242 exited 1 at /usr/local/arvados/src/sdk/cli/bin/crunch-job line 603
Updated by Bryan Cosca over 10 years ago
If it helps, I've found more jobs:
qr1hi-8i9sb-eipb3kuyac7absp#Log
qr1hi-8i9sb-5cb9i2keadpevcq#Log
Updated by Tim Pierce over 10 years ago
These jobs are failing for a variety of reasons. Some look like straightforward compute node failures; others may not be:
2014-11-17_18:25:22.70437 qr1hi-8i9sb-5cb9i2keadpevcq ! srun: error: Unable to resolve "compute28": Unknown host
2014-11-17_18:25:22.77063 qr1hi-8i9sb-eipb3kuyac7absp ! srun: error: Unable to create job step: Memory required by task is not available
2014-11-17_16:15:55.51721 qr1hi-8i9sb-wl1oh0bbw072acd ! srun: error: Task launch for 8473.0 failed on node compute9: User not found on host
2014-11-17_16:25:56.09087 qr1hi-8i9sb-vfrtoez7tgmybv0 ! srun: error: Task launch for 8476.0 failed on node compute9: User not found on host
2014-11-12_21:00:42.45803 qr1hi-8i9sb-kryvvban6b7hj74 ! srun: error: Task launch for 8344.0 failed on node compute3: User not found on host
2014-11-12_21:00:42.84486 qr1hi-8i9sb-wdv358fgjhh8fsa ! srun: error: Task launch for 8348.0 failed on node compute55: User not found on host
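To make the grouping above easier to repeat on other job logs, here is a small Python sketch that sorts `srun: error:` lines into the three failure kinds seen in these excerpts. The category names (`user_not_found`, `unknown_host`, `memory_unavailable`) and the functions `classify` and `summarize` are hypothetical; the patterns are copied from the messages quoted above and are not an exhaustive list of srun errors.

```python
import re
from collections import defaultdict

# Patterns taken from the srun messages quoted in this ticket; category
# names are made up for this example.
PATTERNS = [
    ("user_not_found",
     re.compile(r"Task launch for \S+ failed on node (?P<node>\S+): User not found on host")),
    ("unknown_host",
     re.compile(r'Unable to resolve "(?P<node>[^"]+)": Unknown host')),
    ("memory_unavailable",
     re.compile(r"Memory required by task is not available")),
]

# Log line layout assumed here: "<timestamp> <job uuid> ! srun: error: <message>"
LINE_RE = re.compile(r"^(?P<ts>\S+) (?P<job>\S+) ! srun: error: (?P<msg>.*)$")


def classify(line):
    """Return (job, category, node) for an srun error line, or None otherwise."""
    m = LINE_RE.match(line.strip())
    if not m:
        return None
    for category, pattern in PATTERNS:
        hit = pattern.search(m.group("msg"))
        if hit:
            # Not every pattern names a node (e.g. the memory error).
            return m.group("job"), category, hit.groupdict().get("node")
    return m.group("job"), "other", None


def summarize(lines):
    """Group (job, node) pairs by failure category."""
    summary = defaultdict(list)
    for line in lines:
        result = classify(line)
        if result:
            job, category, node = result
            summary[category].append((job, node))
    return dict(summary)
```

Feeding the six lines above to `summarize` would put four jobs under `user_not_found` (two of them on compute9), one under `unknown_host`, and one under `memory_unavailable`.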
Updated by Tom Clegg over 10 years ago
- Target version changed from 2014-11-19 sprint to Arvados Future Sprints
Updated by Ward Vandewege over 3 years ago
- Target version deleted (Arvados Future Sprints)