Bug #5943

User uuid changed during a pipeline run

Added by Bryan Cosca almost 9 years ago. Updated over 4 years ago.

Status:
Closed
Priority:
Normal
Assigned To:
-
Category:
-
Target version:
-
Story points:
-

Description

Pipeline run qr1hi-d1hrv-bgj6vtddg9cmyyt looks like it failed because nodes were down:

2015-05-07_18:33:05 salloc: error: Failed to allocate resources: Required node not available (down or drained)
2015-05-07_18:34:44 salloc: Granted job allocation 1226
2015-05-07_18:34:46 qr1hi-8i9sb-v2ojzh70vxl2xhh 8317 running from /usr/local/arvados/src/sdk/cli/bin/crunch-job with arvados-cli Gem version(s) 0.1.20150205181653, 0.1.20150128223752, 0.1.20150121183928, 0.1.20141209151444, 0.1.20141014201516, 0.1.20140919104705, 0.1.20140905165259, 0.1.20140827170424, 0.1.20140825141611, 0.1.20140812162850, 0.1.20140708213257, 0.1.20140707162447, 0.1.20140630151639, 0.1.20140513131358, 0.1.20140513101345, 0.1.20140414145041
2015-05-07_18:34:46 qr1hi-8i9sb-v2ojzh70vxl2xhh 8317 check slurm allocation
2015-05-07_18:34:46 qr1hi-8i9sb-v2ojzh70vxl2xhh 8317 node compute26 - 1 slots
2015-05-07_18:34:46 qr1hi-8i9sb-v2ojzh70vxl2xhh 8317 start
2015-05-07_18:34:46 qr1hi-8i9sb-v2ojzh70vxl2xhh 8317 Clean work dirs
2015-05-07_18:34:46 starting: ['srun','--nodelist=compute26','-D','/tmp','bash','-ec','mount -t fuse,fuse.keep | awk \'($3 ~ /\\ykeep\\y/){print $3}\' | xargs -r -n 1 fusermount -u -z; sleep 1; rm -rf $JOB_WORK $CRUNCH_INSTALL $CRUNCH_TMP/task $CRUNCH_TMP/src* $CRUNCH_TMP/*.cid']
2015-05-07_18:34:48 qr1hi-8i9sb-v2ojzh70vxl2xhh 8317 Cleanup command exited 0
2015-05-07_18:34:48 starting: ['srun','--nodelist=compute26','/bin/sh','-ec',' if ! /usr/bin/docker.io images -q --no-trunc --all | grep -qxF f4eafaf1e2d738e0f8d947feb725b5945f0219c5c4956eec6e164a0788abbab8; then arv-get 256f21bb3abfcd8e08a893886bf3e7c0\\+5082\\/f4eafaf1e2d738e0f8d947feb725b5945f0219c5c4956eec6e164a0788abbab8\\.tar | /usr/bin/docker.io load fi ']
2015-05-07_18:38:31 starting: ['srun','--nodelist=compute26','/bin/sh','-ec','/usr/bin/docker.io run --help | grep -qe --memory-swap=']
2015-05-07_18:38:31 qr1hi-8i9sb-v2ojzh70vxl2xhh 8317 Packing Arvados SDK version f0fe7273c1851cb93e9edd58c0b60d3590b222ed for installation
2015-05-07_18:38:32 qr1hi-8i9sb-v2ojzh70vxl2xhh 8317 Looking for version 83a9390a05bbffc2e4ea95dd693af3ab3547fa12 from repository arvados
2015-05-07_18:38:32 qr1hi-8i9sb-v2ojzh70vxl2xhh 8317 Using local repository '/var/lib/arvados/internal.git'
2015-05-07_18:38:32 qr1hi-8i9sb-v2ojzh70vxl2xhh 8317 Version 83a9390a05bbffc2e4ea95dd693af3ab3547fa12 is commit 83a9390a05bbffc2e4ea95dd693af3ab3547fa12
2015-05-07_18:38:32 qr1hi-8i9sb-v2ojzh70vxl2xhh 8317 Run install script on all workers
2015-05-07_18:38:32 starting: ['srun','--nodelist=compute26','-D','/tmp','--job-name=qr1hi-8i9sb-v2ojzh70vxl2xhh','sh','-c','mkdir -p /tmp/crunch-job/opt && cd /tmp/crunch-job && perl -']
2015-05-07_18:38:33 srun: error: compute26: task 0: Exited with exit code 141
2015-05-07_18:38:33 qr1hi-8i9sb-v2ojzh70vxl2xhh 8317 Install script exited 141
2015-05-07_18:38:33 salloc: Relinquishing job allocation 1226
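
A note on the "exit code 141" in the log above: under the common shell convention, a status above 128 means the process was terminated by signal (status - 128). Whether srun reported the status that way for this job is an assumption, not something the log confirms. The helper below is a minimal, hypothetical sketch (`describe_exit_code` is not part of Arvados or SLURM) for decoding such a status:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret an exit status using the shell convention 128 + signal number.

    Assumption: the status follows that convention; a program can also
    exit with a value above 128 on its own.
    """
    if code > 128:
        signum = code - 128
        try:
            # Map the number to a name such as SIGPIPE where the platform defines it.
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited normally with status {code}"


print(describe_exit_code(141))
```

On Linux, signal 13 is SIGPIPE, so 141 would be consistent with the install script being cut off while writing to a closed pipe. That is only one possible reading of the failure.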

This might also help: I thought it was odd that modified_by_user_uuid changed from qr1hi-tpzed-vm0nd4e7a013f8e to qr1hi-tpzed-000000000000000 during the run.

It would also be really helpful if this job re-ran automatically when a node goes down.

Actions #1

Updated by Bryan Cosca almost 9 years ago

  • Description updated (diff)
Actions #2

Updated by Peter Amstutz over 4 years ago

  • Status changed from New to Closed