Bug #9630
open[Crunch2] crunch-dispatch-slurm can't successfully dispatch work when run from a directory it can't write to
Story points:
0.5
Release:
Release relationship:
Auto
Description
This might be a problem in crunch-dispatch-slurm, in crunch-run, or the deployment of either. Given this runit script:
#!/bin/sh
set -e
exec 2>&1
user=crunch
envdir="$(pwd)/env"
exec chpst -e "$envdir" -u "$user" crunch-dispatch-slurm
and this container request:
{
  "uuid":"9tee4-xvhdp-mh2vgj7x1ro4cnn",
  "command":[
    "true"
  ],
  "container_count_max":null,
  "container_image":"arvados/jobs:latest",
  "container_uuid":"9tee4-dz642-x5xeob5l96qm44t",
  "cwd":".",
  "description":null,
  "environment":{},
  "expires_at":null,
  "filters":null,
  "mounts":{
    "/out":{
      "kind":"tmp",
      "capacity":1000
    }
  },
  "name":"Brett test 2016-07-19a",
  "output_path":"/out",
  "priority":1,
  "properties":{},
  "requesting_container_uuid":null,
  "runtime_constraints":{
    "ram":524288000,
    "vcpus":1
  },
  "state":"Committed"
}
and this container:
{
  "uuid":"9tee4-dz642-x5xeob5l96qm44t",
  "command":[
    "true"
  ],
  "container_image":"arvados/jobs:latest",
  "cwd":".",
  "environment":{},
  "exit_code":null,
  "finished_at":null,
  "locked_by_uuid":"9tee4-gj3su-n39vrgwxelusj7n",
  "log":null,
  "mounts":{
    "/out":{
      "kind":"tmp",
      "capacity":1000
    }
  },
  "output":null,
  "output_path":"/out",
  "priority":1,
  "progress":null,
  "runtime_constraints":{
    "ram":524288000,
    "vcpus":1
  },
  "started_at":null,
  "state":"Locked"
}
crunch-dispatch-slurm gets stuck in an infinite loop:
2016-07-19_14:51:57.21866 2016/07/19 14:51:57 Monitoring container 9tee4-dz642-x5xeob5l96qm44t started
2016-07-19_14:52:07.05396 2016/07/19 14:52:07 About to submit queued container 9tee4-dz642-x5xeob5l96qm44t
2016-07-19_14:52:07.06612 2016/07/19 14:52:07 sbatch succeeded: Submitted batch job 26
2016-07-19_14:52:17.08024 2016/07/19 14:52:17 Container 9tee4-dz642-x5xeob5l96qm44t in state Locked but missing from slurm queue, changing to Queued.
2016-07-19_14:52:17.15452 2016/07/19 14:52:17 Monitoring container 9tee4-dz642-x5xeob5l96qm44t finished
2016-07-19_14:52:17.31871 2016/07/19 14:52:17 Monitoring container 9tee4-dz642-x5xeob5l96qm44t started
2016-07-19_14:52:27.05337 2016/07/19 14:52:27 About to submit queued container 9tee4-dz642-x5xeob5l96qm44t
2016-07-19_14:52:27.06501 2016/07/19 14:52:27 sbatch succeeded: Submitted batch job 27
2016-07-19_14:52:27.08553 2016/07/19 14:52:27 Container 9tee4-dz642-x5xeob5l96qm44t in state Locked but missing from slurm queue, changing to Queued.
2016-07-19_14:52:27.16573 2016/07/19 14:52:27 Monitoring container 9tee4-dz642-x5xeob5l96qm44t finished
… and so on.
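To check the precondition named in the title, a minimal diagnostic sketch follows. The function name can_write_dir is hypothetical and not part of Arvados or Slurm. It tests whether the invoking user can create a file in a directory, so it would need to be run as the dispatch user (crunch in the runit script above) for the result to say anything about this bug.

```shell
#!/bin/sh
# Hypothetical diagnostic helper (not part of Arvados or Slurm):
# succeed only if the invoking user can create a file in the given directory.
can_write_dir() {
    dir=$1
    # A missing path or a non-directory counts as "not writable".
    [ -d "$dir" ] || return 1
    # mktemp creates a uniquely named probe file; failure means no write access.
    probe=$(mktemp "$dir/.write-probe.XXXXXX" 2>/dev/null) || return 1
    # Clean up the probe so the check leaves nothing behind.
    rm -f "$probe"
    return 0
}

# Example use (run as the dispatch user, from the service directory):
#   can_write_dir "$(pwd)" || echo "working directory is not writable" >&2
```

Creating a probe file is used here instead of `[ -w "$dir" ]` because it exercises the same operation that fails in practice, at the cost of briefly touching the directory.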