Bug #9630

Updated by Brett Smith about 4 years ago

This might be a problem in crunch-dispatch-slurm, in crunch-run, or the deployment of either. Given this runit script:

<pre><code class="sh">#!/bin/sh
set -e
exec 2>&1

user=crunch
envdir="$(pwd)/env"
exec chpst -e "$envdir" -u "$user" crunch-dispatch-slurm
</code></pre>

and this container request:

<pre><code class="json">{
"uuid":"9tee4-xvhdp-mh2vgj7x1ro4cnn",
"command":[
"true"
],
"container_count_max":null,
"container_image":"arvados/jobs:latest",
"container_uuid":"9tee4-dz642-x5xeob5l96qm44t",
"cwd":".",
"description":null,
"environment":{},
"expires_at":null,
"filters":null,
"mounts":{
"/out":{
"kind":"tmp",
"capacity":1000
}
},
"name":"Brett test 2016-07-19a",
"output_path":"/out",
"priority":1,
"properties":{},
"requesting_container_uuid":null,
"runtime_constraints":{
"ram":524288000,
"vcpus":1
},
"state":"Committed"
}
</code></pre>

And this container:

<pre><code class="json">{
"uuid":"9tee4-dz642-x5xeob5l96qm44t",
"command":[
"true"
],
"container_image":"arvados/jobs:latest",
"cwd":".",
"environment":{},
"exit_code":null,
"finished_at":null,
"locked_by_uuid":"9tee4-gj3su-n39vrgwxelusj7n",
"log":null,
"mounts":{
"/out":{
"kind":"tmp",
"capacity":1000
}
},
"output":null,
"output_path":"/out",
"priority":1,
"progress":null,
"runtime_constraints":{
"ram":524288000,
"vcpus":1
},
"started_at":null,
"state":"Locked"
}
</code></pre>

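crunch-dispatch-slurm gets stuck in an infinite loop. It submits the container with sbatch, finds the job missing from the slurm queue, changes the container back to Queued, and submits it again: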

<pre>2016-07-19_14:51:57.21866 2016/07/19 14:51:57 Monitoring container 9tee4-dz642-x5xeob5l96qm44t started
2016-07-19_14:52:07.05396 2016/07/19 14:52:07 About to submit queued container 9tee4-dz642-x5xeob5l96qm44t
2016-07-19_14:52:07.06612 2016/07/19 14:52:07 sbatch succeeded: Submitted batch job 26
2016-07-19_14:52:17.08024 2016/07/19 14:52:17 Container 9tee4-dz642-x5xeob5l96qm44t in state Locked but missing from slurm queue, changing to Queued.
2016-07-19_14:52:17.15452 2016/07/19 14:52:17 Monitoring container 9tee4-dz642-x5xeob5l96qm44t finished
2016-07-19_14:52:17.31871 2016/07/19 14:52:17 Monitoring container 9tee4-dz642-x5xeob5l96qm44t started
2016-07-19_14:52:27.05337 2016/07/19 14:52:27 About to submit queued container 9tee4-dz642-x5xeob5l96qm44t
2016-07-19_14:52:27.06501 2016/07/19 14:52:27 sbatch succeeded: Submitted batch job 27
2016-07-19_14:52:27.08553 2016/07/19 14:52:27 Container 9tee4-dz642-x5xeob5l96qm44t in state Locked but missing from slurm queue, changing to Queued.
2016-07-19_14:52:27.16573 2016/07/19 14:52:27 Monitoring container 9tee4-dz642-x5xeob5l96qm44t finished
</pre>

… and so on.
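
To check how often this happens, the log can be summarized per container. This is a rough sketch, not part of crunch-dispatch-slurm: the function name @count_requeue_loops@ is made up for illustration, and it assumes the log messages have exactly the wording shown in the excerpt above.

```shell
#!/bin/sh
# Hypothetical helper: summarize submit/requeue cycles per container from
# crunch-dispatch-slurm log output read on stdin.
# Assumes the message wording matches the log excerpt in this ticket.
count_requeue_loops() {
    awk '
        # "About to submit queued container <uuid>": the UUID is the last field.
        /About to submit queued container/ { submitted[$NF]++ }

        # "Container <uuid> in state Locked but missing from slurm queue, ...":
        # the UUID is the field right after the word "Container".
        /missing from slurm queue/ {
            for (i = 1; i < NF; i++) {
                if ($i == "Container") { requeued[$(i + 1)]++; break }
            }
        }

        END {
            for (c in submitted)
                printf "%s submitted=%d requeued=%d\n", c, submitted[c], requeued[c]
        }
    '
}
```

Feed it the dispatcher's log on stdin, for example @count_requeue_loops < dispatch.log@ (the file name is a placeholder). A container whose submitted and requeued counts keep rising together is stuck in the loop described here.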
