Bug #9630

Updated by Brett Smith over 7 years ago

This might be a problem in crunch-dispatch-slurm, in crunch-run, or in the deployment of either. Given this runit script:

 <pre><code class="sh">#!/bin/sh 
 set -e 
 exec 2>&1 

 user=crunch 
 envdir="$(pwd)/env" 
 exec chpst -e "$envdir" -u "$user" crunch-dispatch-slurm 
 </code></pre> 

 and this container request: 

 <pre><code class="json">{ 
  "uuid":"9tee4-xvhdp-mh2vgj7x1ro4cnn", 
  "command":[ 
   "true" 
  ], 
  "container_count_max":null, 
  "container_image":"arvados/jobs:latest", 
  "container_uuid":"9tee4-dz642-x5xeob5l96qm44t", 
  "cwd":".", 
  "description":null, 
  "environment":{}, 
  "expires_at":null, 
  "filters":null, 
  "mounts":{ 
   "/out":{ 
    "kind":"tmp", 
    "capacity":1000 
   } 
  }, 
  "name":"Brett test 2016-07-19a", 
  "output_path":"/out", 
  "priority":1, 
  "properties":{}, 
  "requesting_container_uuid":null, 
  "runtime_constraints":{ 
   "ram":524288000, 
   "vcpus":1 
  }, 
  "state":"Committed" 
 } 
 </code></pre> 

 and this container:

 <pre><code class="json">{ 
  "uuid":"9tee4-dz642-x5xeob5l96qm44t", 
  "command":[ 
   "true" 
  ], 
  "container_image":"arvados/jobs:latest", 
  "cwd":".", 
  "environment":{}, 
  "exit_code":null, 
  "finished_at":null, 
  "locked_by_uuid":"9tee4-gj3su-n39vrgwxelusj7n", 
  "log":null, 
  "mounts":{ 
   "/out":{ 
    "kind":"tmp", 
    "capacity":1000 
   } 
  }, 
  "output":null, 
  "output_path":"/out", 
  "priority":1, 
  "progress":null, 
  "runtime_constraints":{ 
   "ram":524288000, 
   "vcpus":1 
  }, 
  "started_at":null, 
  "state":"Locked" 
 } 
 </code></pre> 

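 crunch-dispatch-slurm gets stuck in an infinite loop of submitting the container, finding it missing from the slurm queue, and changing it back to Queued: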

 <pre>2016-07-19_14:51:57.21866 2016/07/19 14:51:57 Monitoring container 9tee4-dz642-x5xeob5l96qm44t started 
 2016-07-19_14:52:07.05396 2016/07/19 14:52:07 About to submit queued container 9tee4-dz642-x5xeob5l96qm44t 
 2016-07-19_14:52:07.06612 2016/07/19 14:52:07 sbatch succeeded: Submitted batch job 26 
 2016-07-19_14:52:17.08024 2016/07/19 14:52:17 Container 9tee4-dz642-x5xeob5l96qm44t in state Locked but missing from slurm queue, changing to Queued. 
 2016-07-19_14:52:17.15452 2016/07/19 14:52:17 Monitoring container 9tee4-dz642-x5xeob5l96qm44t finished 
 2016-07-19_14:52:17.31871 2016/07/19 14:52:17 Monitoring container 9tee4-dz642-x5xeob5l96qm44t started 
 2016-07-19_14:52:27.05337 2016/07/19 14:52:27 About to submit queued container 9tee4-dz642-x5xeob5l96qm44t 
 2016-07-19_14:52:27.06501 2016/07/19 14:52:27 sbatch succeeded: Submitted batch job 27 
 2016-07-19_14:52:27.08553 2016/07/19 14:52:27 Container 9tee4-dz642-x5xeob5l96qm44t in state Locked but missing from slurm queue, changing to Queued. 
 2016-07-19_14:52:27.16573 2016/07/19 14:52:27 Monitoring container 9tee4-dz642-x5xeob5l96qm44t finished 
 </pre> 

 … and so on.
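
To confirm the same cycle on another dispatcher log, something like the following sketch might help. It counts submissions and requeues per container UUID. The function names and the `threshold` parameter are hypothetical, and the message patterns are copied from the excerpt above, so other crunch-dispatch-slurm versions may word them differently.

```python
import re
from collections import Counter

# Patterns copied from the log excerpt in this report (an assumption:
# other crunch-dispatch-slurm versions may phrase these differently).
SUBMIT_RE = re.compile(r"About to submit queued container (\S+)")
REQUEUE_RE = re.compile(
    r"Container (\S+) in state Locked but missing from slurm queue"
)


def summarize_dispatch_log(lines):
    """Count submissions and requeues per container UUID."""
    submits, requeues = Counter(), Counter()
    for line in lines:
        match = SUBMIT_RE.search(line)
        if match:
            submits[match.group(1)] += 1
            continue
        match = REQUEUE_RE.search(line)
        if match:
            requeues[match.group(1)] += 1
    return submits, requeues


def looping_containers(lines, threshold=2):
    """Return UUIDs submitted and requeued at least `threshold` times."""
    submits, requeues = summarize_dispatch_log(lines)
    return sorted(
        uuid
        for uuid in submits
        if submits[uuid] >= threshold and requeues[uuid] >= threshold
    )
```

Passing the lines of the runit log (for example, `looping_containers(open(path))` with `path` pointing at the service's log file) should list any container that is being resubmitted repeatedly.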
