Bug #9630

open

[Crunch2] crunch-dispatch-slurm can't successfully dispatch work when run from a directory it can't write to

Added by Brett Smith almost 8 years ago. Updated about 2 months ago.

Status: New
Priority: Normal
Assigned To: -
Category: Crunch
Target version:
Story points: 0.5
Release:
Release relationship: Auto

Description

This might be a problem in crunch-dispatch-slurm, in crunch-run, or in the deployment of either. Given this runit script:

#!/bin/sh
set -e
exec 2>&1

user=crunch
envdir="$(pwd)/env" 
exec chpst -e "$envdir" -u "$user" crunch-dispatch-slurm
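If the unwritable working directory named in the title really is the trigger, one way to test that would be to change into a directory the crunch user can write to before starting the dispatcher. The following is only a sketch under that assumption, not a confirmed fix; the helper name first_writable_dir and the candidate paths in the usage comment are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the first argument that is an existing directory
# writable by the current user. Returns non-zero if no argument qualifies.
first_writable_dir() {
    for dir in "$@"; do
        if [ -d "$dir" ] && [ -w "$dir" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Possible use (assumption: this runs as the crunch user, because root
# passes the -w test for almost any directory):
#   workdir="$(first_writable_dir /var/lib/crunch /tmp)" && cd "$workdir"
```

In the run script above, envdir is computed from $(pwd) before any such cd would happen, so the env directory lookup would not be affected.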

and this container request:

{
 "uuid":"9tee4-xvhdp-mh2vgj7x1ro4cnn",
 "command":[
  "true" 
 ],
 "container_count_max":null,
 "container_image":"arvados/jobs:latest",
 "container_uuid":"9tee4-dz642-x5xeob5l96qm44t",
 "cwd":".",
 "description":null,
 "environment":{},
 "expires_at":null,
 "filters":null,
 "mounts":{
  "/out":{
   "kind":"tmp",
   "capacity":1000
  }
 },
 "name":"Brett test 2016-07-19a",
 "output_path":"/out",
 "priority":1,
 "properties":{},
 "requesting_container_uuid":null,
 "runtime_constraints":{
  "ram":524288000,
  "vcpus":1
 },
 "state":"Committed" 
}

and this container:

{
 "uuid":"9tee4-dz642-x5xeob5l96qm44t",
 "command":[
  "true" 
 ],
 "container_image":"arvados/jobs:latest",
 "cwd":".",
 "environment":{},
 "exit_code":null,
 "finished_at":null,
 "locked_by_uuid":"9tee4-gj3su-n39vrgwxelusj7n",
 "log":null,
 "mounts":{
  "/out":{
   "kind":"tmp",
   "capacity":1000
  }
 },
 "output":null,
 "output_path":"/out",
 "priority":1,
 "progress":null,
 "runtime_constraints":{
  "ram":524288000,
  "vcpus":1
 },
 "started_at":null,
 "state":"Locked" 
}

crunch-dispatch-slurm gets stuck in an infinite loop, resubmitting the container every 10 seconds after finding it missing from the slurm queue:

2016-07-19_14:51:57.21866 2016/07/19 14:51:57 Monitoring container 9tee4-dz642-x5xeob5l96qm44t started
2016-07-19_14:52:07.05396 2016/07/19 14:52:07 About to submit queued container 9tee4-dz642-x5xeob5l96qm44t
2016-07-19_14:52:07.06612 2016/07/19 14:52:07 sbatch succeeded: Submitted batch job 26
2016-07-19_14:52:17.08024 2016/07/19 14:52:17 Container 9tee4-dz642-x5xeob5l96qm44t in state Locked but missing from slurm queue, changing to Queued.
2016-07-19_14:52:17.15452 2016/07/19 14:52:17 Monitoring container 9tee4-dz642-x5xeob5l96qm44t finished
2016-07-19_14:52:17.31871 2016/07/19 14:52:17 Monitoring container 9tee4-dz642-x5xeob5l96qm44t started
2016-07-19_14:52:27.05337 2016/07/19 14:52:27 About to submit queued container 9tee4-dz642-x5xeob5l96qm44t
2016-07-19_14:52:27.06501 2016/07/19 14:52:27 sbatch succeeded: Submitted batch job 27
2016-07-19_14:52:27.08553 2016/07/19 14:52:27 Container 9tee4-dz642-x5xeob5l96qm44t in state Locked but missing from slurm queue, changing to Queued.
2016-07-19_14:52:27.16573 2016/07/19 14:52:27 Monitoring container 9tee4-dz642-x5xeob5l96qm44t finished

… and so on.
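To confirm the loop from a saved copy of the log, it may help to count how often each container is submitted. This is a minimal sketch that assumes the log lines have the same shape as the excerpt above, with the container UUID as the last field; the function name count_submissions and the log path in the usage comment are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: count "About to submit queued container <uuid>" lines
# per container UUID. Reads the named files, or stdin if none are given.
count_submissions() {
    awk '/About to submit queued container/ { n[$NF]++ }
         END { for (u in n) print u, n[u] }' "$@" | sort
}

# Possible use (the path is a placeholder):
#   count_submissions /path/to/crunch-dispatch-slurm/log/current
```

A count that keeps growing for a single UUID while the container never leaves the Locked/Queued states would match the behavior shown in the excerpt.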
