Bug #9630

[Crunch2] crunch-dispatch-slurm can't successfully dispatch work when run from a directory it can't write to

Added by Brett Smith almost 4 years ago. Updated almost 4 years ago.

Status: New
Priority: Normal
Assigned To: -
Category: Crunch
Target version:
Start date: 07/19/2016
Due date:
% Done: 0%
Estimated time:
Story points: 0.5

Description

This might be a problem in crunch-dispatch-slurm, in crunch-run, or the deployment of either. Given this runit script:

#!/bin/sh
set -e
exec 2>&1

user=crunch
envdir="$(pwd)/env" 
exec chpst -e "$envdir" -u "$user" crunch-dispatch-slurm

and this container request:

{
 "uuid":"9tee4-xvhdp-mh2vgj7x1ro4cnn",
 "command":[
  "true" 
 ],
 "container_count_max":null,
 "container_image":"arvados/jobs:latest",
 "container_uuid":"9tee4-dz642-x5xeob5l96qm44t",
 "cwd":".",
 "description":null,
 "environment":{},
 "expires_at":null,
 "filters":null,
 "mounts":{
  "/out":{
   "kind":"tmp",
   "capacity":1000
  }
 },
 "name":"Brett test 2016-07-19a",
 "output_path":"/out",
 "priority":1,
 "properties":{},
 "requesting_container_uuid":null,
 "runtime_constraints":{
  "ram":524288000,
  "vcpus":1
 },
 "state":"Committed" 
}

And this container:

{
 "uuid":"9tee4-dz642-x5xeob5l96qm44t",
 "command":[
  "true" 
 ],
 "container_image":"arvados/jobs:latest",
 "cwd":".",
 "environment":{},
 "exit_code":null,
 "finished_at":null,
 "locked_by_uuid":"9tee4-gj3su-n39vrgwxelusj7n",
 "log":null,
 "mounts":{
  "/out":{
   "kind":"tmp",
   "capacity":1000
  }
 },
 "output":null,
 "output_path":"/out",
 "priority":1,
 "progress":null,
 "runtime_constraints":{
  "ram":524288000,
  "vcpus":1
 },
 "started_at":null,
 "state":"Locked" 
}

crunch-dispatch-slurm gets stuck in an infinite loop:

2016-07-19_14:51:57.21866 2016/07/19 14:51:57 Monitoring container 9tee4-dz642-x5xeob5l96qm44t started
2016-07-19_14:52:07.05396 2016/07/19 14:52:07 About to submit queued container 9tee4-dz642-x5xeob5l96qm44t
2016-07-19_14:52:07.06612 2016/07/19 14:52:07 sbatch succeeded: Submitted batch job 26
2016-07-19_14:52:17.08024 2016/07/19 14:52:17 Container 9tee4-dz642-x5xeob5l96qm44t in state Locked but missing from slurm queue, changing to Queued.
2016-07-19_14:52:17.15452 2016/07/19 14:52:17 Monitoring container 9tee4-dz642-x5xeob5l96qm44t finished
2016-07-19_14:52:17.31871 2016/07/19 14:52:17 Monitoring container 9tee4-dz642-x5xeob5l96qm44t started
2016-07-19_14:52:27.05337 2016/07/19 14:52:27 About to submit queued container 9tee4-dz642-x5xeob5l96qm44t
2016-07-19_14:52:27.06501 2016/07/19 14:52:27 sbatch succeeded: Submitted batch job 27
2016-07-19_14:52:27.08553 2016/07/19 14:52:27 Container 9tee4-dz642-x5xeob5l96qm44t in state Locked but missing from slurm queue, changing to Queued.
2016-07-19_14:52:27.16573 2016/07/19 14:52:27 Monitoring container 9tee4-dz642-x5xeob5l96qm44t finished

… and so on.

History

#1 Updated by Tom Clegg almost 4 years ago

I think this is exactly what crunch-dispatch looks like if crunch-run isn't installed on the compute nodes. Even though I don't think that's the problem in this particular case, we should figure out a better way to handle that situation.

This might turn up some clues:

crunch@compute0$ crunch-run 9tee4-dz642-x5xeob5l96qm44t

#2 Updated by Brett Smith almost 4 years ago

I noticed that the container's container_image was not content-addressed, so I upgraded the API server on the cluster to get #8470. I confirmed the upgrade took effect with a new container request and container, but crunch-dispatch-slurm is still looping.

#3 Updated by Brett Smith almost 4 years ago

Tom Clegg wrote:

This might turn up some clues:

That works if I give it the same ARVADOS environment variables crunch-dispatch-slurm has. Maybe those aren't getting passed through?

#4 Updated by Brett Smith almost 4 years ago

I'm fairly sure one problem is that the script crunch-dispatch-slurm hands to sbatch does not use srun. It may be trying to run crunch-run on the dispatch node, rather than the compute node.

#5 Updated by Brett Smith almost 4 years ago

Brett Smith wrote:

I'm fairly sure one problem is that the script crunch-dispatch-slurm hands to sbatch does not use srun. It may be trying to run crunch-run on the dispatch node, rather than the compute node.

This is not right either. As Tom pointed out from the sbatch man page, the script runs on the first allocated compute node.

This was resolvable just through ops: I extended the runit script to cd /home/crunch before starting crunch-dispatch-slurm. With that change, the entire dispatch pipeline started working as intended.
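
For reference, the extended runit script might look like the sketch below. It assumes /home/crunch exists and is writable by the crunch user wherever it matters. Resolving envdir before the cd is an assumption about how the change was made, needed so that the path still points into the service directory.

```shell
#!/bin/sh
set -e
exec 2>&1

user=crunch
# Resolve envdir first: it is relative to the runit service directory,
# which is the working directory when this script starts.
envdir="$(pwd)/env"

# Assumption: /home/crunch is writable by $user. sbatch is then invoked
# from a directory where the job's slurm-*.out file can be written.
cd /home/crunch
exec chpst -e "$envdir" -u "$user" crunch-dispatch-slurm
```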

Apparently, running sbatch from a directory that the running user can't write to (either on the dispatch node or on the compute node; I'm not sure which) means that sbatch succeeds but the job fails. Tom suggested on IRC that we might consider using sbatch -D to avoid this problem. Or maybe this is something we should cover and be clear about in the Install Guide. There are lots of possible approaches.
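
The sbatch -D idea could be sketched as a small wrapper like the one below. The names submit_in_dir and SBATCH are hypothetical and exist only for illustration; this ticket does not show how crunch-dispatch-slurm actually builds its sbatch command line. The -D option sets the working directory of the batch script.

```shell
#!/bin/sh
# Sketch only. SBATCH is a hypothetical override for the sbatch binary,
# mainly so the wrapper can be exercised without a Slurm installation.
SBATCH="${SBATCH:-sbatch}"

# submit_in_dir WORKDIR [SBATCH_ARGS...]
# Submit a batch job with an explicit working directory, so the job does
# not inherit whatever directory the dispatcher was started from.
submit_in_dir() {
    workdir="$1"
    shift
    "$SBATCH" -D "$workdir" "$@"
}
```

For example, submit_in_dir /tmp job.sh would run sbatch -D /tmp job.sh, where job.sh is a placeholder for the batch script.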

I'm leaving this open to discuss how we want to settle this.

#6 Updated by Brett Smith almost 4 years ago

  • Subject changed from [Crunch2] Container dispatch stuck in an infinite loop to [Crunch2] crunch-dispatch-slurm can't successfully dispatch work when run from a directory it can't write to

#7 Updated by Brett Smith almost 4 years ago

  • Description updated

#8 Updated by Tom Clegg almost 4 years ago

Suggestions:
  • In crunch-dispatch-slurm, print a warning if cwd is not writable on the dispatch host. This doesn't actually mean the same cwd won't be writable on the compute nodes, so it can't be an error, but it seems like an easy check that will be accurate in a lot of common setups.
  • cd /tmp in the install guide, with a comment explaining it needs to be writable by the crunch user on all compute nodes, and will accumulate slurm-*.out files.
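
The first suggestion could be prototyped in the runit script itself before any change to crunch-dispatch-slurm. The following is a minimal sketch; warn_if_unwritable is a hypothetical helper, not an existing Arvados function. As noted above, it only checks the dispatch host, so it warns instead of failing.

```shell
#!/bin/sh
# warn_if_unwritable DIR
# Return 0 if DIR is a directory the current user can write to.
# Otherwise print a warning to stderr and return 1.
warn_if_unwritable() {
    dir="$1"
    if [ -d "$dir" ] && [ -w "$dir" ]; then
        return 0
    fi
    echo "warning: $dir is not writable by $(id -un);" \
         "slurm jobs submitted from here may fail" >&2
    return 1
}

# Possible use before starting the dispatcher (warning only, never fatal):
#   warn_if_unwritable "$(pwd)" || true
```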

#9 Updated by Tom Morris almost 4 years ago

  • Story points set to 0.5
