Bug #10729

[Crunch2] Propagate error messages if sbatch command succeeds but crunch-run can't run (or can't log to the Arvados API)

Added by Tom Clegg 3 months ago. Updated 28 days ago.

Status: New
Start date: 12/14/2016
Priority: Normal
Due date:
Assignee: -
% Done:

Target version: Arvados Future Sprints
Story points: -
Velocity based estimate: -
Release: Crunch v2


Currently the only way to see what happened in these cases is by looking at slurm-XXXXX.out on the compute nodes, where XXXXX is the slurm job number. This is inconvenient or impossible for sysadmins, and impossible for regular users.

Example scenario: crunch-run is not installed on the compute node.

Related issues

Related to Arvados - Bug #10700: [Crunch2] crunch-dispatch-slurm pileup (Resolved, 01/27/2017)
Related to Arvados - Bug #11148: [Crunch2] Propagate dispatch error messages (e.g., sbatch... (New, 02/21/2017)


#1 Updated by Tom Morris about 1 month ago

  • Target version set to 2017-03-15 sprint

#2 Updated by Tom Clegg about 1 month ago

Thoughts about how to retrieve the error messages:
  • If the cluster is set up with shared home directories (which is common in slurm setups) then the messages might appear in $HOME/slurm-{jobid}.out
  • If we know which compute node tried to run the job, we might be able to run something like "srun --immediate --share --nodelist=computeX cat slurm-{jobid}.out"
    • If we don't know which node to look on, we might be able to look on all nodes using "srun --immediate --share -N{number_of_nodes} cat slurm-{jobid}.out".
    • Either way, success depends on cluster configuration. "The default shared behavior depends on system configuration and the partition's Shared option takes precedence over the job's option."
  • In the script we pass to sbatch, check for curl and wget -- if one exists, use it to do a "create log" API call using the contents of slurm-*.out
  • Use a bash trick to do an HTTP POST? http://unix.stackexchange.com/questions/83926/how-to-download-a-file-using-just-bash-and-nothing-else-no-curl-wget-perl-et
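
The srun and curl/wget ideas above could be sketched roughly as follows. This is a minimal shell sketch, not tested against a real cluster: slurm_out_fetch_cmd and pick_http_client are hypothetical helper names, the srun flags are the ones quoted above, and the srun command is only printed, not executed. Whether the output file is actually readable this way still depends on cluster configuration.

```shell
# Hypothetical helper: print the srun command that would fetch a job's
# output file. Usage: slurm_out_fetch_cmd JOBID NUM_NODES [NODE]
slurm_out_fetch_cmd() {
    jobid=$1
    num_nodes=$2
    node=${3:-}
    if [ -n "$node" ]; then
        # We know which compute node tried to run the job: ask only that node.
        echo "srun --immediate --share --nodelist=$node cat slurm-$jobid.out"
    else
        # Node unknown: run cat on all nodes and see which one has the file.
        echo "srun --immediate --share -N$num_nodes cat slurm-$jobid.out"
    fi
}

# Hypothetical helper for the script passed to sbatch: report which HTTP
# client, if any, is available for a "create log" API call.
pick_http_client() {
    if command -v curl >/dev/null 2>&1; then
        echo curl
    elif command -v wget >/dev/null 2>&1; then
        echo wget
    else
        # Neither tool is installed; a caller would need another fallback.
        echo none
    fi
}
```

For example, slurm_out_fetch_cmd 12345 4 compute3 prints the single-node variant, and omitting the third argument prints the all-nodes variant.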

#3 Updated by Tom Morris 28 days ago

  • Target version changed from 2017-03-15 sprint to Arvados Future Sprints

#4 Updated by Peter Amstutz 28 days ago

We can also try running "srun" with the same flags, since it is more likely to return useful errors from the subprocess.
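
A rough sketch of how the caller might collect those errors, assuming it shells out to run the command. run_and_capture is a hypothetical helper; in practice the command would be srun with whatever flags are normally given to sbatch, which are not listed in this ticket.

```shell
# Hypothetical helper: run a command, discard its stdout, print its stderr
# on our stdout, and return the command's own exit status.
run_and_capture() {
    # Order matters: stderr is pointed at the current stdout first,
    # then stdout is sent to /dev/null.
    "$@" 2>&1 >/dev/null
}
```

A caller could then do err=$(run_and_capture srun ...) and inspect both $err and the exit status before reporting them to the user.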
