Bug #10729

[Crunch2] Propagate error messages if sbatch command succeeds but crunch-run can't run (or can't log to the Arvados API)

Added by Tom Clegg about 1 year ago. Updated 12 months ago.


Currently the only way to see what happened in these cases is by looking at slurm-XXXXX.out on the compute nodes, where XXXXX is the slurm job number. This is inconvenient or impossible for sysadmins, and impossible for regular users.

Example scenario: crunch-run is not installed on the compute node.

Related issues

Related to Arvados - Bug #10700: [Crunch2] crunch-dispatch-slurm pileup (Resolved, 2017-01-27)

Related to Arvados - Bug #11148: [Crunch2] Propagate dispatch error messages (e.g., sbatch fails) to user via logs/websocket (New, 2017-02-21)


#1 Updated by Tom Morris about 1 year ago

  • Target version set to 2017-03-15 sprint

#2 Updated by Tom Clegg 12 months ago

Thoughts about how to retrieve the error messages:
  • If the cluster is set up with shared home directories (which is common in slurm setups) then the messages might appear in $HOME/slurm-{jobid}.out
  • If we know which compute node tried to run the job, we might be able to run something like "srun --immediate --share --nodelist=computeX cat slurm-{jobid}.out"
    • If we don't know which node to look on, we might be able to look on all nodes using "srun --immediate --share -N{number_of_nodes} cat slurm-{jobid}.out".
    • Either way, success depends on cluster configuration. "The default shared behavior depends on system configuration and the partition's Shared option takes precedence over the job's option."
  • In the script we pass to sbatch, check for curl and wget -- if one exists, use it to do a "create log" API call using the contents of slurm-*.out
  • Use a bash trick to do an HTTP POST? http://unix.stackexchange.com/questions/83926/how-to-download-a-file-using-just-bash-and-nothing-else-no-curl-wget-perl-et
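
A rough sketch of how the first few options above might be wired together, in Python for illustration only. The function names (slurm_out_path, srun_cat_cmd, pick_http_client) are hypothetical and not part of Arvados. It assumes sbatch wrote to its default slurm-{jobid}.out file name, and it only builds the srun command line quoted above without running it. Whether srun actually returns the file still depends on cluster configuration.

```python
import os
import posixpath
import shutil


def slurm_out_path(job_id, home=None):
    """Where the output might appear if home directories are shared.

    Assumption: the job was submitted from $HOME with sbatch's default
    output file name, slurm-{jobid}.out.
    """
    if home is None:
        home = os.path.expanduser("~")
    return posixpath.join(home, "slurm-%d.out" % job_id)


def srun_cat_cmd(job_id, node=None, num_nodes=None):
    """Build (but do not run) the srun command that cats the output file.

    If the node is known, target it with --nodelist; otherwise ask for
    num_nodes nodes so every node gets a chance to report the file.
    """
    cmd = ["srun", "--immediate", "--share"]
    if node is not None:
        cmd.append("--nodelist=%s" % node)
    elif num_nodes is not None:
        cmd.append("-N%d" % num_nodes)
    else:
        raise ValueError("need either node or num_nodes")
    return cmd + ["cat", "slurm-%d.out" % job_id]


def pick_http_client(which=shutil.which):
    """Return "curl" or "wget", whichever is found first, else None.

    The `which` parameter is injectable only to make this easy to test.
    """
    for tool in ("curl", "wget"):
        if which(tool):
            return tool
    return None
```

For example, srun_cat_cmd(12345, node="compute3") gives the argument list for the single-node case, and srun_cat_cmd(12345, num_nodes=4) gives the look-on-all-nodes variant. Returning a list rather than a shell string avoids quoting problems if the result is later handed to a subprocess call.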

#3 Updated by Tom Morris 12 months ago

  • Target version changed from 2017-03-15 sprint to Arvados Future Sprints

#4 Updated by Peter Amstutz 12 months ago

We can also try running "srun" with the same flags, since it is more likely to return useful errors from the subprocess.
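
To illustrate the general idea of getting errors back from a foreground subprocess (this is not the actual dispatcher code, and run_and_collect_errors is a hypothetical helper), here is a minimal Python sketch. It runs an arbitrary command, captures stderr, and also reports the case where the program cannot be started at all, which is the "crunch-run is not installed" scenario from the description.

```python
import subprocess


def run_and_collect_errors(argv, timeout=60):
    """Run argv in the foreground and return (exit_code, error_text).

    exit_code is None if the command could not be started or timed out.
    """
    try:
        proc = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            timeout=timeout,
        )
    except OSError as err:
        # Covers a missing or non-executable program.
        return None, "cannot start %s: %s" % (argv[0], err)
    except subprocess.TimeoutExpired:
        return None, "%s timed out after %d seconds" % (argv[0], timeout)
    return proc.returncode, proc.stderr.strip()
```

The returned text could then be attached to the container's log so that the user sees it, instead of it staying in a file on the compute node.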
