Bug #10729


[Crunch2] Propagate error messages if sbatch command succeeds but crunch-run can't run (or can't log to the Arvados API)

Added by Tom Clegg over 6 years ago. Updated 3 months ago.

Currently the only way to see what happened in these cases is by looking at slurm-XXXXX.out on the compute nodes, where XXXXX is the slurm job number. This is inconvenient or impossible for sysadmins, and impossible for regular users.

Example scenario: crunch-run is not installed on the compute node.

Related issues

Related to Arvados - Bug #10700: [Crunch2] crunch-dispatch-slurm pileup (Resolved, Tom Clegg, 01/27/2017)

Related to Arvados - Bug #11148: [Crunch2] Propagate dispatch error messages (e.g., sbatch fails) to user via logs/websocket (New, 02/21/2017)

Actions #1

Updated by Tom Morris over 6 years ago

  • Target version set to 2017-03-15 sprint
Actions #2

Updated by Tom Clegg over 6 years ago

Thoughts about how to retrieve the error messages:
  • If the cluster is set up with shared home directories (which is common in slurm setups) then the messages might appear in $HOME/slurm-{jobid}.out
  • If we know which compute node tried to run the job, we might be able to run something like "srun --immediate --share --nodelist=computeX cat slurm-{jobid}.out"
    • If we don't know which node to look on, we might be able to look on all nodes using "srun --immediate --share -N{number_of_nodes} cat slurm-{jobid}.out".
    • Either way, success depends on cluster configuration. "The default shared behavior depends on system configuration and the partition's Shared option takes precedence over the job's option."
  • In the script we pass to sbatch, check for curl and wget -- if one exists, use it to do a "create log" API call using the contents of slurm-*.out
  • Use a bash trick to do an HTTP POST?
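
A rough sketch of the first three retrieval ideas above, written in Python purely for illustration; it is not how crunch-dispatch-slurm works. The function names, the 30-second timeout, and the assumption that slurm-{jobid}.out lands in the submitting user's home directory are all hypothetical. The srun flags are copied from the commands quoted above, and as noted there, whether they work depends on cluster configuration.

```python
import os
import subprocess


def slurm_out_name(job_id):
    """File name used in this thread for a job's output: slurm-{jobid}.out."""
    return "slurm-{}.out".format(job_id)


def read_from_shared_home(job_id, home=None):
    """Return the log text if the file is visible in the home dir, else None.

    Only useful when home directories are shared between the dispatcher
    and the compute nodes (an assumption about the cluster).
    """
    home = home or os.path.expanduser("~")
    path = os.path.join(home, slurm_out_name(job_id))
    try:
        with open(path, "r", errors="replace") as f:
            return f.read()
    except OSError:
        return None


def srun_cat_command(job_id, node=None, node_count=None):
    """Build the srun command line quoted in the comment above.

    With `node`, target that one node; otherwise, with `node_count`,
    ask for that many nodes in the hope of covering the whole cluster.
    """
    cmd = ["srun", "--immediate", "--share"]
    if node is not None:
        cmd.append("--nodelist={}".format(node))
    elif node_count is not None:
        cmd.append("-N{}".format(node_count))
    cmd += ["cat", slurm_out_name(job_id)]
    return cmd


def fetch_slurm_output(job_id, node=None, node_count=None, home=None,
                       run=subprocess.run):
    """Try the shared home directory first, then fall back to srun.

    Returns the captured text, or None if nothing could be retrieved.
    `run` is injectable so the fallback can be exercised without slurm.
    """
    text = read_from_shared_home(job_id, home)
    if text is not None:
        return text
    try:
        result = run(srun_cat_command(job_id, node, node_count),
                     stdout=subprocess.PIPE, stderr=subprocess.DEVNULL,
                     universal_newlines=True, timeout=30)
    except (OSError, subprocess.SubprocessError):
        return None
    # When fanning out to many nodes, cat fails on every node that lacks
    # the file, so a non-zero exit status does not mean stdout is empty.
    return result.stdout or None
```

For example, fetch_slurm_output(12345, node="compute3") would read $HOME/slurm-12345.out if it is visible locally and otherwise try srun on that node. The caller would still need to turn the returned text into a log entry the user can see, which this sketch leaves out.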
Actions #3

Updated by Tom Morris about 6 years ago

  • Target version changed from 2017-03-15 sprint to Arvados Future Sprints
Actions #4

Updated by Peter Amstutz about 6 years ago

We can also try running "srun" with the same flags, since it is more likely to return useful errors from the subprocess.

Actions #5

Updated by Tom Morris about 5 years ago

  • Release deleted (11)
Actions #6

Updated by Ward Vandewege almost 2 years ago

  • Target version deleted (Arvados Future Sprints)
Actions #7

Updated by Peter Amstutz 3 months ago

  • Release set to 60
