Bug #10729
open
[Crunch2] Propagate error messages if sbatch command succeeds but crunch-run can't run (or can't log to the Arvados API)
Added by Tom Clegg about 8 years ago.
Updated 10 months ago.
Release relationship: Auto
Description
Currently the only way to see what happened in these cases is by looking at slurm-XXXXX.out on the compute nodes, where XXXXX is the slurm job number. This is inconvenient or impossible for sysadmins, and impossible for regular users.
Example scenario: crunch-run is not installed on the compute node.
- Target version set to 2017-03-15 sprint
Thoughts about how to retrieve the error messages:
- If the cluster is set up with shared home directories (which is common in slurm setups) then the messages might appear in $HOME/slurm-{jobid}.out
- If we know which compute node tried to run the job, we might be able to run something like "srun --immediate --share --nodelist=computeX cat slurm-{jobid}.out"
- If we don't know which node to look on, we might be able to look on all nodes using "srun --immediate --share -N{number_of_nodes} cat slurm-{jobid}.out".
- Either way, success depends on cluster configuration. "The default shared behavior depends on system configuration and the partition's Shared option takes precedence over the job's option."
- In the script we pass to sbatch, check for curl and wget -- if one exists, use it to do a "create log" API call using the contents of slurm-*.out
- Use a bash trick to do an HTTP POST? http://unix.stackexchange.com/questions/83926/how-to-download-a-file-using-just-bash-and-nothing-else-no-curl-wget-perl-et
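A rough sketch of the curl/wget idea above, for discussion only. The function names (pick_http_client, post_file), the URL and token arguments, and the Authorization header format are all placeholders rather than the actual Arvados "create log" call, and the request body here is just the raw file contents; a real implementation would need to wrap the slurm-*.out text in whatever payload the API expects.

```shell
#!/bin/sh
# Hypothetical helper: report which HTTP client is installed, if any.
pick_http_client() {
    if command -v curl >/dev/null 2>&1; then
        echo curl
    elif command -v wget >/dev/null 2>&1; then
        echo wget
    else
        echo none
    fi
}

# Hypothetical helper: POST the contents of a file to a URL.
# $1 = URL (placeholder), $2 = API token (placeholder), $3 = file to send.
# The header format is an assumption, not taken from the Arvados API docs.
post_file() {
    url=$1
    token=$2
    file=$3
    case "$(pick_http_client)" in
        curl)
            curl -sS -X POST \
                -H "Authorization: Bearer $token" \
                --data-binary "@$file" "$url"
            ;;
        wget)
            wget -q -O - \
                --header="Authorization: Bearer $token" \
                --post-file="$file" "$url"
            ;;
        *)
            echo "neither curl nor wget is available" >&2
            return 1
            ;;
    esac
}
```

In the script passed to sbatch, something like post_file "$LOG_URL" "$API_TOKEN" "slurm-$SLURM_JOB_ID.out" could run after crunch-run fails to start (LOG_URL and API_TOKEN being hypothetical variables). If neither client is installed this returns non-zero and the message still only lands in slurm-XXXXX.out, so it complements the srun approach rather than replacing it.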
- Target version changed from 2017-03-15 sprint to Arvados Future Sprints
We can also try running "srun" with the same flags, since it is more likely to return useful errors from the subprocess.
- Target version deleted (Arvados Future Sprints)
- Target version set to Future