Feature #18992
Enable local keepstore on slurm/lsf if cluster config file already exists on compute node
Description
Currently crunch-run only brings up a local keepstore process if it receives the cluster config on stdin, which only happens under arvados-dispatch-cloud. When using slurm or lsf, local keepstore is not enabled, and there is no error or warning saying why.
Proposed improvements:
- If cluster config is not supplied on stdin, try reading it from /etc/arvados/config.yml (or a different value specified on the command line via CrunchRunArgumentsList config)
- If local keepstore is enabled in config (LocalKeepBlobBuffersPerVCPU>0) but can't be brought up because cluster config file does not exist or cannot be read, log a message to that effect, and proceed using the usual keepstores
- Always log where it got the config from (stdin, somewhere on the file system, or didn't find it, so won't try to use it)
- Explain in config.default.yml comments (and in upgrade notes) that the sysadmin is responsible for deploying the cluster config file to the compute nodes in order to use this feature with slurm or lsf
Updated by Peter Amstutz almost 3 years ago
- Assigned To changed from Peter Amstutz to Tom Clegg
Updated by Tom Clegg almost 3 years ago
- to set up keepstore, we need ctr.RuntimeConstraints.VCPUs
- to get ctr.RuntimeConstraints.VCPUs, we need an ArvadosClient
- the ArvadosClient we use for everything else can't be created until after the local keepstore is set up (otherwise ARVADOS_KEEP_SERVICES won't be set correctly and the local keepstore won't get used)
In a-d-c world, we get around this by passing the desired number of buffers (VCPUs × bufsPerVCPU) from the dispatcher on stdin, along with the cluster config.
Rather than update the lsf/slurm dispatchers to do that, it seemed more reasonable to create a separate ArvadosClient just for the purpose of fetching RuntimeConstraints so we can do this. This means we do two "get container" calls now (we re-fetch the whole thing a bit later in crunch-run initialization), which is not strictly optimal, but also not that big a deal, and I didn't want to let this change get too extensive over it.
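A rough sketch of that ordering, for readers following along: a throwaway client is used only to look up the VCPU count, and the buffer count is VCPUs × bufsPerVCPU. Everything here is hypothetical (the containerGetter interface, fakeClient, keepBuffers, and the placeholder UUID); the real code uses an ArvadosClient and ctr.RuntimeConstraints.VCPUs.

```go
package main

import "fmt"

// containerGetter is a hypothetical stand-in for the short-lived API client
// that exists only to fetch a container's runtime constraints.
type containerGetter interface {
	GetVCPUs(containerUUID string) (int, error)
}

// fakeClient is an in-memory implementation used for this demo.
type fakeClient struct {
	vcpus map[string]int
}

func (c fakeClient) GetVCPUs(uuid string) (int, error) {
	n, ok := c.vcpus[uuid]
	if !ok {
		return 0, fmt.Errorf("container %q not found", uuid)
	}
	return n, nil
}

// keepBuffers returns the number of block buffers for the local keepstore:
// VCPUs × bufsPerVCPU. A result of 0 means local keepstore is disabled, in
// which case the extra "get container" call is skipped entirely.
func keepBuffers(c containerGetter, uuid string, bufsPerVCPU int) (int, error) {
	if bufsPerVCPU <= 0 {
		return 0, nil
	}
	vcpus, err := c.GetVCPUs(uuid)
	if err != nil {
		return 0, fmt.Errorf("fetching runtime constraints: %w", err)
	}
	return vcpus * bufsPerVCPU, nil
}

func main() {
	// "example-container-uuid" is a placeholder, not a real UUID format.
	client := fakeClient{vcpus: map[string]int{"example-container-uuid": 4}}
	n, err := keepBuffers(client, "example-container-uuid", 2)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("local keepstore buffers:", n) // prints "local keepstore buffers: 8"
}
```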
Also fixed the cgroup func so its tests show failure messages when they fail, instead of aborting the whole test suite with a log.Fatal().
Also fixed the -broken-node-hook flag, which (afaict) stopped working when we moved crunch-run into an arvados-server subcommand, because it was attached to the default flagset instead of the flagset we actually use to parse args.
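The flagset mix-up is easy to reproduce with the standard flag package. The sketch below is illustrative only: the parse helper and the usage string are made up, and a second FlagSet stands in for the default one (flag.CommandLine) so the demo can run both cases side by side.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// newFlagSet returns a quiet flag set that reports errors instead of exiting.
func newFlagSet(name string) *flag.FlagSet {
	fs := flag.NewFlagSet(name, flag.ContinueOnError)
	fs.SetOutput(io.Discard)
	return fs
}

// parse registers -broken-node-hook on one flag set and parses args with
// another. When the two differ, the parsing flag set has never heard of the
// flag, which is the shape of the bug described above.
func parse(registerOn, parseWith *flag.FlagSet, args []string) (string, error) {
	// Usage string is a placeholder, not the real help text.
	hook := registerOn.String("broken-node-hook", "", "hypothetical usage text")
	err := parseWith.Parse(args)
	return *hook, err
}

func main() {
	args := []string{"-broken-node-hook", "/bin/true"}

	// Buggy shape: registered on a stand-in for the default flag set,
	// parsed with the subcommand's own flag set.
	_, err := parse(newFlagSet("default"), newFlagSet("crunch-run"), args)
	fmt.Println("mismatched flag sets, error:", err)

	// Fixed shape: register and parse on the same flag set.
	fs := newFlagSet("crunch-run")
	hook, err := parse(fs, fs, args)
	fmt.Println("same flag set, hook:", hook, "error:", err)
}
```

In this sketch the mismatched case fails at parse time; the fixed case yields the hook path as expected.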
18992-hpc-local-keepstore @ fb181ba27fd354e596d2216786ccee9a537bd0a3 -- developer-run-tests: #3050
Updated by Tom Clegg almost 3 years ago
18992-hpc-local-keepstore @ b45a44c93bcda724d862891eb2eed5666c8fd197 -- developer-run-tests: #3057
With a couple of logging fixes & tests, and updated info in the config default/template file.
Updated by Peter Amstutz almost 3 years ago
Tom Clegg wrote:
18992-hpc-local-keepstore @ b45a44c93bcda724d862891eb2eed5666c8fd197 -- developer-run-tests: #3057
With a couple of logging fixes & tests, and updated info in the config default/template file.
LGTM
Updated by Tom Clegg almost 3 years ago
- % Done changed from 0 to 100
- Status changed from In Progress to Resolved
Applied in changeset arvados|f6e8752348958e3bb48c7509a4ff78689f2d64c9.
Updated by Ward Vandewege almost 3 years ago
- Related to Feature #16347: crunch-run runs local keepstore added