Enable local keepstore on slurm/lsf if cluster config file already exists on compute node
Currently crunch-run only brings up a local keepstore process if it receives the cluster config on stdin, which only happens under arvados-dispatch-cloud. When using slurm or lsf, local keepstore is not enabled, and there is no error/warning saying why.
Proposed improvements:
- If cluster config is not supplied on stdin, try reading it from /etc/arvados/config.yml (or a different value specified on the command line via CrunchRunArgumentsList config)
- If local keepstore is enabled in config (LocalKeepBlobBuffersPerVCPU>0) but can't be brought up because cluster config file does not exist or cannot be read, log a message to that effect, and proceed using the usual keepstores
- Always log where it got the config from (stdin, a path on the file system, or not found, in which case it won't try to use local keepstore)
- Explain in config.default.yml comments (and in upgrade notes) that the sysadmin is responsible for deploying the cluster config file to the compute nodes in order to use this feature with slurm or lsf
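The fallback order proposed above can be sketched roughly as follows. This is only an illustration, not the crunch-run implementation: loadClusterConfig and its signature are hypothetical, and the sketch assumes that empty stdin means "no config supplied".

```go
package main

import (
	"fmt"
	"io"
	"os"
	"strings"
)

// defaultConfigPath is the fallback location named in the proposal.
const defaultConfigPath = "/etc/arvados/config.yml"

// loadClusterConfig is a hypothetical helper. It returns the raw config
// bytes plus a description of where they came from, so the caller can
// always log the source.
func loadClusterConfig(stdin io.Reader, path string) (data []byte, source string, err error) {
	if stdin != nil {
		data, err = io.ReadAll(stdin)
		if err != nil {
			return nil, "", fmt.Errorf("reading config from stdin: %w", err)
		}
		if len(data) > 0 {
			return data, "stdin", nil
		}
	}
	// Nothing on stdin: fall back to the config file on the compute node.
	data, err = os.ReadFile(path)
	if err != nil {
		return nil, "", fmt.Errorf("reading config from %s: %w", path, err)
	}
	return data, path, nil
}

func main() {
	// Simulate a slurm/lsf dispatch, where nothing arrives on stdin.
	data, source, err := loadClusterConfig(strings.NewReader(""), defaultConfigPath)
	if err != nil {
		// Not fatal: skip local keepstore and use the usual keepstores.
		fmt.Fprintf(os.Stderr, "local keepstore disabled: %v\n", err)
		return
	}
	fmt.Fprintf(os.Stderr, "loaded cluster config from %s (%d bytes)\n", source, len(data))
}
```

Returning the source alongside the data keeps the "always log where it got the config from" requirement in one place instead of scattering log calls across each branch.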
Updated by Tom Clegg over 1 year ago
- to set up keepstore, we need ctr.RuntimeConstraints.VCPUs
- to get ctr.RuntimeConstraints.VCPUs, we need an ArvadosClient
- the ArvadosClient we use for everything else can't be created until after the local keepstore is set up (otherwise ARVADOS_KEEP_SERVICES won't be set correctly and the local keepstore won't get used)
Under arvados-dispatch-cloud (a-d-c), we get around this by passing the desired number of buffers (VCPUs × bufsPerVCPU) from the dispatcher on stdin, along with the cluster config.
Rather than update the lsf/slurm dispatchers to do that, it seemed more reasonable to create a separate ArvadosClient just for the purpose of fetching RuntimeConstraints so we can do this. This means we do two "get container" calls now (we re-fetch the whole thing a bit later in crunch-run initialization), which is not strictly optimal, but also not that big a deal, and I didn't want to let this change get too extensive over it.
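A rough sketch of that ordering, with a function value standing in for the throwaway ArvadosClient. The names planLocalKeepstore and vcpuFetcher are hypothetical, and skipping the extra API call when the feature is disabled is an assumption of this sketch, not something the change is known to do.

```go
package main

import (
	"errors"
	"fmt"
)

// vcpuFetcher stands in for the separate, short-lived API client whose only
// job is to look up RuntimeConstraints.VCPUs for a container (hypothetical).
type vcpuFetcher func(containerUUID string) (int, error)

// planLocalKeepstore returns the number of block buffers the local keepstore
// should get (VCPUs × bufsPerVCPU). Zero means local keepstore is not used.
func planLocalKeepstore(fetch vcpuFetcher, containerUUID string, bufsPerVCPU int) (int, error) {
	if bufsPerVCPU <= 0 {
		// Feature disabled in config: no need for the extra "get container" call.
		return 0, nil
	}
	vcpus, err := fetch(containerUUID)
	if err != nil {
		return 0, fmt.Errorf("fetching runtime constraints: %w", err)
	}
	if vcpus <= 0 {
		return 0, errors.New("container reports no VCPUs")
	}
	return vcpus * bufsPerVCPU, nil
}

func main() {
	// Fake fetcher that pretends the container requested 4 VCPUs.
	fake := func(string) (int, error) { return 4, nil }

	buffers, err := planLocalKeepstore(fake, "example-container-uuid", 2)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("buffers:", buffers)
	// Only after local keepstore is up (and ARVADOS_KEEP_SERVICES is set)
	// would the main client be created and the container fetched again.
}
```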
Also fixed the cgroup func so its tests show failure messages when they fail, instead of aborting the whole test suite with a log.Fatal().
Also fixed the -broken-node-hook flag, which (afaict) stopped working when we moved crunch-run into an arvados-server subcommand, because it was attached to the default flagset instead of the flagset we actually use to parse args.
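The flagset mix-up can be reproduced in isolation with the standard flag package. In this sketch, parseBrokenNodeHook is a hypothetical helper, the "default" FlagSet is a stand-in for flag.CommandLine, and the help text is illustrative rather than copied from crunch-run.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// parseBrokenNodeHook parses args with a dedicated FlagSet. When
// registerOnParsedSet is false, the flag is registered on a different
// FlagSet (standing in for flag.CommandLine), which reproduces the bug:
// the set that parses args has never heard of the flag.
func parseBrokenNodeHook(args []string, registerOnParsedSet bool) (string, error) {
	defaultFS := flag.NewFlagSet("default", flag.ContinueOnError)
	cmdFS := flag.NewFlagSet("crunch-run", flag.ContinueOnError)
	cmdFS.SetOutput(io.Discard) // keep usage output out of the demo

	target := defaultFS
	if registerOnParsedSet {
		target = cmdFS
	}
	hook := target.String("broken-node-hook", "", "script to run if the node appears to be broken")

	if err := cmdFS.Parse(args); err != nil {
		return "", err
	}
	return *hook, nil
}

func main() {
	args := []string{"-broken-node-hook", "/bin/true"}

	_, err := parseBrokenNodeHook(args, false)
	fmt.Println("registered on the wrong flagset:", err)

	hook, err := parseBrokenNodeHook(args, true)
	fmt.Println("registered on the parsed flagset:", hook, err)
}
```

In this sketch the mistake surfaces as a parse error; the point is simply that a flag is only recognized by the FlagSet it was registered on.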