Feature #18992

closed

Enable local keepstore on slurm/lsf if cluster config file already exists on compute node

Added by Tom Clegg 3 months ago. Updated 2 months ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Start date: 04/14/2022
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -
Release relationship: Auto

Description

Currently, crunch-run brings up a local keepstore process only if it receives the cluster config on stdin, which happens only under arvados-dispatch-cloud. Under slurm or lsf, local keepstore is not enabled, and no error or warning explains why.

Proposed improvements:
  • If cluster config is not supplied on stdin, try reading it from /etc/arvados/config.yml (or a different path specified on the command line via the CrunchRunArgumentsList config)
  • If local keepstore is enabled in config (LocalKeepBlobBuffersPerVCPU>0) but can't be brought up because the cluster config file does not exist or cannot be read, log a message to that effect, and proceed using the usual keepstores
  • Always log where the config came from (stdin, a path on the file system, or not found, so it won't be used)
  • Explain in config.default.yml comments (and in upgrade notes) that the sysadmin is responsible for deploying the cluster config file to the compute nodes in order to use this feature with slurm or lsf

Files

crunch-run (22.9 MB) crunch-run arvados-server @ fb181ba27fd354e596d2216786ccee9a537bd0a3 Tom Clegg, 04/14/2022 06:18 PM
crunch-run (10.7 MB) crunch-run f6e8752348958e3bb48c7509a4ff78689f2d64c9 (size reduced with upx) Tom Clegg, 04/15/2022 07:35 PM

Subtasks 1 (0 open, 1 closed)

Task #19003: Review 18992-hpc-local-keepstore, Resolved, Peter Amstutz, 04/14/2022

Related issues

Related to Arvados - Feature #16347: crunch-run runs local keepstore, Resolved, Tom Clegg, 10/08/2021

Actions #1

Updated by Tom Clegg 3 months ago

  • Description updated (diff)
Actions #2

Updated by Peter Amstutz 3 months ago

  • Description updated (diff)
Actions #3

Updated by Peter Amstutz 3 months ago

  • Assigned To set to Peter Amstutz
Actions #4

Updated by Peter Amstutz 3 months ago

  • Assigned To changed from Peter Amstutz to Tom Clegg
Actions #5

Updated by Tom Clegg 3 months ago

  • Status changed from New to In Progress
Actions #6

Updated by Tom Clegg 3 months ago

This was a bit awkward because
  • to set up keepstore, we need ctr.RuntimeConstraints.VCPUs
  • to get ctr.RuntimeConstraints.VCPUs, we need an ArvadosClient
  • the ArvadosClient we use for everything else can't be created until after the local keepstore is set up (otherwise ARVADOS_KEEP_SERVICES won't be set correctly and the local keepstore won't get used)

In a-d-c world, we get around this by passing the desired number of buffers (VCPUs × bufsPerVCPU) from the dispatcher on stdin, along with the cluster config.

Rather than update the lsf/slurm dispatchers to do that, it seemed more reasonable to create a separate ArvadosClient just for fetching RuntimeConstraints before the local keepstore is set up. This means we now make two "get container" calls (we re-fetch the whole thing a bit later in crunch-run initialization), which is not strictly optimal, but also not that big a deal, and I didn't want to let this change get too extensive over it.
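
As a rough sketch of that ordering (look up VCPUs with a throwaway client first, then size the local keepstore), assuming a hypothetical vcpuFetcher type standing in for the separate ArvadosClient and a placeholder container UUID:

```go
package main

import (
	"fmt"
	"log"
)

// vcpuFetcher stands in for a throwaway API client used only to look up
// the container's RuntimeConstraints.VCPUs. Hypothetical; not the Arvados SDK.
type vcpuFetcher func(containerUUID string) (int, error)

// localKeepBuffers returns VCPUs × bufsPerVCPU. If local keepstore is
// disabled (bufsPerVCPU <= 0) it returns 0 without making the lookup.
func localKeepBuffers(fetch vcpuFetcher, containerUUID string, bufsPerVCPU int) (int, error) {
	if bufsPerVCPU <= 0 {
		return 0, nil
	}
	vcpus, err := fetch(containerUUID)
	if err != nil {
		return 0, fmt.Errorf("fetching runtime constraints: %w", err)
	}
	return vcpus * bufsPerVCPU, nil
}

func main() {
	// Stub lookup: pretend the container requests 4 VCPUs.
	fetch := func(string) (int, error) { return 4, nil }
	// The UUID below is a placeholder, not a real container.
	n, err := localKeepBuffers(fetch, "zzzzz-dz642-000000000000000", 2)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("would start local keepstore with %d buffers", n)
	// Only after this point would the main client be created, so that it
	// picks up the local keepstore's address.
}
```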

Also fixed the cgroup func so its tests show failure messages when they fail, instead of aborting the whole test suite with a log.Fatal().

Also fixed the -broken-node-hook flag which (afaict) stopped working when we moved crunch-run into an arvados-server subcommand, because it was attached to the default flagset instead of the flagset we actually use to parse args.
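
The general failure mode can be reproduced with the standard flag package. In this sketch, parseHook is a hypothetical function, the help text is illustrative, and a second FlagSet named "default" stands in for the package-level flag.CommandLine so the example can be run repeatedly:

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// parseHook registers -broken-node-hook either on the flag set that is
// actually parsed (buggy == false) or on a different one (buggy == true),
// then parses args. Hypothetical illustration; not crunch-run's code.
func parseHook(args []string, buggy bool) (string, error) {
	defaults := flag.NewFlagSet("default", flag.ContinueOnError) // stands in for flag.CommandLine
	flags := flag.NewFlagSet("crunch-run", flag.ContinueOnError)
	flags.SetOutput(io.Discard) // keep usage/error text off stderr

	target := flags
	if buggy {
		// Bug pattern: the flag is defined on a set that is never parsed.
		target = defaults
	}
	hook := target.String("broken-node-hook", "", "script to run if the node appears broken (illustrative help text)")

	err := flags.Parse(args)
	return *hook, err
}

func main() {
	args := []string{"-broken-node-hook=/bin/true"}

	hook, err := parseHook(args, true)
	fmt.Printf("buggy: hook=%q err=%v\n", hook, err)

	hook, err = parseHook(args, false)
	fmt.Printf("fixed: hook=%q err=%v\n", hook, err)
}
```

In the buggy case, Parse rejects the argument as an undefined flag and the hook stays empty; registering it on the parsed flag set makes the value available.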

18992-hpc-local-keepstore @ fb181ba27fd354e596d2216786ccee9a537bd0a3 -- developer-run-tests: #3050

Actions #7

Updated by Tom Clegg 3 months ago

Actions #8

Updated by Tom Clegg 3 months ago

18992-hpc-local-keepstore @ b45a44c93bcda724d862891eb2eed5666c8fd197 -- developer-run-tests: #3057

With a couple of logging fixes & tests, and updated info in the config default/template file.

Actions #9

Updated by Peter Amstutz 3 months ago

Tom Clegg wrote:

18992-hpc-local-keepstore @ b45a44c93bcda724d862891eb2eed5666c8fd197 -- developer-run-tests: #3057

With a couple of logging fixes & tests, and updated info in the config default/template file.

LGTM

Actions #10

Updated by Tom Clegg 3 months ago

Actions #11

Updated by Tom Clegg 3 months ago

  • % Done changed from 0 to 100
  • Status changed from In Progress to Resolved
Actions #12

Updated by Peter Amstutz 2 months ago

  • Release set to 51
Actions #13

Updated by Ward Vandewege 2 months ago
