Feature #14714

[keep] keep-balance uses cluster config file

Added by Peter Amstutz 8 months ago. Updated 1 day ago.

Status: In Progress
Priority: Normal
Assigned To:
Category: -
Target version:
Start date:
Due date:
% Done: 0%
Estimated time: (Total: 0.00 h)
Story points: 1.0

Description

Should include:

Should not include:
  • Load keep services list from config file instead of arvados/v1/keep_services endpoint (seems unsafe to do this until we can be assured the config is fully migrated; meanwhile, the keep_services endpoint is safe to use throughout the migration)
  • Rendezvous by volume instead of by server (see #15641)

Subtasks

Task #15480: Review (New, Peter Amstutz)


Related issues

Related to Arvados - Story #13648: [Epic] Use one cluster configuration file for all components (New)

Related to Arvados - Feature #9255: [keep] drain mode for a keepstore service (New, 05/23/2016)

History

#1 Updated by Peter Amstutz 8 months ago

  • Related to Story #13648: [Epic] Use one cluster configuration file for all components added

#2 Updated by Peter Amstutz 8 months ago

  • Tracker changed from Bug to Feature

#3 Updated by Lucas Di Pentima 8 months ago

Based on documentation at https://doc.arvados.org/install/install-keep-balance.html

As I believe there's only one instance of keep-balance per cluster, I think it would be appropriate to add its specific configs inside a NodeProfile instead of giving it a separate section, as was done for the dispatchers on #14713

Timers:
  KeepbalanceRunPeriod: 10m
NodeProfiles:
  keep:
    Keepbalance:
      Listen: :9005
      ManagementToken: xyzzy
      CollectionBatchSize: 100000
      CollectionBuffers: 1000
      KeepServiceTypes:
        1: disk

#4 Updated by Tom Morris 7 months ago

  • Target version changed from To Be Groomed to Arvados Future Sprints
  • Story points set to 1.0

#5 Updated by Tom Morris 2 months ago

  • Target version changed from Arvados Future Sprints to 2019-07-31 Sprint

#6 Updated by Lucas Di Pentima 2 months ago

  • Assigned To set to Lucas Di Pentima

#7 Updated by Lucas Di Pentima about 2 months ago

  • Target version changed from 2019-07-31 Sprint to 2019-08-14 Sprint

#8 Updated by Eric Biagiotti about 2 months ago

  • Assigned To changed from Lucas Di Pentima to Eric Biagiotti

#9 Updated by Eric Biagiotti about 1 month ago

  • Status changed from New to In Progress

#10 Updated by Eric Biagiotti about 1 month ago

  • Target version changed from 2019-08-14 Sprint to 2019-08-28 Sprint

#11 Updated by Eric Biagiotti about 1 month ago

  • Status changed from In Progress to New

#12 Updated by Eric Biagiotti 21 days ago

  • Target version changed from 2019-08-28 Sprint to 2019-09-11 Sprint

#13 Updated by Eric Biagiotti 19 days ago

Some questions/comments about keep-balance flags and config options.

Config

  • KeepServiceTypes: In the config wiki KeepServiceTypes is mapped to Volumes. I'm assuming this is meant to map to Volume.Driver types? This seems contingent on keepstore cluster config work.
  • CollectionBatchSize/CollectionBuffers: These are both mapped to API.MaxItemsPerResponse on the wiki, but it seems like we would be removing potentially useful resource usage tweaking. Are we sure we want to simplify this? See keep-balance/usage.go for more info.
  • LostBlockFile: Not on the wiki, but Collections.KeepBalanceLostBlockFile would be a good place unless we want to make a new KeepBalance section in the config.

Flags

I plan on keeping all the flag options since keep-balance can be run once instead of as a service.

  • KeepServiceList: Right now this is only a command line option. Unless we think it's worth specifying a set of keep services to balance in the config, this will stay the same.
  • commit-pulls/commit-trash: These are mapped to Collections.BlobReplicateConcurrency/Collections.BlobTrashConcurrency respectively, but we might want to consider setting these to false by default if keep-balance is run with --once to avoid a one-time op accidentally committing changes. We could also require these to be set explicitly if --once is used.
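The "require these to be set explicitly if --once is used" idea could look roughly like the sketch below. It is only an illustration in Python using argparse, not keep-balance's actual flag handling: the parse_args helper, the "balance-sketch" program name, and the true/false value syntax are all assumptions made for the example. A value of None stands for "not given on the command line, defer to the cluster config".

```python
import argparse


def parse_args(argv):
    """Parse a hypothetical keep-balance-like command line (illustration only)."""
    parser = argparse.ArgumentParser(prog="balance-sketch")
    parser.add_argument("--once", action="store_true",
                        help="run a single balancing pass and exit")
    # default=None lets us tell "flag not given" apart from an explicit choice.
    parser.add_argument("--commit-pulls", choices=["true", "false"], default=None)
    parser.add_argument("--commit-trash", choices=["true", "false"], default=None)
    args = parser.parse_args(argv)

    if args.once:
        # In one-shot mode, refuse to guess: both commit flags must be explicit.
        given = (("--commit-pulls", args.commit_pulls),
                 ("--commit-trash", args.commit_trash))
        missing = [name for name, value in given if value is None]
        if missing:
            # parser.error() prints a usage message and exits with status 2.
            parser.error("--once requires explicit " + ", ".join(missing))

    to_bool = {"true": True, "false": False, None: None}
    return {
        "once": args.once,
        "commit_pulls": to_bool[args.commit_pulls],
        "commit_trash": to_bool[args.commit_trash],
    }
```

With this shape, `parse_args(["--once"])` exits with a usage error, while `parse_args(["--once", "--commit-pulls=false", "--commit-trash=false"])` is accepted as an explicit dry run.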

Docs

There is lots of good info in keep-balance/usage.go. Was planning on putting most of it in the install doc, but maybe a user guide page is more appropriate?

#14 Updated by Eric Biagiotti 19 days ago

  • Status changed from New to In Progress

#15 Updated by Tom Clegg 15 days ago

  • KeepServiceTypes: In the config wiki KeepServiceTypes is mapped to Volumes. I'm assuming this is meant to map to Volume.Driver types? This seems contingent on keepstore cluster config work.

KeepServiceTypes supports filtering by service_type in the keep_services table, typically "disk" or "proxy". KeepServiceList supports using a specified (cached/fake/customized) set of keep_services rows.

I think we still need to support KeepServiceTypes until everyone has migrated their keepstore configs. After that,
  • keepstore/keepproxy server addresses will be listed separately in the Services section, so typical installs won't need to specify KeepServiceTypes.
  • for debugging/special situations, the list of servers can be controlled by using an altered version of the cluster config file.

We should check with ops to see whether there's still a need for specifying a subset or alternate list of services. If so, it should probably be done with per-volume flags (enable pull/trash) rather than per-server.
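For reference, the service_type filtering that KeepServiceTypes provides amounts to something like the following. This is a minimal Python sketch, not the keep-balance code: the filter_services helper and the example rows (including their uuid values) are made up for illustration, and only the service_type field and the "disk"/"proxy" values come from the discussion above.

```python
def filter_services(rows, service_types):
    """Return the rows whose service_type is listed, preserving input order."""
    wanted = set(service_types)
    return [row for row in rows if row.get("service_type") in wanted]


# Hypothetical keep_services rows; real rows carry more fields than this.
example_rows = [
    {"uuid": "example-disk-1", "service_type": "disk"},
    {"uuid": "example-proxy-1", "service_type": "proxy"},
    {"uuid": "example-disk-2", "service_type": "disk"},
]

disk_only = filter_services(example_rows, ["disk"])
```

Here disk_only keeps the two "disk" rows and drops the proxy, which is the behavior a typical install gets from listing only "disk" in KeepServiceTypes.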

  • CollectionBatchSize/CollectionBuffers: These are both mapped to API.MaxItemsPerResponse on the wiki, but it seems like we would be removing potentially useful resource usage tweaking. Are we sure we want to simplify this? See keep-balance/usage.go for more info.

CollectionBatchSize might not be especially useful (keep-balance uses much less memory than apiserver for a given page size anyway). CollectionBuffers is hard to use effectively (anything far enough from 0 to affect performance uses arbitrary amounts of memory, and performance impact is minimal anyway).

That said, yes, at least for now let's just move these to Collections.BalanceCollectionBatch and Collections.BalanceCollectionBuffers.

  • LostBlockFile: Not on the wiki, but Collections.KeepBalanceLostBlockFile would be a good place unless we want to make a new KeepBalance section in the config.

Added Collections.BlobMissingReport to wiki.
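Taken together, the key renames agreed so far could be expressed as a small translation table. The sketch below is in Python purely for illustration; the migrate helper and the example file path are hypothetical, and the LostBlockFile to Collections.BlobMissingReport pairing is inferred from the exchange above rather than stated outright.

```python
# Legacy keep-balance config key -> (cluster config section, key).
LEGACY_TO_CLUSTER = {
    "CollectionBatchSize": ("Collections", "BalanceCollectionBatch"),
    "CollectionBuffers": ("Collections", "BalanceCollectionBuffers"),
    # Inferred from the discussion, not confirmed.
    "LostBlockFile": ("Collections", "BlobMissingReport"),
}


def migrate(legacy):
    """Split a legacy config dict into (cluster_config, unmapped_keys)."""
    cluster = {}
    unmapped = {}
    for key, value in legacy.items():
        target = LEGACY_TO_CLUSTER.get(key)
        if target is None:
            # Keys still under discussion (e.g. KeepServiceTypes) are reported
            # back to the caller instead of being silently dropped.
            unmapped[key] = value
            continue
        section, name = target
        cluster.setdefault(section, {})[name] = value
    return cluster, unmapped


cluster, unmapped = migrate({
    "CollectionBatchSize": 100000,
    "CollectionBuffers": 1000,
    "LostBlockFile": "/tmp/example-lost-blocks.txt",  # hypothetical path
    "KeepServiceTypes": ["disk"],
})
```

Returning the unmapped keys separately keeps the open questions (such as what happens to KeepServiceTypes) visible during a migration.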

I plan on keeping all the flag options since keep-balance can be run once instead of as a service.

  • KeepServiceList: Right now this is only a command line option. Unless we think it's worth specifying a set of keep services to balance in the config, this will stay the same.

Currently KeepServiceList can also be given literally in the keep-balance config file though, right? (see above re keeping/dropping)

  • commit-pulls/commit-trash: These are mapped to Collections.BlobReplicateConcurrency/Collections.BlobTrashConcurrency respectively, but we might want to consider setting these to false by default if keep-balance is run with --once to avoid a one-time op accidentally committing changes. We could also require these to be set explicitly if --once is used.

If we were starting fresh I'd say a "-n" (dry run) flag would be good -- but we're not, and changing the default from "don't commit" to "commit" seems iffy. Perhaps some input from ops?

There is lots of good info in keep-balance/usage.go. Was planning on putting most of it in the install doc, but maybe a user guide page is more appropriate?

Sure, it looks like it can be split between config.default.yml, install doc, and ... a new page on the admin guide?

#16 Updated by Nico César 15 days ago

Tom Clegg wrote:

  • KeepServiceTypes: In the config wiki KeepServiceTypes is mapped to Volumes. I'm assuming this is meant to map to Volume.Driver types? This seems contingent on keepstore cluster config work.

KeepServiceTypes supports filtering by service_type in the keep_services table, typically "disk" or "proxy". KeepServiceList supports using a specified (cached/fake/customized) set of keep_services rows.

I think we still need to support KeepServiceTypes until everyone has migrated their keepstore configs. After that,
  • keepstore/keepproxy server addresses will be listed separately in the Services section, so typical installs won't need to specify KeepServiceTypes.
  • for debugging/special situations, the list of servers can be controlled by using an altered version of the cluster config file.

We should check with ops to see whether there's still a need for specifying a subset or alternate list of services. If so, it should probably be done with per-volume flags (enable pull/trash) rather than per-server.

For most clusters, waiting for the keepstore config migration is a great approach.
Listing keepstore and keepproxy separately sounds good.

Per-volume flags sound great, especially in "we need to migrate a volume" scenarios, like switching volumes to read-only, or in future expansions of the volume draining feature.

A special case that some clusters have: sometimes, especially on-prem, we have a keepstore service running on the compute nodes. How does global configuration affect this? I'm just pointing out a potential problem; maybe I'm overthinking.

That said, yes, at least for now let's just move these to Collections.BalanceCollectionBatch and Collections.BalanceCollectionBuffers.

From the Ops perspective: in the future, I think these configuration knobs should have a recommended value computed at run time from the data available (and keep-balance should also auto-select that value if needed and report it in the logs). This would be especially useful for clusters where we don't have access to the keepstore servers but know that they could use resources more efficiently.

I plan on keeping all the flag options since keep-balance can be run once instead of as a service.

  • KeepServiceList: Right now this is only a command line option. Unless we think it's worth specifying a set of keep services to balance in the config, this will stay the same.

Currently KeepServiceList can also be given literally in the keep-balance config file though, right? (see above re keeping/dropping)

  • commit-pulls/commit-trash: These are mapped to Collections.BlobReplicateConcurrency/Collections.BlobTrashConcurrency respectively, but we might want to consider setting these to false by default if keep-balance is run with --once to avoid a one-time op accidentally committing changes. We could also require these to be set explicitly if --once is used.

If we were starting fresh I'd say a "-n" (dry run) flag would be good -- but we're not, and changing the default from "don't commit" to "commit" seems iffy. Perhaps some input from ops?

Personally, I don't mind changes to the keep-balance flags, as long as we document them well and have a version that shows the old behavior as deprecated.

There is lots of good info in keep-balance/usage.go. Was planning on putting most of it in the install doc, but maybe a user guide page is more appropriate?

Sure, it looks like it can be split between config.default.yml, install doc, and ... a new page on the admin guide?

I like the idea of a new page in the admin guide. I'd also add a section, "Before you begin... think about your storage layer", that explains why you should have several keepstore servers and talks a little about throughput and N-to-M connections. The audience for this section should be sysadmins who have been managing NFS servers, RAIDs, or similar technology, and for whom replication in Arvados is a hard concept to grasp.

#17 Updated by Tom Clegg 14 days ago

  • Related to Feature #9255: [keep] drain mode for a keepstore service added

#18 Updated by Tom Morris 7 days ago

  • Target version changed from 2019-09-11 Sprint to 2019-09-25 Sprint

#19 Updated by Tom Clegg 1 day ago

  • Description updated (diff)
