Feature #13647

[keepstore] Load cluster configuration file

Added by Tom Clegg about 1 year ago. Updated about 18 hours ago.

Status:
In Progress
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
07/11/2019
Due date:
% Done:

50%

Estimated time:
(Total: 0.00 h)
Story points:
2.0

Subtasks

Task #15440: Review 13647-keepstore-configIn ProgressTom Clegg

Task #15462: Review 13647-load-old-configResolvedPeter Amstutz


Related issues

Related to Arvados - Story #13648: [Epic] Use one cluster configuration file for all componentsNew

Related to Arvados - Story #15575: [API] [keepstore] Sync config file with keep_services tableNew

Related to Arvados - Story #15641: [keep-balance] [SDKs] rendezvous by volume UUID instead of server UUIDNew

Associated revisions

Revision 7f2569f2
Added by Tom Clegg 2 months ago

Merge branch '13647-load-old-config'

refs #13647

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

History

#1 Updated by Tom Clegg about 1 year ago

  • Related to Story #13648: [Epic] Use one cluster configuration file for all components added

#2 Updated by Tom Clegg about 1 year ago

  • Tracker changed from Story to Feature

#3 Updated by Tom Morris 9 months ago

  • Target version set to To Be Groomed

#4 Updated by Tom Morris 7 months ago

  • Target version changed from To Be Groomed to Arvados Future Sprints
  • Story points set to 2.0

#5 Updated by Tom Morris 3 months ago

  • Target version changed from Arvados Future Sprints to 2019-07-03 Sprint

#6 Updated by Tom Morris 3 months ago

  • Target version changed from 2019-07-03 Sprint to Arvados Future Sprints

#7 Updated by Tom Clegg 3 months ago

  • Target version changed from Arvados Future Sprints to 2019-07-17 Sprint
  • Assigned To set to Tom Clegg

#8 Updated by Tom Clegg 2 months ago

13647-load-old-config @ 0c894574ca46b77209127a4908844c2e0e734cea -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1384/

rearranges some things:
  • new Loader type, with SetupFlags() method used by all commands, so these will all use the overrides from the specified keepstore config
    • "arvados-server config-check -legacy-keepstore-config /path/to/keepstore.yml"
    • "arvados-server controller -legacy-keepstore-config /path/to/keepstore.yml"
  • change "version" command so "arvados-server controller -version" and "arvados-controller -version" work

The minimal example (which we should be able to expand to other components + the rest of the keepstore configs after merging this) loads /etc/arvados/keepstore/keepstore.yml (or the specified location) if it exists, and uses those values to override the cluster config. The "apply keepstore config" func doesn't need to emit warnings about overrides, but it should use ldr.Logger to warn about configs that are no longer supported, or can't be automatically migrated.

I think we can address auto-detection for ambiguous -config args in components that used to have their own -config format (e.g., "keepstore -config foo.yml") like this:
  • first try to load foo.yml as a cluster config file (the way source:lib/service/cmd.go does it)
  • if err == config.ErrNoClustersDefined then
    • log a warning that we're guessing the config is a legacy keepstore config
    • use flagset.Visit() to make a new args slice, changing "-config" to "-legacy-keepstore-config" and leaving everything else the same
    • replace the flagset and loader with new ones, and redo flag parsing & config loading using the munged args

#9 Updated by Peter Amstutz 2 months ago

While you're tinkering with this, could you add an ARVADOS_CONFIG environment variable to control where it looks for config.yml? This would be helpful for testing.

#11 Updated by Peter Amstutz 2 months ago

config/cmd.go checkCommand.RunCommand():

    loader.SkipDeprecated = true
    withoutDepr, err := loader.Load()
    if err != nil {
        return 1
    }
    loader.SkipDeprecated = false
    withDepr, err := loader.Load()
    if err != nil {
        return 1
    }

It isn't totally obvious what this does. Needs some comments. What does "SkipDeprecated" mean? Suppress warnings, don't load them at all, or something else?

service/cmd.go command.RunCommand

    loader := config.NewLoader(stdin, log)
    loader.SetupFlags(flags)

NewLoader() calls ldr.SetupFlags() a NewFlagSet() that seems like a throwaway object? What's the purpose? Should NewLoader() take flags as a parameter?

I think we can address auto-detection for ambiguous -config args in components that used to have their own -config format (e.g., "keepstore -config foo.yml") like this:
  • first try to load foo.yml as a cluster config file (the way source:lib/service/cmd.go does it)
  • if err == config.ErrNoClustersDefined then
    • log a warning that we're guessing the config is a legacy keepstore config
    • use flagset.Visit() to make a new args slice, changing "-config" to "-legacy-keepstore-config" and leaving everything else the same
    • replace the flagset and loader with new ones, and redo flag parsing & config loading using the munged args

Seems reasonable. Could this try-and-try-again behavior be done in a generic way instead of being duplicated across every component? It really just needs to know what the fallback flag is, right? Maybe even an option to NewLoader() and put the behavior in Load() ?

So if I'm following along, what I should be doing for crunch-dispatch-slurm is:

  • Update crunch-dispatch-slurm to use the main config
  • Implement loadOldCrunchDispatchSlurmConfig() and add it in the right place
  • Add -legacy-crunch-dispatch-slurm-config option
  • In the main method, create a Loader and call loader.Load() to get the config object

Anything else?

#12 Updated by Tom Clegg 2 months ago

Peter Amstutz wrote:

config/cmd.go checkCommand.RunCommand():
It isn't totally obvious what this does. Needs some comments. What does "SkipDeprecated" mean? Suppress warnings, don't load them at all, or something else?

It means don't load them at all.

Added comments to the SkipDeprecated field and the "load twice" code block.

NewLoader() calls ldr.SetupFlags() a NewFlagSet() that seems like a throwaway object? What's the purpose? Should NewLoader() take flags as a parameter?

This is how the config library gets to define the config-related cli flags, defaults, etc. Two lines later, flags.Parse(args) -- now that the loader has registered its CLI flags, parse the arguments we've been given. So if we are called as "arvados-server config-dump -legacy-keepstore-config /foo" then the loader's KeepstorePath changes to "/foo" when we call flags.Parse(args).

Seems reasonable. Could this try-and-try-again behavior be done in a generic way instead of being duplicated across every component? It really just needs to know what the fallback flag is, right? Maybe even an option to NewLoader() and put the behavior in Load() ?

Yes, this should surely be in the config package, and the component would just do this

args, err := loader.MungeLegacyConfigArgs(args, "-legacy-keepstore-config")

It might be annoying to make this work with stuff like "-config -". I propose we just don't support loading old configs that way, i.e., don't try to munge anything if the -config arg isn't a [symlink to a] regular file.

  • Update crunch-dispatch-slurm to use the main config
  • Implement loadOldCrunchDispatchSlurmConfig() and add it in the right place
  • Add -legacy-crunch-dispatch-slurm-config option
  • In the main method, create a Loader and call loader.Load() to get the config object

Sounds right.

#13 Updated by Tom Clegg 2 months ago

13647-load-old-config @ 3310da050bd763055a67b2395f122ab5c01cbdf9 -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1389/
  • adds MungeLegacyConfigArgs method to loader

#14 Updated by Peter Amstutz 2 months ago

Tom Clegg wrote:

13647-load-old-config @ 3310da050bd763055a67b2395f122ab5c01cbdf9 -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1389/
  • adds MungeLegacyConfigArgs method to loader

This LGTM.

#15 Updated by Peter Amstutz 2 months ago

Tom Clegg wrote:

Peter Amstutz wrote:
NewLoader() calls ldr.SetupFlags() a NewFlagSet() that seems like a throwaway object? What's the purpose? Should NewLoader() take flags as a parameter?

This is how the config library gets to define the config-related cli flags, defaults, etc. Two lines later, flags.Parse(args) -- now that the loader has registered its CLI flags, parse the arguments we've been given. So if we are called as "arvados-server config-dump -legacy-keepstore-config /foo" then the loader's KeepstorePath changes to "/foo" when we call flags.Parse(args).

func NewLoader(stdin io.Reader, logger logrus.FieldLogger) *Loader {
    ldr := &Loader{Stdin: stdin, Logger: logger}
    ldr.SetupFlags(flag.NewFlagSet("", flag.ContinueOnError)) // what is this line for ???
    ldr.Path = "-" 
    return ldr
}
func (c *command) RunCommand(prog string, args []string, stdin io.Reader, stdout, stderr io.Writer) int {
        ...

    loader := config.NewLoader(stdin, log)
    loader.SetupFlags(flags)
    versionFlag := flags.Bool("version", false, "Write version information to stdout and exit 0")

    err = flags.Parse(args)

        ...
}

I still don't understand how the flag.NewFlagSet() that is passed to ldr.SetupFlags() in NewLoader() ever gets used for anything? The RunCommmand() method calls SetupFlags() on a different flag set that is actually used for Parse().

#16 Updated by Tom Clegg 2 months ago

I still don't understand how the flag.NewFlagSet() that is passed to ldr.SetupFlags() in NewLoader() ever gets used for anything? The RunCommmand() method calls SetupFlags() on a different flag set that is actually used for Parse().

Ooooh, sorry. I thought you were asking about the flagset in RunCommand.

The flagset in NewLoader() is indeed a throwaway. The only reason to call SetupFlags() there is to get side effect of setting the default values. I've added a comment to explain that.

The next line (ldr.Path = "-") is a bug left over from testing, though -- I've deleted that, and fixed the tests to do that themselves.

13647-load-old-config @ e723b02bc8e5382575f27b0ce803f775adf6d479 -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1390/

#17 Updated by Peter Amstutz 2 months ago

Tom Clegg wrote:

I still don't understand how the flag.NewFlagSet() that is passed to ldr.SetupFlags() in NewLoader() ever gets used for anything? The RunCommmand() method calls SetupFlags() on a different flag set that is actually used for Parse().

Ooooh, sorry. I thought you were asking about the flagset in RunCommand.

The flagset in NewLoader() is indeed a throwaway. The only reason to call SetupFlags() there is to get side effect of setting the default values. I've added a comment to explain that.

The next line (ldr.Path = "-") is a bug left over from testing, though -- I've deleted that, and fixed the tests to do that themselves.

13647-load-old-config @ e723b02bc8e5382575f27b0ce803f775adf6d479 -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1390/

Great. This LGTM.

#18 Updated by Tom Morris 2 months ago

  • Target version changed from 2019-07-17 Sprint to 2019-07-31 Sprint

#19 Updated by Peter Amstutz 2 months ago

  • Status changed from New to In Progress

#20 Updated by Tom Clegg about 2 months ago

  • Target version changed from 2019-07-31 Sprint to 2019-08-14 Sprint

#21 Updated by Peter Amstutz about 2 months ago

Have you considered how this will synchronize with the keep_services and/or keep_disks tables. Nico has reported that one of the biggest hassles in setting up a cluster is having to set up the API server, then configure the keep servers, and then go back and update the keep_services table. Could the API server keep_services endpoint be changed to return values from the cluster config and eliminate the keep_services table entirely?

#22 Updated by Tom Clegg about 2 months ago

Yes, in the future clients will just use the Services and Volumes sections of the config and (if AccessViaHosts is in play) rendezvous hash on the volume ID rather than the server ID. Presumably we'll provide a shim so old clients that query /arvados/v1/keep_services still work. The present issue is just about moving keepstore config from keepstore.yml to config.yml though, so I don't think it will change the keep_services behavior at all -- although config migration will need to query keep_services in an attempt to generate new volume UUIDs that match the rendezvous order of the existing keep_service UUIDs.

#23 Updated by Tom Clegg about 1 month ago

  • Target version changed from 2019-08-14 Sprint to 2019-08-28 Sprint

#24 Updated by Peter Amstutz 28 days ago

  • Related to Story #15575: [API] [keepstore] Sync config file with keep_services table added

#25 Updated by Tom Morris 21 days ago

  • Target version changed from 2019-08-28 Sprint to 2019-09-11 Sprint

#26 Updated by Tom Clegg 9 days ago

Now that I'm looking at the docs, explaining to humans how to make volume UUIDs match up with keep_services entries might be harder than writing controller/railsAPI code to answer keep_services queries from cluster config if the keep_services table hasn't been populated... InternalURLs entries just need a Rendezvous field that we can return as the unique part of the UUIDs so rendezvous order is preserved. Migrating the tests (which expect to bring keep services up & down without restarting RailsAPI) might be painful, though.

#27 Updated by Tom Clegg 7 days ago

so far: 13647-keepstore-config @ b144996ab73b15233f027713faf3114f34985e72 -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1518/

surprises:
  • If the keep_services table is empty, RailsAPI invents contents based on the config file. (It doesn't write to the DB table, though -- it makes a literal table and queries from that. This lets us keep all the filtering stuff, while avoiding problems with multiple RailsAPI fighting over the database.)
todo:
  • Mention keepstore on the config migration doc page (same as keepproxy, plus "do it on each keepstore node", plus "delete everything in your keep_services table after migrating"
  • Mention "can't use CLI args any more" in upgrade notes (they've been deprecated for a long time)
  • Update S3 and Azure doc pages
  • arvbox

#28 Updated by Tom Clegg 7 days ago

  • Target version changed from 2019-09-11 Sprint to 2019-09-25 Sprint

#29 Updated by Tom Clegg 7 days ago

13647-keepstore-config @ 08d2f48c3cf3eac9d71261172677c54f03669fe5

with updated S3 and Azure install docs

#30 Updated by Lucas Di Pentima 5 days ago

These are my comments/questions as a first review pass. Maybe some questions reflect misunderstandings of Keep internals from my part, but I did re-read the available documentation so I think is a good opportunity to enhance it.

  • File lib/config/config.default.yml
    • MaxKeepBlockBuffers’s current placement (inside .API) doesn’t seem ideal, wouldn’t be better to place it on .Collections with the other Blob* settings?
      • Related (and maybe silly) question: What’s the difference between “data block” and “blob”? On our documentation/config items seem to use both terms for the same concept. Why not using MaxKeepBlobBuffers or BlockTrashLifetime, for example?
    • Related: BlobSigning’s documentation on lines 316-324 seem to refer to the opposite to what this setting means. This wasn’t written in this branch but I think it’s a good opportunity for a fix.
    • At L379 there’s a mention about a keepstore -blob-signature-ttl argument that’s now deprecated.
    • Maybe the .Volumes' different sections could be documented a little more like other parts of the default config file.
  • File lib/config/deprecated_keepstore.go
    • Readability/Code style question: On lines 160-171 and similar cases, wouldn’t be clearer (by having less code) to not check for equality on those settings and assigning whatever not-nil value the old config has?
    • Regarding loadOldKeepstoreConfig(), I struggled a bit to follow the different code paths there. Could it be splitted up/simplified for readablity?
  • At f04693da1811e670d4cbb981debeecf14d79137c there’re some controller/dispatchcloud prometheus updates, are those related to this branch?
  • Nit: KeepServiceController.from_config_or_db() may be called .from_db_or_config() to respect precedence? Also, why not override KeepService.all() method?
  • Should the RailsAPI suggest the site admin to migrate the KeepServices to Volumes config when seeing records on that table? Maybe also warn the admin that any Volume config will be ignored by old clients while migration is not complete?
  • Documentation
    • Filesystem storage config section
      • AccessViaHost purpose is not 100% clear to me. Are we exposing all volumes of any keepstore to the rest of the cluster? (it seems so by looking at export.go) The round-robin mentioned at the first part makes me think that keepstore hosts may include other keepstore’s volumes on that decision, is that true?
      • When not using AccessViaHost, does that mean that the Volume is used by ALL keepstores? If yes, maybe it should be made more explicit.
    • S3 storage config section
      • The one-bucket-many-keepstores case sounds similar to the one-nas-many-keepstores case on the filesystem section, but config between them look inconsistent in the sense that on the NAS case AccessViaHosts is not needed, should we make them consistent with the S3 case? Further reading on the S3 config comments, seems like the filesystem case would also support AccessViaHosts with ReadOnly? If so, maybe it should be documented.
    • Azure config section:
      • Preexisting typo detected: “configued"
    • Maybe the Rendezvous key should be explained on the default config file as it also becomes part of the documentation.
    • I think a Volume config migration guide would be useful.

#31 Updated by Tom Clegg 2 days ago

  • MaxKeepBlockBuffers’s current placement (inside .API) doesn’t seem ideal, wouldn’t be better to place it on .Collections with the other Blob* settings?

Not sure what to do about this. API.* is where the other server process/resource tuning things are, in which sense this fits although "API" might not be the greatest name. The Collections.* section is more about functional aspects of storage.

  • Related (and maybe silly) question: What’s the difference between “data block” and “blob”? On our documentation/config items seem to use both terms for the same concept. Why not using MaxKeepBlobBuffers or BlockTrashLifetime, for example?

Changed to "blob" since that seems to be the majority.

  • Related: BlobSigning’s documentation on lines 316-324 seem to refer to the opposite to what this setting means. This wasn’t written in this branch but I think it’s a good opportunity for a fix.

Indeed. Fixed.

  • At L379 there’s a mention about a keepstore -blob-signature-ttl argument that’s now deprecated.

Removed.

  • Maybe the .Volumes' different sections could be documented a little more like other parts of the default config file.

Yes... How about just linking to the s3/azure/fs install doc pages? I don't want to duplicate, and I think the doc pages benefit from having multiple examples, which might be awkward to add to the defaults file.

  • Readability/Code style question: On lines 160-171 and similar cases, wouldn’t be clearer (by having less code) to not check for equality on those settings and assigning whatever not-nil value the old config has?

Removed those checks. I think they made sense when this func was responsible for logging "using a different XYZ from legacy config", but that's not the case any more.

  • Regarding loadOldKeepstoreConfig(), I struggled a bit to follow the different code paths there. Could it be splitted up/simplified for readablity?

Indeed. Split the volume migration into 2 separate funcs.

I moved the prometheus setup and TLS into lib/service in order to preserve keepstore's TLS capability without adding more generic http server stuff in the keepstore package. controller/dispatchcloud had to catch up to the new "service handler" signature.

  • Nit: KeepServiceController.from_config_or_db() may be called .from_db_or_config() to respect precedence? Also, why not override KeepService.all() method?

Yeah, good point. Done. (TIL at the class level "from(sql)" is one of many aliases for "all()" so there's an "unscoped" in there to avoid infinite recursion when all() calls from().)

  • Should the RailsAPI suggest the site admin to migrate the KeepServices to Volumes config when seeing records on that table? Maybe also warn the admin that any Volume config will be ignored by old clients while migration is not complete?

Hm... Ideally "config-check" should be the one to report this, but it doesn't necessarily have access to the DB, and RailsAPI isn't necessarily running. Maybe it's next best if keepstore reports it, since keepstore already expects RailsAPI to be up when checking config?

13647-keepstore-config @ 76ee36c49088274ab4e8465c0f4b20878d66c706 -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1531/

(haven't dealt with the doc comments yet)

#32 Updated by Tom Clegg 1 day ago

  • AccessViaHost purpose is not 100% clear to me. Are we exposing all volumes of any keepstore to the rest of the cluster? (it seems so by looking at export.go)

I added comments on AccessViaHosts in the default config.

Yes. The idea is to use rendezvous hashing for volumes rather than gateway servers. Ideally clients should choose a volume using rendezvous hashing, then choose randomly from the gateways that have the needed (read or write) access to that volume. This means clients need to know a) the volume UUIDs in order to do rendezvous ordering, and b) the server-to-volume mapping, so they can connect to the appropriate servers.

The round-robin mentioned at the first part makes me think that keepstore hosts may include other keepstore’s volumes on that decision, is that true?

I suppose, if you define "other keepstore’s volumes" to mean "volumes that are also used by another keepstore".

Omitting AccessViaHosts -- or listing multiple servers there -- is equivalent to a legacy config setup where the same volume entry appears in multiple hosts' /etc/arvados/keepstore/keepstore.yml file.

  • When not using AccessViaHost, does that mean that the Volume is used by ALL keepstores? If yes, maybe it should be made more explicit.

Yes. I've reiterated this in the added AccessViaHosts comments. (Are there other places it should be mentioned?)

  • The one-bucket-many-keepstores case sounds similar to the one-nas-many-keepstores case on the filesystem section, but config between them look inconsistent in the sense that on the NAS case AccessViaHosts is not needed, should we make them consistent with the S3 case? Further reading on the S3 config comments, seems like the filesystem case would also support AccessViaHosts with ReadOnly? If so, maybe it should be documented.

Yes. Copied the AccessViaHosts example from s3/azure into fs as well.

  • Preexisting typo detected: “configued"

Fixed. (grep found 2 of these, and a bonus Arvado.)

  • Maybe the Rendezvous key should be explained on the default config file as it also becomes part of the documentation.

Done.

  • I think a Volume config migration guide would be useful.

Yes.

13647-keepstore-config @ baa2bf80cc078868191494ccb1631c426f4e0496 -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1532/

Todo:
  • Mention keepstore on the config migration doc page (same as keepproxy, plus "do it on each keepstore node", plus "delete everything in your keep_services table after migrating"
  • Prompt "delete everything in your keep_services table" when appropriate (TBD where -- config-check? RailsAPI logs?)
  • Document which server/volume arrangements work best (during the transition from server- to volume-rendezvous, some setups might not rendezvous very well, and impact performance / efficiency)
  • Mention "can't use CLI args any more" in upgrade notes (they've been deprecated for a long time)
  • arvbox

#33 Updated by Tom Clegg 1 day ago

  • Related to Story #15641: [keep-balance] [SDKs] rendezvous by volume UUID instead of server UUID added

#34 Updated by Lucas Di Pentima about 19 hours ago

Some more comments:

  • File lib/config/deprecated_keepstore.go
    • Nit: Line 146 has a punctuation mark that makes go-lint nervous
  • Regarding API.MaxKeepBlobBuffers' placement, a second look at API.* revealed some other keys being used by servers not being RailsAPI, I was confused about its purpose (and agree that API may not be the best name).
  • File services/keepstore/command.go
    • Log messages include old settings' names in lines 152 & 158
  • File services/keepstore/handler_test.go
    • Log message includes old setting's name in line 939
    • There’re mentions of RequireSignatures in lines 330 & 396
  • There’re several mentions of the old BlobSignatureTTL setting name on comments at several services/keepstore/* files
  • File doc/install/install-keepstore.html.textile.liquid line 79 mentions EnableDelete
  • There’re several mentions of the old TrashLifetime setting name on keepstore’s drivers files
  • File services/keepstore/s3_volume.go mentions also TrashCheckInterval in line 715
  • File services/keepstore/s3_volume.go mentions -trash-lifetime and -s3-unsafe-delete in line 116
  • File services/keepstore/volume.go mentions -trash-lifetime in line 173

#35 Updated by Tom Clegg about 18 hours ago

(from chat) another TODO: update/remove the obsolete "keepstore config file" sections of install-keep-balance.html

Also available in: Atom PDF