Project

General

Profile

Actions

Feature #13647

closed

[keepstore] Load cluster configuration file

Added by Tom Clegg almost 6 years ago. Updated over 4 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Story points:
2.0
Release relationship:
Auto

Subtasks 2 (0 open2 closed)

Task #15440: Review 13647-keepstore-configResolvedTom Clegg09/11/2019Actions
Task #15462: Review 13647-load-old-configResolvedPeter Amstutz07/11/2019Actions

Related issues

Related to Arvados - Idea #13648: [Epic] Use one cluster configuration file for all componentsResolvedActions
Related to Arvados - Idea #15575: [API] [keepstore] Sync config file with keep_services tableResolvedActions
Related to Arvados - Idea #15641: [keep-balance] [SDKs] rendezvous by volume UUID instead of server UUIDNewActions
Actions #1

Updated by Tom Clegg almost 6 years ago

  • Related to Idea #13648: [Epic] Use one cluster configuration file for all components added
Actions #2

Updated by Tom Clegg almost 6 years ago

  • Tracker changed from Idea to Feature
Actions #3

Updated by Tom Morris over 5 years ago

  • Target version set to To Be Groomed
Actions #4

Updated by Tom Morris about 5 years ago

  • Target version changed from To Be Groomed to Arvados Future Sprints
  • Story points set to 2.0
Actions #5

Updated by Tom Morris almost 5 years ago

  • Target version changed from Arvados Future Sprints to 2019-07-03 Sprint
Actions #6

Updated by Tom Morris almost 5 years ago

  • Target version changed from 2019-07-03 Sprint to Arvados Future Sprints
Actions #7

Updated by Tom Clegg almost 5 years ago

  • Assigned To set to Tom Clegg
  • Target version changed from Arvados Future Sprints to 2019-07-17 Sprint
Actions #8

Updated by Tom Clegg almost 5 years ago

13647-load-old-config @ 0c894574ca46b77209127a4908844c2e0e734cea -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1384/

rearranges some things:
  • new Loader type, with SetupFlags() method used by all commands, so these will all use the overrides from the specified keepstore config
    • "arvados-server config-check -legacy-keepstore-config /path/to/keepstore.yml"
    • "arvados-server controller -legacy-keepstore-config /path/to/keepstore.yml"
  • change "version" command so "arvados-server controller -version" and "arvados-controller -version" work

The minimal example (which we should be able to expand to other components + the rest of the keepstore configs after merging this) loads /etc/arvados/keepstore/keepstore.yml (or the specified location) if it exists, and uses those values to override the cluster config. The "apply keepstore config" func doesn't need to emit warnings about overrides, but it should use ldr.Logger to warn about configs that are no longer supported, or can't be automatically migrated.

I think we can address auto-detection for ambiguous -config args in components that used to have their own -config format (e.g., "keepstore -config foo.yml") like this:
  • first try to load foo.yml as a cluster config file (the way source:lib/service/cmd.go does it)
  • if err == config.ErrNoClustersDefined then
    • log a warning that we're guessing the config is a legacy keepstore config
    • use flagset.Visit() to make a new args slice, changing "-config" to "-legacy-keepstore-config" and leaving everything else the same
    • replace the flagset and loader with new ones, and redo flag parsing & config loading using the munged args
Actions #9

Updated by Peter Amstutz almost 5 years ago

While you're tinkering with this, could you add an ARVADOS_CONFIG environment variable to control where it looks for config.yml? This would be helpful for testing.

Actions #11

Updated by Peter Amstutz almost 5 years ago

config/cmd.go checkCommand.RunCommand():

    loader.SkipDeprecated = true
    withoutDepr, err := loader.Load()
    if err != nil {
        return 1
    }
    loader.SkipDeprecated = false
    withDepr, err := loader.Load()
    if err != nil {
        return 1
    }

It isn't totally obvious what this does. Needs some comments. What does "SkipDeprecated" mean? Suppress warnings, don't load them at all, or something else?

service/cmd.go command.RunCommand

    loader := config.NewLoader(stdin, log)
    loader.SetupFlags(flags)

NewLoader() calls ldr.SetupFlags() a NewFlagSet() that seems like a throwaway object? What's the purpose? Should NewLoader() take flags as a parameter?

I think we can address auto-detection for ambiguous -config args in components that used to have their own -config format (e.g., "keepstore -config foo.yml") like this:
  • first try to load foo.yml as a cluster config file (the way source:lib/service/cmd.go does it)
  • if err == config.ErrNoClustersDefined then
    • log a warning that we're guessing the config is a legacy keepstore config
    • use flagset.Visit() to make a new args slice, changing "-config" to "-legacy-keepstore-config" and leaving everything else the same
    • replace the flagset and loader with new ones, and redo flag parsing & config loading using the munged args

Seems reasonable. Could this try-and-try-again behavior be done in a generic way instead of being duplicated across every component? It really just needs to know what the fallback flag is, right? Maybe even an option to NewLoader() and put the behavior in Load() ?

So if I'm following along, what I should be doing for crunch-dispatch-slurm is:

  • Update crunch-dispatch-slurm to use the main config
  • Implement loadOldCrunchDispatchSlurmConfig() and add it in the right place
  • Add -legacy-crunch-dispatch-slurm-config option
  • In the main method, create a Loader and call loader.Load() to get the config object

Anything else?

Actions #12

Updated by Tom Clegg almost 5 years ago

Peter Amstutz wrote:

config/cmd.go checkCommand.RunCommand():
It isn't totally obvious what this does. Needs some comments. What does "SkipDeprecated" mean? Suppress warnings, don't load them at all, or something else?

It means don't load them at all.

Added comments to the SkipDeprecated field and the "load twice" code block.

NewLoader() calls ldr.SetupFlags() a NewFlagSet() that seems like a throwaway object? What's the purpose? Should NewLoader() take flags as a parameter?

This is how the config library gets to define the config-related cli flags, defaults, etc. Two lines later, flags.Parse(args) -- now that the loader has registered its CLI flags, parse the arguments we've been given. So if we are called as "arvados-server config-dump -legacy-keepstore-config /foo" then the loader's KeepstorePath changes to "/foo" when we call flags.Parse(args).

Seems reasonable. Could this try-and-try-again behavior be done in a generic way instead of being duplicated across every component? It really just needs to know what the fallback flag is, right? Maybe even an option to NewLoader() and put the behavior in Load() ?

Yes, this should surely be in the config package, and the component would just do this

args, err := loader.MungeLegacyConfigArgs(args, "-legacy-keepstore-config")

It might be annoying to make this work with stuff like "-config -". I propose we just don't support loading old configs that way, i.e., don't try to munge anything if the -config arg isn't a [symlink to a] regular file.

  • Update crunch-dispatch-slurm to use the main config
  • Implement loadOldCrunchDispatchSlurmConfig() and add it in the right place
  • Add -legacy-crunch-dispatch-slurm-config option
  • In the main method, create a Loader and call loader.Load() to get the config object

Sounds right.

Actions #13

Updated by Tom Clegg almost 5 years ago

13647-load-old-config @ 3310da050bd763055a67b2395f122ab5c01cbdf9 -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1389/
  • adds MungeLegacyConfigArgs method to loader
Actions #14

Updated by Peter Amstutz almost 5 years ago

Tom Clegg wrote:

13647-load-old-config @ 3310da050bd763055a67b2395f122ab5c01cbdf9 -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1389/
  • adds MungeLegacyConfigArgs method to loader

This LGTM.

Actions #15

Updated by Peter Amstutz almost 5 years ago

Tom Clegg wrote:

Peter Amstutz wrote:
NewLoader() calls ldr.SetupFlags() a NewFlagSet() that seems like a throwaway object? What's the purpose? Should NewLoader() take flags as a parameter?

This is how the config library gets to define the config-related cli flags, defaults, etc. Two lines later, flags.Parse(args) -- now that the loader has registered its CLI flags, parse the arguments we've been given. So if we are called as "arvados-server config-dump -legacy-keepstore-config /foo" then the loader's KeepstorePath changes to "/foo" when we call flags.Parse(args).

func NewLoader(stdin io.Reader, logger logrus.FieldLogger) *Loader {
    ldr := &Loader{Stdin: stdin, Logger: logger}
    ldr.SetupFlags(flag.NewFlagSet("", flag.ContinueOnError)) // what is this line for ???
    ldr.Path = "-" 
    return ldr
}
func (c *command) RunCommand(prog string, args []string, stdin io.Reader, stdout, stderr io.Writer) int {
        ...

    loader := config.NewLoader(stdin, log)
    loader.SetupFlags(flags)
    versionFlag := flags.Bool("version", false, "Write version information to stdout and exit 0")

    err = flags.Parse(args)

        ...
}

I still don't understand how the flag.NewFlagSet() that is passed to ldr.SetupFlags() in NewLoader() ever gets used for anything? The RunCommmand() method calls SetupFlags() on a different flag set that is actually used for Parse().

Actions #16

Updated by Tom Clegg almost 5 years ago

I still don't understand how the flag.NewFlagSet() that is passed to ldr.SetupFlags() in NewLoader() ever gets used for anything? The RunCommmand() method calls SetupFlags() on a different flag set that is actually used for Parse().

Ooooh, sorry. I thought you were asking about the flagset in RunCommand.

The flagset in NewLoader() is indeed a throwaway. The only reason to call SetupFlags() there is to get side effect of setting the default values. I've added a comment to explain that.

The next line (ldr.Path = "-") is a bug left over from testing, though -- I've deleted that, and fixed the tests to do that themselves.

13647-load-old-config @ e723b02bc8e5382575f27b0ce803f775adf6d479 -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1390/

Actions #17

Updated by Peter Amstutz almost 5 years ago

Tom Clegg wrote:

I still don't understand how the flag.NewFlagSet() that is passed to ldr.SetupFlags() in NewLoader() ever gets used for anything? The RunCommmand() method calls SetupFlags() on a different flag set that is actually used for Parse().

Ooooh, sorry. I thought you were asking about the flagset in RunCommand.

The flagset in NewLoader() is indeed a throwaway. The only reason to call SetupFlags() there is to get side effect of setting the default values. I've added a comment to explain that.

The next line (ldr.Path = "-") is a bug left over from testing, though -- I've deleted that, and fixed the tests to do that themselves.

13647-load-old-config @ e723b02bc8e5382575f27b0ce803f775adf6d479 -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1390/

Great. This LGTM.

Actions #18

Updated by Tom Morris almost 5 years ago

  • Target version changed from 2019-07-17 Sprint to 2019-07-31 Sprint
Actions #19

Updated by Peter Amstutz almost 5 years ago

  • Status changed from New to In Progress
Actions #20

Updated by Tom Clegg over 4 years ago

  • Target version changed from 2019-07-31 Sprint to 2019-08-14 Sprint
Actions #21

Updated by Peter Amstutz over 4 years ago

Have you considered how this will synchronize with the keep_services and/or keep_disks tables. Nico has reported that one of the biggest hassles in setting up a cluster is having to set up the API server, then configure the keep servers, and then go back and update the keep_services table. Could the API server keep_services endpoint be changed to return values from the cluster config and eliminate the keep_services table entirely?

Actions #22

Updated by Tom Clegg over 4 years ago

Yes, in the future clients will just use the Services and Volumes sections of the config and (if AccessViaHosts is in play) rendezvous hash on the volume ID rather than the server ID. Presumably we'll provide a shim so old clients that query /arvados/v1/keep_services still work. The present issue is just about moving keepstore config from keepstore.yml to config.yml though, so I don't think it will change the keep_services behavior at all -- although config migration will need to query keep_services in an attempt to generate new volume UUIDs that match the rendezvous order of the existing keep_service UUIDs.

Actions #23

Updated by Tom Clegg over 4 years ago

  • Target version changed from 2019-08-14 Sprint to 2019-08-28 Sprint
Actions #24

Updated by Peter Amstutz over 4 years ago

  • Related to Idea #15575: [API] [keepstore] Sync config file with keep_services table added
Actions #25

Updated by Tom Morris over 4 years ago

  • Target version changed from 2019-08-28 Sprint to 2019-09-11 Sprint
Actions #26

Updated by Tom Clegg over 4 years ago

Now that I'm looking at the docs, explaining to humans how to make volume UUIDs match up with keep_services entries might be harder than writing controller/railsAPI code to answer keep_services queries from cluster config if the keep_services table hasn't been populated... InternalURLs entries just need a Rendezvous field that we can return as the unique part of the UUIDs so rendezvous order is preserved. Migrating the tests (which expect to bring keep services up & down without restarting RailsAPI) might be painful, though.

Actions #27

Updated by Tom Clegg over 4 years ago

so far: 13647-keepstore-config @ b144996ab73b15233f027713faf3114f34985e72 -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1518/

surprises:
  • If the keep_services table is empty, RailsAPI invents contents based on the config file. (It doesn't write to the DB table, though -- it makes a literal table and queries from that. This lets us keep all the filtering stuff, while avoiding problems with multiple RailsAPI fighting over the database.)
todo:
  • Mention keepstore on the config migration doc page (same as keepproxy, plus "do it on each keepstore node", plus "delete everything in your keep_services table after migrating"
  • Mention "can't use CLI args any more" in upgrade notes (they've been deprecated for a long time)
  • Update S3 and Azure doc pages
  • arvbox
Actions #28

Updated by Tom Clegg over 4 years ago

  • Target version changed from 2019-09-11 Sprint to 2019-09-25 Sprint
Actions #29

Updated by Tom Clegg over 4 years ago

13647-keepstore-config @ 08d2f48c3cf3eac9d71261172677c54f03669fe5

with updated S3 and Azure install docs

Actions #30

Updated by Lucas Di Pentima over 4 years ago

These are my comments/questions as a first review pass. Maybe some questions reflect misunderstandings of Keep internals from my part, but I did re-read the available documentation so I think is a good opportunity to enhance it.

  • File lib/config/config.default.yml
    • MaxKeepBlockBuffers’s current placement (inside .API) doesn’t seem ideal, wouldn’t be better to place it on .Collections with the other Blob* settings?
      • Related (and maybe silly) question: What’s the difference between “data block” and “blob”? On our documentation/config items seem to use both terms for the same concept. Why not using MaxKeepBlobBuffers or BlockTrashLifetime, for example?
    • Related: BlobSigning’s documentation on lines 316-324 seem to refer to the opposite to what this setting means. This wasn’t written in this branch but I think it’s a good opportunity for a fix.
    • At L379 there’s a mention about a keepstore -blob-signature-ttl argument that’s now deprecated.
    • Maybe the .Volumes' different sections could be documented a little more like other parts of the default config file.
  • File lib/config/deprecated_keepstore.go
    • Readability/Code style question: On lines 160-171 and similar cases, wouldn’t be clearer (by having less code) to not check for equality on those settings and assigning whatever not-nil value the old config has?
    • Regarding loadOldKeepstoreConfig(), I struggled a bit to follow the different code paths there. Could it be splitted up/simplified for readablity?
  • At f04693da1811e670d4cbb981debeecf14d79137c there’re some controller/dispatchcloud prometheus updates, are those related to this branch?
  • Nit: KeepServiceController.from_config_or_db() may be called .from_db_or_config() to respect precedence? Also, why not override KeepService.all() method?
  • Should the RailsAPI suggest the site admin to migrate the KeepServices to Volumes config when seeing records on that table? Maybe also warn the admin that any Volume config will be ignored by old clients while migration is not complete?
  • Documentation
    • Filesystem storage config section
      • AccessViaHost purpose is not 100% clear to me. Are we exposing all volumes of any keepstore to the rest of the cluster? (it seems so by looking at export.go) The round-robin mentioned at the first part makes me think that keepstore hosts may include other keepstore’s volumes on that decision, is that true?
      • When not using AccessViaHost, does that mean that the Volume is used by ALL keepstores? If yes, maybe it should be made more explicit.
    • S3 storage config section
      • The one-bucket-many-keepstores case sounds similar to the one-nas-many-keepstores case on the filesystem section, but config between them look inconsistent in the sense that on the NAS case AccessViaHosts is not needed, should we make them consistent with the S3 case? Further reading on the S3 config comments, seems like the filesystem case would also support AccessViaHosts with ReadOnly? If so, maybe it should be documented.
    • Azure config section:
      • Preexisting typo detected: “configued"
    • Maybe the Rendezvous key should be explained on the default config file as it also becomes part of the documentation.
    • I think a Volume config migration guide would be useful.
Actions #31

Updated by Tom Clegg over 4 years ago

  • MaxKeepBlockBuffers’s current placement (inside .API) doesn’t seem ideal, wouldn’t be better to place it on .Collections with the other Blob* settings?

Not sure what to do about this. API.* is where the other server process/resource tuning things are, in which sense this fits although "API" might not be the greatest name. The Collections.* section is more about functional aspects of storage.

  • Related (and maybe silly) question: What’s the difference between “data block” and “blob”? On our documentation/config items seem to use both terms for the same concept. Why not using MaxKeepBlobBuffers or BlockTrashLifetime, for example?

Changed to "blob" since that seems to be the majority.

  • Related: BlobSigning’s documentation on lines 316-324 seem to refer to the opposite to what this setting means. This wasn’t written in this branch but I think it’s a good opportunity for a fix.

Indeed. Fixed.

  • At L379 there’s a mention about a keepstore -blob-signature-ttl argument that’s now deprecated.

Removed.

  • Maybe the .Volumes' different sections could be documented a little more like other parts of the default config file.

Yes... How about just linking to the s3/azure/fs install doc pages? I don't want to duplicate, and I think the doc pages benefit from having multiple examples, which might be awkward to add to the defaults file.

  • Readability/Code style question: On lines 160-171 and similar cases, wouldn’t be clearer (by having less code) to not check for equality on those settings and assigning whatever not-nil value the old config has?

Removed those checks. I think they made sense when this func was responsible for logging "using a different XYZ from legacy config", but that's not the case any more.

  • Regarding loadOldKeepstoreConfig(), I struggled a bit to follow the different code paths there. Could it be splitted up/simplified for readablity?

Indeed. Split the volume migration into 2 separate funcs.

I moved the prometheus setup and TLS into lib/service in order to preserve keepstore's TLS capability without adding more generic http server stuff in the keepstore package. controller/dispatchcloud had to catch up to the new "service handler" signature.

  • Nit: KeepServiceController.from_config_or_db() may be called .from_db_or_config() to respect precedence? Also, why not override KeepService.all() method?

Yeah, good point. Done. (TIL at the class level "from(sql)" is one of many aliases for "all()" so there's an "unscoped" in there to avoid infinite recursion when all() calls from().)

  • Should the RailsAPI suggest the site admin to migrate the KeepServices to Volumes config when seeing records on that table? Maybe also warn the admin that any Volume config will be ignored by old clients while migration is not complete?

Hm... Ideally "config-check" should be the one to report this, but it doesn't necessarily have access to the DB, and RailsAPI isn't necessarily running. Maybe it's next best if keepstore reports it, since keepstore already expects RailsAPI to be up when checking config?

13647-keepstore-config @ 76ee36c49088274ab4e8465c0f4b20878d66c706 -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1531/

(haven't dealt with the doc comments yet)

Actions #32

Updated by Tom Clegg over 4 years ago

  • AccessViaHost purpose is not 100% clear to me. Are we exposing all volumes of any keepstore to the rest of the cluster? (it seems so by looking at export.go)

I added comments on AccessViaHosts in the default config.

Yes. The idea is to use rendezvous hashing for volumes rather than gateway servers. Ideally clients should choose a volume using rendezvous hashing, then choose randomly from the gateways that have the needed (read or write) access to that volume. This means clients need to know a) the volume UUIDs in order to do rendezvous ordering, and b) the server-to-volume mapping, so they can connect to the appropriate servers.

The round-robin mentioned at the first part makes me think that keepstore hosts may include other keepstore’s volumes on that decision, is that true?

I suppose, if you define "other keepstore’s volumes" to mean "volumes that are also used by another keepstore".

Omitting AccessViaHosts -- or listing multiple servers there -- is equivalent to a legacy config setup where the same volume entry appears in multiple hosts' /etc/arvados/keepstore/keepstore.yml file.

  • When not using AccessViaHost, does that mean that the Volume is used by ALL keepstores? If yes, maybe it should be made more explicit.

Yes. I've reiterated this in the added AccessViaHosts comments. (Are there other places it should be mentioned?)

  • The one-bucket-many-keepstores case sounds similar to the one-nas-many-keepstores case on the filesystem section, but config between them look inconsistent in the sense that on the NAS case AccessViaHosts is not needed, should we make them consistent with the S3 case? Further reading on the S3 config comments, seems like the filesystem case would also support AccessViaHosts with ReadOnly? If so, maybe it should be documented.

Yes. Copied the AccessViaHosts example from s3/azure into fs as well.

  • Preexisting typo detected: “configued"

Fixed. (grep found 2 of these, and a bonus Arvado.)

  • Maybe the Rendezvous key should be explained on the default config file as it also becomes part of the documentation.

Done.

  • I think a Volume config migration guide would be useful.

Yes.

13647-keepstore-config @ baa2bf80cc078868191494ccb1631c426f4e0496 -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1532/

Todo:
  • Mention keepstore on the config migration doc page (same as keepproxy, plus "do it on each keepstore node", plus "delete everything in your keep_services table after migrating"
  • Prompt "delete everything in your keep_services table" when appropriate (TBD where -- config-check? RailsAPI logs?)
  • Document which server/volume arrangements work best (during the transition from server- to volume-rendezvous, some setups might not rendezvous very well, and impact performance / efficiency)
  • Mention "can't use CLI args any more" in upgrade notes (they've been deprecated for a long time)
  • arvbox
Actions #33

Updated by Tom Clegg over 4 years ago

  • Related to Idea #15641: [keep-balance] [SDKs] rendezvous by volume UUID instead of server UUID added
Actions #34

Updated by Lucas Di Pentima over 4 years ago

Some more comments:

  • File lib/config/deprecated_keepstore.go
    • Nit: Line 146 has a punctuation mark that makes go-lint nervous
  • Regarding API.MaxKeepBlobBuffers' placement, a second look at API.* revealed some other keys being used by servers not being RailsAPI, I was confused about its purpose (and agree that API may not be the best name).
  • File services/keepstore/command.go
    • Log messages include old settings' names in lines 152 & 158
  • File services/keepstore/handler_test.go
    • Log message includes old setting's name in line 939
    • There’re mentions of RequireSignatures in lines 330 & 396
  • There’re several mentions of the old BlobSignatureTTL setting name on comments at several services/keepstore/* files
  • File doc/install/install-keepstore.html.textile.liquid line 79 mentions EnableDelete
  • There’re several mentions of the old TrashLifetime setting name on keepstore’s drivers files
  • File services/keepstore/s3_volume.go mentions also TrashCheckInterval in line 715
  • File services/keepstore/s3_volume.go mentions -trash-lifetime and -s3-unsafe-delete in line 116
  • File services/keepstore/volume.go mentions -trash-lifetime in line 173
Actions #35

Updated by Tom Clegg over 4 years ago

(from chat) another TODO: update/remove the obsolete "keepstore config file" sections of install-keep-balance.html

Actions #36

Updated by Tom Clegg over 4 years ago

13647-keepstore-config @ 28f250ee0a43e873760d43b39119e5710de75fe8 -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1540/
  • address points in note-34
  • update migration and upgrading-to-master notes
  • update keep-balance install docs (references to keepstore configs)
  • config-check, config-diff, keepstore, etc. detect when entries in keep_services table have been migrated to cluster config, and recommend deleting them
Actions #37

Updated by Lucas Di Pentima over 4 years ago

Updates at 28f250ee0a43e873760d43b39119e5710de75fe8 LGTM.

I've tried running arvbox and came across a couple of issues. It seems that the 'killall' command isn't installed on the docker image, as I read on keepstore's log:

2019-09-20_14:01:59.87422 + ARVADOS_API_TOKEN=3ab3rzow7l5bpt833h2ypyn0sonhbhx111islj1eyjcj3n0in2
2019-09-20_14:01:59.87422 + set +e
 2019-09-20_14:01:59.87422 + read -rd '' keepservice 
2019-09-20_14:01:59.87445 + set -e 
2019-09-20_14:01:59.87447 + test -s /var/lib/arvados/keep0-uuid 
[...]
2019-09-20_14:02:57.39765 + killall -HUP keepproxy 
2019-09-20_14:02:57.40994 /usr/local/lib/arvbox/keep-setup.sh: line 48: killall: command not found

Keepstore seems to run but the arvbox start script doesn’t realize that. There's some unknown IP addr being logged (31144156a2c0 == 65.86.162.192?):

2019-09-20_14:14:04.53995 {"PID":10205,"level":"info","msg":"keepstore dev starting, pid 10205","time":"2019-09-20T14:14:04.539943358Z”}
2019-09-20_14:14:04.54595 2019/09/20 14:14:04 stat "/dev/sda1[/home/lucas/.arvbox/13647/var]": stat /dev/sda1[/home/lucas/.arvbox/13647/var]: no such file or directory; using blank DeviceID for volume [UnixVolume /var/lib/arvados/keep0] 
2019-09-20_14:14:04.54599 {"PID":10205,"level":"info","msg":"started volume x2st0-nyw5e-n7blw3fakc8sf0j ([UnixVolume /var/lib/arvados/keep0]), ReadOnly=false","time":"2019-09-20T14:14:04.545965490Z”} 
2019-09-20_14:14:04.54882 2019/09/20 14:14:04 stat "/dev/sda1[/home/lucas/.arvbox/13647/var]": stat /dev/sda1[/home/lucas/.arvbox/13647/var]: no such file or directory; using blank DeviceID for volume [UnixVolume /var/lib/arvados/keep0] 
2019-09-20_14:14:04.57585 {"Listen":"172.17.0.2:25107","PID":10205,"Service":"keepstore","URL":"http://31144156a2c0:25107","level":"info","msg":"listening","time":"2019-09-20T14:14:04.575779238Z"}
Actions #38

Updated by Tom Clegg over 4 years ago

13647-keepstore-config @ 8cbb55c614cc5942e9ea807bb7ca4df8160d1952 changes "killall keepproxy" to "sv hup /service/keepproxy" (it seems like "killall" has never worked there, since psmisc has never been installed -- maybe it just didn't matter until now?)

Actions #39

Updated by Tom Clegg over 4 years ago

13647-keepstore-config @ 3cec8c10796bb6de9c02c96bbe7dd644c7366e42 -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1542/
  • change "killall keepproxy" to "sv hup .../keepproxy"
  • add keepstore entries to arvbox config (arvbox comes up now)
Actions #40

Updated by Lucas Di Pentima over 4 years ago

Arvbox started OK on my side too. Couldn't upload data from the host (but could do it from within the container):

2019-09-23 11:42:43 arvados.arv_put[10174] INFO: Calculating upload size, this could take some time...
2019-09-23 11:42:43 arvados.arv_put[10174] INFO: Creating new cache file at /home/lucas/.cache/arvados/arv-put/b26f64529f678ed834a4f2d8f247fabe
0M / 23M 0.2% Traceback (most recent call last):
  File "/home/lucas/venv-arvados/bin/arv-put", line 7, in <module>
    main()
  File "/home/lucas/venv-arvados/lib/python3.5/site-packages/arvados/commands/put.py", line 1219, in main
    writer.start(save_collection=not(args.stream or args.raw))
  File "/home/lucas/venv-arvados/lib/python3.5/site-packages/arvados/commands/put.py", line 606, in start
    self._local_collection.manifest_text()
[...]
  File "/home/lucas/venv-arvados/lib/python3.5/site-packages/arvados/keep.py", line 1200, in put
    data_hash, copies, writer_pool.done()), service_errors, label="service")
arvados.errors.KeepWriteError: failed to write 677acd81ebd63b31de90c87872edf9ca (wanted 1 copies but wrote 0): service http://localhost:25107/ responded with 0 (7, 'Failed to connect to localhost port 25107: Connection refused'); service http://localhost:25108/ responded with 0 (7, 'Failed to connect to localhost port 25108: Connection refused')
(venv-arvados) lucas@test:~/arvados$ curl ^C
(venv-arvados) lucas@test:~/arvados$ telnet ^C
(venv-arvados) lucas@test:~/arvados$ curl http://localhost:25108/
curl: (7) Failed to connect to localhost port 25108: Connection refused

Keepproxy is set up at https://172.17.0.2:25101 and is indeed reachable, here's what the API is returning:

(venv-arvados) lucas@test:~/arvados$ arv keep_service list
{
 "kind":"arvados#keepServiceList",
 "etag":"",
 "self_link":"",
 "offset":0,
 "limit":100,
 "items":[
  {
   "href":"/keep_services/x39c4-bi6l4-14yfg20l04tv2by",
   "kind":"arvados#keepService",
   "etag":"8worjtrkyjgq5ei343rx49h4h",
   "uuid":"x39c4-bi6l4-14yfg20l04tv2by",
   "owner_uuid":"x39c4-tpzed-000000000000000",
   "created_at":"2019-09-23T13:57:39.343377000Z",
   "modified_by_client_uuid":"x39c4-ozdt8-7hl7rzjk9ytyqjo",
   "modified_by_user_uuid":"x39c4-tpzed-000000000000000",
   "modified_at":"2019-09-23T14:39:34.597405000Z",
   "service_host":"localhost",
   "service_port":25108,
   "service_ssl_flag":false,
   "service_type":"disk",
   "read_only":false
  },
  {
   "href":"/keep_services/x39c4-bi6l4-rir0jlmrpbgili1",
   "kind":"arvados#keepService",
   "etag":"omffft1qfwdw7ce8nn3lqef6",
   "uuid":"x39c4-bi6l4-rir0jlmrpbgili1",
   "owner_uuid":"x39c4-tpzed-000000000000000",
   "created_at":"2019-09-23T13:57:39.192282000Z",
   "modified_by_client_uuid":"x39c4-ozdt8-7hl7rzjk9ytyqjo",
   "modified_by_user_uuid":"x39c4-tpzed-000000000000000",
   "modified_at":"2019-09-23T14:39:34.485381000Z",
   "service_host":"localhost",
   "service_port":25107,
   "service_ssl_flag":false,
   "service_type":"disk",
   "read_only":false
  },
  {
   "href":"/keep_services/x39c4-bi6l4-vl1b3ahbpmaly3u",
   "kind":"arvados#keepService",
   "etag":"a89cmvkm2ntxjv7zfq78cv4dx",
   "uuid":"x39c4-bi6l4-vl1b3ahbpmaly3u",
   "owner_uuid":"x39c4-tpzed-000000000000000",
   "created_at":"2019-09-23T13:57:39.216467000Z",
   "modified_by_client_uuid":"x39c4-ozdt8-7hl7rzjk9ytyqjo",
   "modified_by_user_uuid":"x39c4-tpzed-000000000000000",
   "modified_at":"2019-09-23T13:57:39.217019000Z",
   "service_host":"172.17.0.2",
   "service_port":25101,
   "service_ssl_flag":true,
   "service_type":"proxy",
   "read_only":false
  }
 ],
 "items_available":3
}

Is this correct or should the keepstores be listening on 172.17.0.2 too?

Actions #41

Updated by Tom Clegg over 4 years ago

  • Target version changed from 2019-09-25 Sprint to 2019-10-09 Sprint
Actions #42

Updated by Tom Clegg over 4 years ago

13647-keepstore-config @ 6527c64378ebd1970be69f6502db181c8272da5a
  • migrate keepstore volume configs to cluster config
  • set the X-External-Client header for non-arvbox clients
Actions #43

Updated by Lucas Di Pentima over 4 years ago

Tested both inside and outside arvbox, works great!

6527c64378ebd1970be69f6502db181c8272da5a LGTM, thanks!

Actions #44

Updated by Peter Amstutz over 4 years ago

There's a circular dependency problem in checkPendingKeepstoreMigrations():

  1. arvados-server is invoked during server initialization to get the configuration
  2. configuration loading includes calling checkPendingKeepstoreMigrations(), which makes a call to the API server
  3. at this point in API server initialization, nginx/passenger is in a state where it accepts connections and queues them to be handled as soon as the service finishes initializing
  4. the API call in checkPendingKeepstoreMigrations never returns (because it is queued)
  5. API server startup is deadlocked with this error
2019-09-27_14:15:06.18217 [ E 2019-09-27 14:15:05.6522 7949/Tj age/Cor/App/Implementation.cpp:221 ]: Could not spawn process for application /usr/src/arvados/services/api: A timeout occurred while starting a preloader process.
2019-09-27_14:15:06.18220   Error ID: 14dfac70
2019-09-27_14:15:06.18220   Error details saved to: /tmp/passenger-error-hBWqPs.html
2019-09-27_14:15:06.18221 
2019-09-27_14:15:06.18221 [ E 2019-09-27 14:15:05.6575 7949/T9 age/Cor/Con/CheckoutSession.cpp:276 ]: [Client 1-1] Cannot checkout session because a spawning error occurred. The identifier of the error is 14dfac70. Please see earlier logs for details about the error.

Commenting out the API server call in checkPendingKeepstoreMigrations() are restarting immediately fixed the problem.

Actions #45

Updated by Eric Biagiotti over 4 years ago

Also, service/cmd/RunCommand loads the config which tries to hit keep_services before checking to see if SystemRootToken is in the config and falling back to ARVADOS_API_TOKEN. This results in 401 Unauthorized: Not logged in.

Maybe move service/cmd.go ln 113-116 to lib/config/load.go ln 266?

Actions #46

Updated by Lucas Di Pentima over 4 years ago

I've tried starting a new arvbox dev instance from current master and it worked without issues.

Actions #47

Updated by Tom Clegg over 4 years ago

Deadlock @ startup isn't consistent: it only happens when controller is up before RailsAPI. Otherwise, the RailsAPI→config-dump→keep_services call just fails, and RailsAPI starts up OK.

Fixed by skipping the "pending keepstore migrations" check when -skip-legacy flag is used (RailsAPI config-dump uses it).

13647-keepstore-config @ 110580d079cdb0b0a773ecf1671c1f97f1736cc6 -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1551/

Moved the env var checks from service startup to config loader, so ARVADOS_API_TOKEN can be used by keepstore migration code.

13647-keepstore-config @ 22cff6bb354d6980c33a0f471d5860f968e853a6 -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1553/

13647-keepstore-config @ 22cff6bb354d6980c33a0f471d5860f968e853a6 -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1554/

13647-keepstore-config @ 22cff6bb354d6980c33a0f471d5860f968e853a6 -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1555/

Actions #48

Updated by Peter Amstutz over 4 years ago

Tom Clegg wrote:

Deadlock @ startup isn't consistent: it only happens when controller is up before RailsAPI. Otherwise, the RailsAPI→config-dump→keep_services call just fails, and RailsAPI starts up OK.

Fixed by skipping the "pending keepstore migrations" check when -skip-legacy flag is used (RailsAPI config-dump uses it).

13647-keepstore-config @ 110580d079cdb0b0a773ecf1671c1f97f1736cc6 -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1551/

Moved the env var checks from service startup to config loader, so ARVADOS_API_TOKEN can be used by keepstore migration code.

13647-keepstore-config @ 22cff6bb354d6980c33a0f471d5860f968e853a6 -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1553/

13647-keepstore-config @ 22cff6bb354d6980c33a0f471d5860f968e853a6 -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1554/

13647-keepstore-config @ 22cff6bb354d6980c33a0f471d5860f968e853a6 -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1555/

This LGTM.

Actions #49

Updated by Tom Clegg over 4 years ago

  • Status changed from In Progress to Resolved
Actions #50

Updated by Peter Amstutz over 4 years ago

  • Release set to 22
Actions

Also available in: Atom PDF