Feature #13647
closed[keepstore] Load cluster configuration file
Added by Tom Clegg over 6 years ago. Updated almost 5 years ago.
Updated by Tom Clegg over 6 years ago
- Related to Idea #13648: [Epic] Use one cluster configuration file for all components added
Updated by Tom Morris almost 6 years ago
- Target version changed from To Be Groomed to Arvados Future Sprints
- Story points set to 2.0
Updated by Tom Morris over 5 years ago
- Target version changed from Arvados Future Sprints to 2019-07-03 Sprint
Updated by Tom Morris over 5 years ago
- Target version changed from 2019-07-03 Sprint to Arvados Future Sprints
Updated by Tom Clegg over 5 years ago
- Assigned To set to Tom Clegg
- Target version changed from Arvados Future Sprints to 2019-07-17 Sprint
Updated by Tom Clegg over 5 years ago
13647-load-old-config @ 0c894574ca46b77209127a4908844c2e0e734cea -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1384/
rearranges some things:- new Loader type, with SetupFlags() method used by all commands, so these will all use the overrides from the specified keepstore config
- "arvados-server config-check -legacy-keepstore-config /path/to/keepstore.yml"
- "arvados-server controller -legacy-keepstore-config /path/to/keepstore.yml"
- change "version" command so "arvados-server controller -version" and "arvados-controller -version" work
The minimal example (which we should be able to expand to other components + the rest of the keepstore configs after merging this) loads /etc/arvados/keepstore/keepstore.yml (or the specified location) if it exists, and uses those values to override the cluster config. The "apply keepstore config" func doesn't need to emit warnings about overrides, but it should use ldr.Logger to warn about configs that are no longer supported, or can't be automatically migrated.
I think we can address auto-detection for ambiguous-config
args in components that used to have their own -config
format (e.g., "keepstore -config foo.yml") like this:
- first try to load foo.yml as a cluster config file (the way source:lib/service/cmd.go does it)
if err == config.ErrNoClustersDefined
then- log a warning that we're guessing the config is a legacy keepstore config
- use flagset.Visit() to make a new args slice, changing "-config" to "-legacy-keepstore-config" and leaving everything else the same
- replace the flagset and loader with new ones, and redo flag parsing & config loading using the munged args
Updated by Peter Amstutz over 5 years ago
While you're tinkering with this, could you add an ARVADOS_CONFIG environment variable to control where it looks for config.yml? This would be helpful for testing.
Updated by Tom Clegg over 5 years ago
13647-load-old-config @ 3a81f54c8c0da6babdfb9014a64079b6b41f73c7 -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1387/
Updated by Peter Amstutz over 5 years ago
config/cmd.go checkCommand.RunCommand():
loader.SkipDeprecated = true withoutDepr, err := loader.Load() if err != nil { return 1 } loader.SkipDeprecated = false withDepr, err := loader.Load() if err != nil { return 1 }
It isn't totally obvious what this does. Needs some comments. What does "SkipDeprecated" mean? Suppress warnings, don't load them at all, or something else?
service/cmd.go command.RunCommand
loader := config.NewLoader(stdin, log) loader.SetupFlags(flags)
NewLoader() calls ldr.SetupFlags() a NewFlagSet() that seems like a throwaway object? What's the purpose? Should NewLoader() take flags as a parameter?
I think we can address auto-detection for ambiguous-config
args in components that used to have their own-config
format (e.g., "keepstore -config foo.yml") like this:
- first try to load foo.yml as a cluster config file (the way source:lib/service/cmd.go does it)
if err == config.ErrNoClustersDefined
then
- log a warning that we're guessing the config is a legacy keepstore config
- use flagset.Visit() to make a new args slice, changing "-config" to "-legacy-keepstore-config" and leaving everything else the same
- replace the flagset and loader with new ones, and redo flag parsing & config loading using the munged args
Seems reasonable. Could this try-and-try-again behavior be done in a generic way instead of being duplicated across every component? It really just needs to know what the fallback flag is, right? Maybe even an option to NewLoader() and put the behavior in Load() ?
So if I'm following along, what I should be doing for crunch-dispatch-slurm is:
- Update crunch-dispatch-slurm to use the main config
- Implement loadOldCrunchDispatchSlurmConfig() and add it in the right place
- Add -legacy-crunch-dispatch-slurm-config option
- In the main method, create a Loader and call loader.Load() to get the config object
Anything else?
Updated by Tom Clegg over 5 years ago
Peter Amstutz wrote:
config/cmd.go checkCommand.RunCommand():
It isn't totally obvious what this does. Needs some comments. What does "SkipDeprecated" mean? Suppress warnings, don't load them at all, or something else?
It means don't load them at all.
Added comments to the SkipDeprecated field and the "load twice" code block.
NewLoader() calls ldr.SetupFlags() a NewFlagSet() that seems like a throwaway object? What's the purpose? Should NewLoader() take flags as a parameter?
This is how the config library gets to define the config-related cli flags, defaults, etc. Two lines later, flags.Parse(args) -- now that the loader has registered its CLI flags, parse the arguments we've been given. So if we are called as "arvados-server config-dump -legacy-keepstore-config /foo" then the loader's KeepstorePath changes to "/foo" when we call flags.Parse(args).
Seems reasonable. Could this try-and-try-again behavior be done in a generic way instead of being duplicated across every component? It really just needs to know what the fallback flag is, right? Maybe even an option to NewLoader() and put the behavior in Load() ?
Yes, this should surely be in the config package, and the component would just do this
args, err := loader.MungeLegacyConfigArgs(args, "-legacy-keepstore-config")
It might be annoying to make this work with stuff like "-config -". I propose we just don't support loading old configs that way, i.e., don't try to munge anything if the -config
arg isn't a [symlink to a] regular file.
- Update crunch-dispatch-slurm to use the main config
- Implement loadOldCrunchDispatchSlurmConfig() and add it in the right place
- Add -legacy-crunch-dispatch-slurm-config option
- In the main method, create a Loader and call loader.Load() to get the config object
Sounds right.
Updated by Tom Clegg over 5 years ago
- adds MungeLegacyConfigArgs method to loader
Updated by Peter Amstutz over 5 years ago
Tom Clegg wrote:
13647-load-old-config @ 3310da050bd763055a67b2395f122ab5c01cbdf9 -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1389/
- adds MungeLegacyConfigArgs method to loader
This LGTM.
Updated by Peter Amstutz over 5 years ago
Tom Clegg wrote:
Peter Amstutz wrote:
NewLoader() calls ldr.SetupFlags() a NewFlagSet() that seems like a throwaway object? What's the purpose? Should NewLoader() take flags as a parameter?This is how the config library gets to define the config-related cli flags, defaults, etc. Two lines later, flags.Parse(args) -- now that the loader has registered its CLI flags, parse the arguments we've been given. So if we are called as "arvados-server config-dump -legacy-keepstore-config /foo" then the loader's KeepstorePath changes to "/foo" when we call flags.Parse(args).
func NewLoader(stdin io.Reader, logger logrus.FieldLogger) *Loader { ldr := &Loader{Stdin: stdin, Logger: logger} ldr.SetupFlags(flag.NewFlagSet("", flag.ContinueOnError)) // what is this line for ??? ldr.Path = "-" return ldr }
func (c *command) RunCommand(prog string, args []string, stdin io.Reader, stdout, stderr io.Writer) int { ... loader := config.NewLoader(stdin, log) loader.SetupFlags(flags) versionFlag := flags.Bool("version", false, "Write version information to stdout and exit 0") err = flags.Parse(args) ... }
I still don't understand how the flag.NewFlagSet()
that is passed to ldr.SetupFlags()
in NewLoader()
ever gets used for anything? The RunCommmand()
method calls SetupFlags()
on a different flag set that is actually used for Parse().
Updated by Tom Clegg over 5 years ago
I still don't understand how the
flag.NewFlagSet()
that is passed toldr.SetupFlags()
inNewLoader()
ever gets used for anything? TheRunCommmand()
method callsSetupFlags()
on a different flag set that is actually used for Parse().
Ooooh, sorry. I thought you were asking about the flagset in RunCommand.
The flagset in NewLoader() is indeed a throwaway. The only reason to call SetupFlags() there is to get side effect of setting the default values. I've added a comment to explain that.
The next line (ldr.Path = "-") is a bug left over from testing, though -- I've deleted that, and fixed the tests to do that themselves.
13647-load-old-config @ e723b02bc8e5382575f27b0ce803f775adf6d479 -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1390/
Updated by Peter Amstutz over 5 years ago
Tom Clegg wrote:
I still don't understand how the
flag.NewFlagSet()
that is passed toldr.SetupFlags()
inNewLoader()
ever gets used for anything? TheRunCommmand()
method callsSetupFlags()
on a different flag set that is actually used for Parse().Ooooh, sorry. I thought you were asking about the flagset in RunCommand.
The flagset in NewLoader() is indeed a throwaway. The only reason to call SetupFlags() there is to get side effect of setting the default values. I've added a comment to explain that.
The next line (ldr.Path = "-") is a bug left over from testing, though -- I've deleted that, and fixed the tests to do that themselves.
13647-load-old-config @ e723b02bc8e5382575f27b0ce803f775adf6d479 -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1390/
Great. This LGTM.
Updated by Tom Morris over 5 years ago
- Target version changed from 2019-07-17 Sprint to 2019-07-31 Sprint
Updated by Peter Amstutz over 5 years ago
- Status changed from New to In Progress
Updated by Tom Clegg over 5 years ago
- Target version changed from 2019-07-31 Sprint to 2019-08-14 Sprint
Updated by Peter Amstutz over 5 years ago
Have you considered how this will synchronize with the keep_services and/or keep_disks tables. Nico has reported that one of the biggest hassles in setting up a cluster is having to set up the API server, then configure the keep servers, and then go back and update the keep_services table. Could the API server keep_services endpoint be changed to return values from the cluster config and eliminate the keep_services table entirely?
Updated by Tom Clegg over 5 years ago
Yes, in the future clients will just use the Services and Volumes sections of the config and (if AccessViaHosts is in play) rendezvous hash on the volume ID rather than the server ID. Presumably we'll provide a shim so old clients that query /arvados/v1/keep_services still work. The present issue is just about moving keepstore config from keepstore.yml to config.yml though, so I don't think it will change the keep_services behavior at all -- although config migration will need to query keep_services in an attempt to generate new volume UUIDs that match the rendezvous order of the existing keep_service UUIDs.
Updated by Tom Clegg over 5 years ago
- Target version changed from 2019-08-14 Sprint to 2019-08-28 Sprint
Updated by Peter Amstutz over 5 years ago
- Related to Idea #15575: [API] [keepstore] Sync config file with keep_services table added
Updated by Tom Morris over 5 years ago
- Target version changed from 2019-08-28 Sprint to 2019-09-11 Sprint
Updated by Tom Clegg over 5 years ago
Now that I'm looking at the docs, explaining to humans how to make volume UUIDs match up with keep_services entries might be harder than writing controller/railsAPI code to answer keep_services queries from cluster config if the keep_services table hasn't been populated... InternalURLs entries just need a Rendezvous field that we can return as the unique part of the UUIDs so rendezvous order is preserved. Migrating the tests (which expect to bring keep services up & down without restarting RailsAPI) might be painful, though.
Updated by Tom Clegg over 5 years ago
so far: 13647-keepstore-config @ b144996ab73b15233f027713faf3114f34985e72 -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1518/
surprises:- If the keep_services table is empty, RailsAPI invents contents based on the config file. (It doesn't write to the DB table, though -- it makes a literal table and queries from that. This lets us keep all the filtering stuff, while avoiding problems with multiple RailsAPI fighting over the database.)
- Mention keepstore on the config migration doc page (same as keepproxy, plus "do it on each keepstore node", plus "delete everything in your keep_services table after migrating"
- Mention "can't use CLI args any more" in upgrade notes (they've been deprecated for a long time)
- Update S3 and Azure doc pages
- arvbox
Updated by Tom Clegg over 5 years ago
- Target version changed from 2019-09-11 Sprint to 2019-09-25 Sprint
Updated by Tom Clegg over 5 years ago
13647-keepstore-config @ 08d2f48c3cf3eac9d71261172677c54f03669fe5
with updated S3 and Azure install docs
Updated by Lucas Di Pentima over 5 years ago
These are my comments/questions as a first review pass. Maybe some questions reflect misunderstandings of Keep internals from my part, but I did re-read the available documentation so I think is a good opportunity to enhance it.
- File
lib/config/config.default.yml
MaxKeepBlockBuffers
’s current placement (inside.API
) doesn’t seem ideal, wouldn’t be better to place it on.Collections
with the other Blob* settings?- Related (and maybe silly) question: What’s the difference between “data block” and “blob”? On our documentation/config items seem to use both terms for the same concept. Why not using
MaxKeepBlobBuffers
orBlockTrashLifetime
, for example?
- Related (and maybe silly) question: What’s the difference between “data block” and “blob”? On our documentation/config items seem to use both terms for the same concept. Why not using
- Related:
BlobSigning
’s documentation on lines 316-324 seem to refer to the opposite to what this setting means. This wasn’t written in this branch but I think it’s a good opportunity for a fix. - At L379 there’s a mention about a keepstore
-blob-signature-ttl
argument that’s now deprecated. - Maybe the
.Volumes
' different sections could be documented a little more like other parts of the default config file.
- File
lib/config/deprecated_keepstore.go
- Readability/Code style question: On lines 160-171 and similar cases, wouldn’t be clearer (by having less code) to not check for equality on those settings and assigning whatever not-nil value the old config has?
- Regarding
loadOldKeepstoreConfig()
, I struggled a bit to follow the different code paths there. Could it be splitted up/simplified for readablity?
- At f04693da1811e670d4cbb981debeecf14d79137c there’re some controller/dispatchcloud prometheus updates, are those related to this branch?
- Nit:
KeepServiceController.from_config_or_db()
may be called.from_db_or_config()
to respect precedence? Also, why not overrideKeepService.all()
method? - Should the RailsAPI suggest the site admin to migrate the KeepServices to Volumes config when seeing records on that table? Maybe also warn the admin that any Volume config will be ignored by old clients while migration is not complete?
- Documentation
- Filesystem storage config section
AccessViaHost
purpose is not 100% clear to me. Are we exposing all volumes of any keepstore to the rest of the cluster? (it seems so by looking atexport.go
) The round-robin mentioned at the first part makes me think that keepstore hosts may include other keepstore’s volumes on that decision, is that true?- When not using
AccessViaHost
, does that mean that the Volume is used by ALL keepstores? If yes, maybe it should be made more explicit.
- S3 storage config section
- The one-bucket-many-keepstores case sounds similar to the one-nas-many-keepstores case on the filesystem section, but config between them look inconsistent in the sense that on the NAS case AccessViaHosts is not needed, should we make them consistent with the S3 case? Further reading on the S3 config comments, seems like the filesystem case would also support AccessViaHosts with ReadOnly? If so, maybe it should be documented.
- Azure config section:
- Preexisting typo detected: “configued"
- Maybe the
Rendezvous
key should be explained on the default config file as it also becomes part of the documentation. - I think a Volume config migration guide would be useful.
- Filesystem storage config section
Updated by Tom Clegg over 5 years ago
MaxKeepBlockBuffers
’s current placement (inside.API
) doesn’t seem ideal, wouldn’t be better to place it on.Collections
with the other Blob* settings?
Not sure what to do about this. API.* is where the other server process/resource tuning things are, in which sense this fits although "API" might not be the greatest name. The Collections.* section is more about functional aspects of storage.
- Related (and maybe silly) question: What’s the difference between “data block” and “blob”? On our documentation/config items seem to use both terms for the same concept. Why not using
MaxKeepBlobBuffers
orBlockTrashLifetime
, for example?
Changed to "blob" since that seems to be the majority.
- Related:
BlobSigning
’s documentation on lines 316-324 seem to refer to the opposite to what this setting means. This wasn’t written in this branch but I think it’s a good opportunity for a fix.
Indeed. Fixed.
- At L379 there’s a mention about a keepstore
-blob-signature-ttl
argument that’s now deprecated.
Removed.
- Maybe the
.Volumes
' different sections could be documented a little more like other parts of the default config file.
Yes... How about just linking to the s3/azure/fs install doc pages? I don't want to duplicate, and I think the doc pages benefit from having multiple examples, which might be awkward to add to the defaults file.
- Readability/Code style question: On lines 160-171 and similar cases, wouldn’t be clearer (by having less code) to not check for equality on those settings and assigning whatever not-nil value the old config has?
Removed those checks. I think they made sense when this func was responsible for logging "using a different XYZ from legacy config", but that's not the case any more.
- Regarding
loadOldKeepstoreConfig()
, I struggled a bit to follow the different code paths there. Could it be splitted up/simplified for readablity?
Indeed. Split the volume migration into 2 separate funcs.
- At f04693da1811e670d4cbb981debeecf14d79137c there’re some controller/dispatchcloud prometheus updates, are those related to this branch?
I moved the prometheus setup and TLS into lib/service in order to preserve keepstore's TLS capability without adding more generic http server stuff in the keepstore package. controller/dispatchcloud had to catch up to the new "service handler" signature.
- Nit:
KeepServiceController.from_config_or_db()
may be called.from_db_or_config()
to respect precedence? Also, why not overrideKeepService.all()
method?
Yeah, good point. Done. (TIL at the class level "from(sql)" is one of many aliases for "all()" so there's an "unscoped" in there to avoid infinite recursion when all() calls from().)
- Should the RailsAPI suggest the site admin to migrate the KeepServices to Volumes config when seeing records on that table? Maybe also warn the admin that any Volume config will be ignored by old clients while migration is not complete?
Hm... Ideally "config-check" should be the one to report this, but it doesn't necessarily have access to the DB, and RailsAPI isn't necessarily running. Maybe it's next best if keepstore reports it, since keepstore already expects RailsAPI to be up when checking config?
13647-keepstore-config @ 76ee36c49088274ab4e8465c0f4b20878d66c706 -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1531/
(haven't dealt with the doc comments yet)
Updated by Tom Clegg over 5 years ago
AccessViaHost
purpose is not 100% clear to me. Are we exposing all volumes of any keepstore to the rest of the cluster? (it seems so by looking atexport.go
)
I added comments on AccessViaHosts in the default config.
Yes. The idea is to use rendezvous hashing for volumes rather than gateway servers. Ideally clients should choose a volume using rendezvous hashing, then choose randomly from the gateways that have the needed (read or write) access to that volume. This means clients need to know a) the volume UUIDs in order to do rendezvous ordering, and b) the server-to-volume mapping, so they can connect to the appropriate servers.
The round-robin mentioned at the first part makes me think that keepstore hosts may include other keepstore’s volumes on that decision, is that true?
I suppose, if you define "other keepstore’s volumes" to mean "volumes that are also used by another keepstore".
Omitting AccessViaHosts -- or listing multiple servers there -- is equivalent to a legacy config setup where the same volume entry appears in multiple hosts' /etc/arvados/keepstore/keepstore.yml file.
- When not using
AccessViaHost
, does that mean that the Volume is used by ALL keepstores? If yes, maybe it should be made more explicit.
Yes. I've reiterated this in the added AccessViaHosts comments. (Are there other places it should be mentioned?)
- The one-bucket-many-keepstores case sounds similar to the one-nas-many-keepstores case on the filesystem section, but config between them look inconsistent in the sense that on the NAS case AccessViaHosts is not needed, should we make them consistent with the S3 case? Further reading on the S3 config comments, seems like the filesystem case would also support AccessViaHosts with ReadOnly? If so, maybe it should be documented.
Yes. Copied the AccessViaHosts example from s3/azure into fs as well.
- Preexisting typo detected: “configued"
Fixed. (grep found 2 of these, and a bonus Arvado.)
- Maybe the
Rendezvous
key should be explained on the default config file as it also becomes part of the documentation.
Done.
- I think a Volume config migration guide would be useful.
Yes.
13647-keepstore-config @ baa2bf80cc078868191494ccb1631c426f4e0496 -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1532/
Todo:- Mention keepstore on the config migration doc page (same as keepproxy, plus "do it on each keepstore node", plus "delete everything in your keep_services table after migrating"
- Prompt "delete everything in your keep_services table" when appropriate (TBD where -- config-check? RailsAPI logs?)
- Document which server/volume arrangements work best (during the transition from server- to volume-rendezvous, some setups might not rendezvous very well, and impact performance / efficiency)
- Mention "can't use CLI args any more" in upgrade notes (they've been deprecated for a long time)
- arvbox
Updated by Tom Clegg over 5 years ago
- Related to Idea #15641: [keep-balance] [SDKs] rendezvous by volume UUID instead of server UUID added
Updated by Lucas Di Pentima over 5 years ago
Some more comments:
- File
lib/config/deprecated_keepstore.go
- Nit: Line 146 has a punctuation mark that makes
go-lint
nervous
- Nit: Line 146 has a punctuation mark that makes
- Regarding
API.MaxKeepBlobBuffers
' placement, a second look atAPI.*
revealed some other keys being used by servers not being RailsAPI, I was confused about its purpose (and agree thatAPI
may not be the best name). - File
services/keepstore/command.go
- Log messages include old settings' names in lines 152 & 158
- File
services/keepstore/handler_test.go
- Log message includes old setting's name in line 939
- There’re mentions of
RequireSignatures
in lines 330 & 396
- There’re several mentions of the old
BlobSignatureTTL
setting name on comments at several services/keepstore/* files - File
doc/install/install-keepstore.html.textile.liquid
line 79 mentionsEnableDelete
- There’re several mentions of the old
TrashLifetime
setting name on keepstore’s drivers files - File
services/keepstore/s3_volume.go
mentions alsoTrashCheckInterval
in line 715 - File
services/keepstore/s3_volume.go
mentions-trash-lifetime
and-s3-unsafe-delete
in line 116 - File
services/keepstore/volume.go
mentions-trash-lifetime
in line 173
Updated by Tom Clegg over 5 years ago
(from chat) another TODO: update/remove the obsolete "keepstore config file" sections of install-keep-balance.html
Updated by Tom Clegg over 5 years ago
- address points in note-34
- update migration and upgrading-to-master notes
- update keep-balance install docs (references to keepstore configs)
- config-check, config-diff, keepstore, etc. detect when entries in keep_services table have been migrated to cluster config, and recommend deleting them
Updated by Lucas Di Pentima over 5 years ago
Updates at 28f250ee0a43e873760d43b39119e5710de75fe8 LGTM.
I've tried running arvbox
and came across a couple of issues. It seems that the 'killall' command isn't installed on the docker image, as I read on keepstore
's log:
2019-09-20_14:01:59.87422 + ARVADOS_API_TOKEN=3ab3rzow7l5bpt833h2ypyn0sonhbhx111islj1eyjcj3n0in2 2019-09-20_14:01:59.87422 + set +e 2019-09-20_14:01:59.87422 + read -rd '' keepservice 2019-09-20_14:01:59.87445 + set -e 2019-09-20_14:01:59.87447 + test -s /var/lib/arvados/keep0-uuid [...] 2019-09-20_14:02:57.39765 + killall -HUP keepproxy 2019-09-20_14:02:57.40994 /usr/local/lib/arvbox/keep-setup.sh: line 48: killall: command not found
Keepstore seems to run but the arvbox start script doesn’t realize that. There's some unknown IP addr being logged (31144156a2c0 == 65.86.162.192?):
2019-09-20_14:14:04.53995 {"PID":10205,"level":"info","msg":"keepstore dev starting, pid 10205","time":"2019-09-20T14:14:04.539943358Z”} 2019-09-20_14:14:04.54595 2019/09/20 14:14:04 stat "/dev/sda1[/home/lucas/.arvbox/13647/var]": stat /dev/sda1[/home/lucas/.arvbox/13647/var]: no such file or directory; using blank DeviceID for volume [UnixVolume /var/lib/arvados/keep0] 2019-09-20_14:14:04.54599 {"PID":10205,"level":"info","msg":"started volume x2st0-nyw5e-n7blw3fakc8sf0j ([UnixVolume /var/lib/arvados/keep0]), ReadOnly=false","time":"2019-09-20T14:14:04.545965490Z”} 2019-09-20_14:14:04.54882 2019/09/20 14:14:04 stat "/dev/sda1[/home/lucas/.arvbox/13647/var]": stat /dev/sda1[/home/lucas/.arvbox/13647/var]: no such file or directory; using blank DeviceID for volume [UnixVolume /var/lib/arvados/keep0] 2019-09-20_14:14:04.57585 {"Listen":"172.17.0.2:25107","PID":10205,"Service":"keepstore","URL":"http://31144156a2c0:25107","level":"info","msg":"listening","time":"2019-09-20T14:14:04.575779238Z"}
Updated by Tom Clegg over 5 years ago
13647-keepstore-config @ 8cbb55c614cc5942e9ea807bb7ca4df8160d1952 changes "killall keepproxy" to "sv hup /service/keepproxy" (it seems like "killall" has never worked there, since psmisc has never been installed -- maybe it just didn't matter until now?)
Updated by Tom Clegg over 5 years ago
- change "killall keepproxy" to "sv hup .../keepproxy"
- add keepstore entries to arvbox config (arvbox comes up now)
Updated by Lucas Di Pentima over 5 years ago
Arvbox started OK on my side too. Couldn't upload data from the host (but could do it from within the container):
2019-09-23 11:42:43 arvados.arv_put[10174] INFO: Calculating upload size, this could take some time... 2019-09-23 11:42:43 arvados.arv_put[10174] INFO: Creating new cache file at /home/lucas/.cache/arvados/arv-put/b26f64529f678ed834a4f2d8f247fabe 0M / 23M 0.2% Traceback (most recent call last): File "/home/lucas/venv-arvados/bin/arv-put", line 7, in <module> main() File "/home/lucas/venv-arvados/lib/python3.5/site-packages/arvados/commands/put.py", line 1219, in main writer.start(save_collection=not(args.stream or args.raw)) File "/home/lucas/venv-arvados/lib/python3.5/site-packages/arvados/commands/put.py", line 606, in start self._local_collection.manifest_text() [...] File "/home/lucas/venv-arvados/lib/python3.5/site-packages/arvados/keep.py", line 1200, in put data_hash, copies, writer_pool.done()), service_errors, label="service") arvados.errors.KeepWriteError: failed to write 677acd81ebd63b31de90c87872edf9ca (wanted 1 copies but wrote 0): service http://localhost:25107/ responded with 0 (7, 'Failed to connect to localhost port 25107: Connection refused'); service http://localhost:25108/ responded with 0 (7, 'Failed to connect to localhost port 25108: Connection refused') (venv-arvados) lucas@test:~/arvados$ curl ^C (venv-arvados) lucas@test:~/arvados$ telnet ^C (venv-arvados) lucas@test:~/arvados$ curl http://localhost:25108/ curl: (7) Failed to connect to localhost port 25108: Connection refused
Keepproxy is set up at https://172.17.0.2:25101
and is indeed reachable, here's what the API is returning:
(venv-arvados) lucas@test:~/arvados$ arv keep_service list { "kind":"arvados#keepServiceList", "etag":"", "self_link":"", "offset":0, "limit":100, "items":[ { "href":"/keep_services/x39c4-bi6l4-14yfg20l04tv2by", "kind":"arvados#keepService", "etag":"8worjtrkyjgq5ei343rx49h4h", "uuid":"x39c4-bi6l4-14yfg20l04tv2by", "owner_uuid":"x39c4-tpzed-000000000000000", "created_at":"2019-09-23T13:57:39.343377000Z", "modified_by_client_uuid":"x39c4-ozdt8-7hl7rzjk9ytyqjo", "modified_by_user_uuid":"x39c4-tpzed-000000000000000", "modified_at":"2019-09-23T14:39:34.597405000Z", "service_host":"localhost", "service_port":25108, "service_ssl_flag":false, "service_type":"disk", "read_only":false }, { "href":"/keep_services/x39c4-bi6l4-rir0jlmrpbgili1", "kind":"arvados#keepService", "etag":"omffft1qfwdw7ce8nn3lqef6", "uuid":"x39c4-bi6l4-rir0jlmrpbgili1", "owner_uuid":"x39c4-tpzed-000000000000000", "created_at":"2019-09-23T13:57:39.192282000Z", "modified_by_client_uuid":"x39c4-ozdt8-7hl7rzjk9ytyqjo", "modified_by_user_uuid":"x39c4-tpzed-000000000000000", "modified_at":"2019-09-23T14:39:34.485381000Z", "service_host":"localhost", "service_port":25107, "service_ssl_flag":false, "service_type":"disk", "read_only":false }, { "href":"/keep_services/x39c4-bi6l4-vl1b3ahbpmaly3u", "kind":"arvados#keepService", "etag":"a89cmvkm2ntxjv7zfq78cv4dx", "uuid":"x39c4-bi6l4-vl1b3ahbpmaly3u", "owner_uuid":"x39c4-tpzed-000000000000000", "created_at":"2019-09-23T13:57:39.216467000Z", "modified_by_client_uuid":"x39c4-ozdt8-7hl7rzjk9ytyqjo", "modified_by_user_uuid":"x39c4-tpzed-000000000000000", "modified_at":"2019-09-23T13:57:39.217019000Z", "service_host":"172.17.0.2", "service_port":25101, "service_ssl_flag":true, "service_type":"proxy", "read_only":false } ], "items_available":3 }
Is this correct or should the keepstores be listening on 172.17.0.2
too?
Updated by Tom Clegg over 5 years ago
- Target version changed from 2019-09-25 Sprint to 2019-10-09 Sprint
Updated by Tom Clegg over 5 years ago
- migrate keepstore volume configs to cluster config
- set the X-External-Client header for non-arvbox clients
Updated by Lucas Di Pentima over 5 years ago
Tested both inside and outside arvbox, works great!
6527c64378ebd1970be69f6502db181c8272da5a LGTM, thanks!
Updated by Peter Amstutz over 5 years ago
There's a circular dependency problem in checkPendingKeepstoreMigrations():
- arvados-server is invoked during server initialization to get the configuration
- configuration loading includes calling checkPendingKeepstoreMigrations(), which makes a call to the API server
- at this point in API server initialization, nginx/passenger is in a state where it accepts connections and queues them to be handled as soon as the service finishes initializing
- the API call in checkPendingKeepstoreMigrations never returns (because it is queued)
- API server startup is deadlocked with this error
2019-09-27_14:15:06.18217 [ E 2019-09-27 14:15:05.6522 7949/Tj age/Cor/App/Implementation.cpp:221 ]: Could not spawn process for application /usr/src/arvados/services/api: A timeout occurred while starting a preloader process. 2019-09-27_14:15:06.18220 Error ID: 14dfac70 2019-09-27_14:15:06.18220 Error details saved to: /tmp/passenger-error-hBWqPs.html 2019-09-27_14:15:06.18221 2019-09-27_14:15:06.18221 [ E 2019-09-27 14:15:05.6575 7949/T9 age/Cor/Con/CheckoutSession.cpp:276 ]: [Client 1-1] Cannot checkout session because a spawning error occurred. The identifier of the error is 14dfac70. Please see earlier logs for details about the error.
Commenting out the API server call in checkPendingKeepstoreMigrations() are restarting immediately fixed the problem.
Updated by Eric Biagiotti over 5 years ago
Also, service/cmd/RunCommand loads the config which tries to hit keep_services before checking to see if SystemRootToken is in the config and falling back to ARVADOS_API_TOKEN. This results in 401 Unauthorized: Not logged in.
Maybe move service/cmd.go ln 113-116 to lib/config/load.go ln 266?
Updated by Lucas Di Pentima over 5 years ago
I've tried starting a new arvbox dev instance from current master and it worked without issues.
Updated by Tom Clegg over 5 years ago
Deadlock @ startup isn't consistent: it only happens when controller is up before RailsAPI. Otherwise, the RailsAPI→config-dump→keep_services call just fails, and RailsAPI starts up OK.
Fixed by skipping the "pending keepstore migrations" check when -skip-legacy flag is used (RailsAPI config-dump uses it).
13647-keepstore-config @ 110580d079cdb0b0a773ecf1671c1f97f1736cc6 -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1551/
Moved the env var checks from service startup to config loader, so ARVADOS_API_TOKEN can be used by keepstore migration code.
13647-keepstore-config @ 22cff6bb354d6980c33a0f471d5860f968e853a6 -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1553/
13647-keepstore-config @ 22cff6bb354d6980c33a0f471d5860f968e853a6 -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1554/
13647-keepstore-config @ 22cff6bb354d6980c33a0f471d5860f968e853a6 -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1555/
Updated by Peter Amstutz over 5 years ago
Tom Clegg wrote:
Deadlock @ startup isn't consistent: it only happens when controller is up before RailsAPI. Otherwise, the RailsAPI→config-dump→keep_services call just fails, and RailsAPI starts up OK.
Fixed by skipping the "pending keepstore migrations" check when -skip-legacy flag is used (RailsAPI config-dump uses it).
13647-keepstore-config @ 110580d079cdb0b0a773ecf1671c1f97f1736cc6 -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1551/
Moved the env var checks from service startup to config loader, so ARVADOS_API_TOKEN can be used by keepstore migration code.
13647-keepstore-config @ 22cff6bb354d6980c33a0f471d5860f968e853a6 -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1553/
13647-keepstore-config @ 22cff6bb354d6980c33a0f471d5860f968e853a6 -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1554/
13647-keepstore-config @ 22cff6bb354d6980c33a0f471d5860f968e853a6 -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1555/
This LGTM.
Updated by Tom Clegg about 5 years ago
- Status changed from In Progress to Resolved