Project

General

Profile

Actions

Feature #21773

closed

keep-web and keepproxy should use Services.Keepstore.InternalURLs configs directly instead of requesting /arvados/v1/keep_services/accessible

Added by Tom Clegg 7 months ago. Updated 3 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Keep
Story points:
-
Release:
Release relationship:
Auto

Description

Background:

Currently sdk/go/keepclient polls the /arvados/v1/keep_services/accessible endpoint (which controller proxies through to railsapi) to find keepstore endpoints.

When the client is configured using environment variables, ARVADOS_KEEP_SERVICES can be used to override discovery. However, when the client is configured using /etc/arvados/config.yml, ARVADOS_KEEP_SERVICES is ignored.

This makes it unnecessarily awkward to override only (for example) keep-web's upstream keepstore services for testing/troubleshooting purposes: in order for keep-web to use alternate backends, one would need to also run a controller+railsapi pair using the altered config.

Proposed change:

System services that act as keep clients (i.e., Go programs that use arvados.NewClientFromConfig()) should obey the ARVADOS_KEEP_SERVICES environment variable if set, otherwise use Services.Keepstore.InternalURLs from the config file.

They could also periodically (once an hour?) check the discovery endpoint and log a warning if it doesn't match. OTOH this might be redundant with the existing config-sync approach (system services report in-use config file checksums via metrics, and healthcheck reports if they aren't all equal).


Files


Subtasks 1 (0 open1 closed)

Task #21822: Review 21773-keep-service-discoveryResolvedTom Clegg09/11/2024Actions

Related issues 2 (1 open1 closed)

Related to Arvados - Feature #22114: Remove test suite's dependence on keep_services tableNewActions
Related to Arvados - Bug #22124: Cannot upload data through keep-web (workbench)ResolvedTom Clegg09/23/2024Actions
Actions #1

Updated by Tom Clegg 7 months ago

21773-keep-service-discovery @ 38051951da96557240fa64ffffe33a992f49d46d -- developer-run-tests: #4218

  • Allow ARVADOS_KEEP_SERVICES environment variable to override service discovery in server context.
Actions #2

Updated by Peter Amstutz 7 months ago

  • Target version changed from Development 2024-05-22 sprint to Development 2024-06-05 sprint
Actions #3

Updated by Peter Amstutz 7 months ago

  • Target version changed from Development 2024-06-05 sprint to Development 2024-07-03 sprint
Actions #4

Updated by Peter Amstutz 6 months ago

  • Target version changed from Development 2024-07-03 sprint to Development 2024-07-24 sprint
Actions #5

Updated by Peter Amstutz 6 months ago

  • Target version changed from Development 2024-07-24 sprint to Development 2024-08-07 sprint
Actions #6

Updated by Tom Clegg 4 months ago

21773-keep-service-discovery @ 15b52c2d4c69daae7000268db3d41f8f914fc1bb -- developer-run-tests: #4383

  • System services load keep services list from config file instead of requesting from controller/railsapi

We don't actually want this in test suites, because they rely on source:sdk/go/python/tests/run_test_servers.py to bring up keep servers and add them to the keep_services table so they override the config file in railsapi's keep_services handler. To allow this mechanism to keep working, I added a config entry to preserve the old behavior in tests.

Actions #7

Updated by Peter Amstutz 4 months ago

  • Target version changed from Development 2024-08-07 sprint to Development 2024-08-28 sprint
Actions #8

Updated by Lucas Di Pentima 4 months ago

Some comments & questions:

  • Honestly, I don't like the inconsistency of one envvar priority versus another. As a sysadmin, it's natural to think that envvars always have precedence over anything else. Not sure if we should go through that rabbit hole here, but this behavior change will raise some eyebrows for sure.
  • Also, not sure I like having test specific config flags on the config file. Wouldn't it be better if we do the overriding with a different envvar? I think polluting the config file with things that noone (except us) will use should be avoided, unless using an alternate method is impossible/super difficult?
  • I thought that we don't save the keep services data in the database anymore? Regardless if that assumption is correct, I would prefer replacing "UseKeepServicesTable" with something like "UseKeepAccessibleEndpoint".
  • Commit e8c929c doesn't seem to mention the new behavior, it seems to me that's just a documentation correction on token creation. Maybe there's a missing commit pending to be pushed?
Actions #9

Updated by Lucas Di Pentima 4 months ago

Also, given that this is a behavior change, I think it would be convenient to document it on the upgrade notes just in case someone is relying on the old behavior.

Actions #10

Updated by Tom Clegg 4 months ago

Lucas Di Pentima wrote in #note-8:

  • Honestly, I don't like the inconsistency of one envvar priority versus another. As a sysadmin, it's natural to think that envvars always have precedence over anything else. Not sure if we should go through that rabbit hole here, but this behavior change will raise some eyebrows for sure.

(Assuming this refers to the fact that the config file takes precedence over ARVADOS_API_HOST/TOKEN/INSECURE, but not over ARVADOS_KEEP_SERVICES)

I think the situation here is just that NewClientFromConfig is used by server components, and the need for overriding ARVADOS_API_HOST just hasn't come up. We could add support for that easily enough, but then it might be a little weird that ARVADOS_API_TOKEN wouldn't have a clear meaning, because callers who use NewClientFromConfig decide themselves which token to use in various circumstances (sometimes a client-provided token, sometimes system root token).

I suppose we could log a warning ("this probably doesn't do what you want") when system services start up with non-empty ARVADOS_API_* env vars...?

  • Also, not sure I like having test specific config flags on the config file. Wouldn't it be better if we do the overriding with a different envvar? I think polluting the config file with things that noone (except us) will use should be avoided, unless using an alternate method is impossible/super difficult?

I actually started out using an env var for this, but it seemed a bit icky, and when I got to the point where arvados-server boot clears all ARVADOS_* env vars for its child processes, I thought it would be more clear to use a config key. But, yes, it does seem unfortunate to go down the road of config entries whose only purpose is testing.

I'll try changing back to ARVADOS_USE_KEEP_ACCESSIBLE_API and see how it looks.

  • I thought that we don't save the keep services data in the database anymore? Regardless if that assumption is correct, I would prefer replacing "UseKeepServicesTable" with something like "UseKeepAccessibleEndpoint".

The current logic in railsapi is "if there's anything in the database table, use it (because tests need this); otherwise use the config file."

  • Commit e8c929c doesn't seem to mention the new behavior, it seems to me that's just a documentation correction on token creation. Maybe there's a missing commit pending to be pushed?

Oops, I accidentally added e8c929ce89 to this branch -- it was part of #21910. I've removed it from this branch.

Actions #11

Updated by Tom Clegg 4 months ago

21773-keep-service-discovery @ b97dc3b710f2ea01edfa28b2dba36d4bd3ad1028 -- developer-run-tests: #4396
  • ARVADOS_USE_KEEP_ACCESSIBLE_API env var replaces config entry
Actions #12

Updated by Lucas Di Pentima 4 months ago

So, if I understood correctly, server components using NewClientFromConfig() used to ignore all ARVADOS_* envvars, and now they don't ignore ARVADOS_KEEP_SERVICES but still do the rest. I think this behavior change could potentially surprise some people, do you think it might need at least mentioning it in the upgrade notes?

What I tried to say is that ideally all envvars would have precedence over the rest of configuration means, and that it's unfortunate that now some have, and some other don't (inconsistency in behavior depending on which envvar we're looking at). A warning message I think is a good mitigation measure, and simple to implement.

Apart from these 2 things, it LGTM.

Actions #13

Updated by Lucas Di Pentima 4 months ago

Also, I tried but couldn't find the ticket where we made RailsAPI serve keep services in config file if the table was empty, but I kind of remember we left this behavior for migration purposes. If that's the case, the fact we (ab)use it to do testing is also unfortunate. We could take the opportunity of the 3.0 release to remove it entirely and use envvars to do testing. Not sure if that would be a entire project in itself, though.

Actions #14

Updated by Tom Clegg 4 months ago

Lucas Di Pentima wrote in #note-12:

So, if I understood correctly, server components using NewClientFromConfig() used to ignore all ARVADOS_* envvars, and now they don't ignore ARVADOS_KEEP_SERVICES but still do the rest. I think this behavior change could potentially surprise some people, do you think it might need at least mentioning it in the upgrade notes?

They ignore ARVADOS_API_HOST, ARVADOS_API_HOST_INSECURE, and ARVADOS_API_TOKEN. They do not ignore ARVADOS_DEBUG, ARVADOS_KEEP_SERVICES, etc.

Depending on how you look at it, "ignore" means either "config.yml takes precedence" or "those vars have no significance to them".

What I tried to say is that ideally all envvars would have precedence over the rest of configuration means, and that it's unfortunate that now some have, and some other don't (inconsistency in behavior depending on which envvar we're looking at). A warning message I think is a good mitigation measure, and simple to implement.

OK, I'll do that.

Lucas Di Pentima wrote in #note-13:

Also, I tried but couldn't find the ticket where we made RailsAPI serve keep services in config file if the table was empty, but I kind of remember we left this behavior for migration purposes. If that's the case, the fact we (ab)use it to do testing is also unfortunate. We could take the opportunity of the 3.0 release to remove it entirely and use envvars to do testing. Not sure if that would be a entire project in itself, though.

"Migrating the tests (which expect to bring keep services up & down without restarting RailsAPI) might be painful" --me in #13647#note-26 (2019)

We might be able to accomplish this with env vars plus
  • a way to specify each keepstore's rendezvous ID in the env var
  • restart keepproxy and keep-web with updated env vars, instead of updating the database and then sending them SIGHUP as we do now

It would be nice to clean that up now, and eliminate the need for ARVADOS_USE_KEEP_ACCESSIBLE_API.

But if it turns into a time-sink I might reconsider.

Actions #15

Updated by Peter Amstutz 4 months ago

  • Target version changed from Development 2024-08-28 sprint to Development 2024-09-11 sprint
Actions #16

Updated by Tom Clegg 4 months ago

Well, it turned into a time-sink. I think 21773-test-without-keep-service-table @ a3bfdf37c7 is on the right track (instead of updating the keep_services table, run_test_servers updates the config file and sends SIGHUP to keepproxy/keep-web, which then restart with updated config) but it's dragging on a bit.

Meanwhile we could merge the "override with env var" workaround and achieve the original stated goal here, and defer fixing the "tests still use the keep_services table" issue.

Here it is with the "env vars will not be used" warning at startup.

21773-keep-service-discovery @ a3bfdf37c7a0f3a5edf15357475a35079f7db473 -- developer-run-tests: #4426

Actions #17

Updated by Tom Clegg 4 months ago

The test failure was an unrelated test in sdk/go/arvados (clientRetrySuite.TestExponentialBackoff). It is inherently flaky because it checks that a finite series of randomized delays look random, which is not always true. I've updated that test to give up after 30 attempts instead of 20. Running that test 100x on my dev machine resulted in a 2% failure rate with previous code. Running 100x with the change to 30 resulted in 0 failures.

21773-keep-service-discovery @ 18f8ce0ea4d0760d86a0a3100ecf31e4c7cfa465 -- developer-run-tests: #4434

Actions #18

Updated by Lucas Di Pentima 3 months ago

21773-keep-service-discovery LGTM, thanks.

Actions #19

Updated by Tom Clegg 3 months ago

  • Related to Feature #22114: Remove test suite's dependence on keep_services table added
Actions #20

Updated by Tom Clegg 3 months ago

  • Status changed from In Progress to Resolved
Actions #21

Updated by Lucas Di Pentima 3 months ago

  • Related to Bug #22124: Cannot upload data through keep-web (workbench) added
Actions

Also available in: Atom PDF