Project

General

Profile

Actions

Bug #18076

closed

Properly handle cached user records that don't exist anymore on federation scenarios

Added by Lucas Di Pentima over 2 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
API
Target version:
Story points:
-
Release relationship:
Auto

Description

If for some reason a user's UUID on the LoginCluster changes (or the user record is removed), the cached copies on the federated clusters get stale and probably cause username validation issues when trying to cache the new record.
The federated controllers use the batch_update internal API to refresh the user list cache but it doesn't take into account that some users may not exist anymore so cache records aren't properly cleaned/handled.


Subtasks 2 (0 open2 closed)

Task #18083: Review 18076-stale-cached-users-handlingResolvedLucas Di Pentima09/02/2021Actions
Task #18110: Review 18076-user-cache-with-repositoryResolvedLucas Di Pentima09/06/2021Actions

Related issues

Related to Arvados - Bug #18094: Remove unused Users#update_uuid endpointResolvedLucas Di Pentima09/03/2021Actions
Actions #1

Updated by Peter Amstutz over 2 years ago

  • Target version set to 2021-09-15 sprint
Actions #2

Updated by Peter Amstutz over 2 years ago

  • Assigned To set to Lucas Di Pentima
Actions #3

Updated by Lucas Di Pentima over 2 years ago

  • Status changed from New to In Progress
Actions #4

Updated by Lucas Di Pentima over 2 years ago

  • Related to Bug #18094: Remove unused Users#update_uuid endpoint added
Actions #5

Updated by Lucas Di Pentima over 2 years ago

Updates at 7aa1380 - branch 18076-stale-cached-users-handling
Test run: developer-run-tests: #2670

If there's a stale cached user record creating the username field collision, set it to nil before retrying the update. There're two scenarios this is covering:

  1. The stale user record belongs to an existing user on LoginCluster. Another user took its username, so the new username is coming in the /arvados/v1/users/batch_update operation -- it's ok to have it set to nil temporarily.
  2. The stale user record doesn't exist on LoginCluster anymore, so resetting it to nil isn't harmful and avoids future collisions.
Actions #6

Updated by Tom Clegg over 2 years ago

The fix looks great.

Just some nits about the test:
  • Limit: math.MaxInt64 can also be spelled Limit: -1 (probably easier to read in logs)
  • The test first checks whether all clusters use z1111 as their login cluster, but all it really needs to be effective is for z1111 and z3333 to use z1111 as login cluster. (I figure eventually we'll add another cluster to our integration test suite with a different config, and at that point it will be nice to have fewer test failures to fix.)
  • for userNr := range []int{0, 1} is sort of misleading in that userNr is the array index, not the array element -- []int{9,3} would have the same result. Idiomatic Go is to say for userNr := 0; userNr < 2; userNr++ if you want a sequence of numbers, or for _, userNr := range ... if you want specific int values. This is a very nitpicky nitpick though!

LGTM, thanks!

Actions #7

Updated by Lucas Di Pentima over 2 years ago

Thanks for the suggestions! Updated at c534e74c5 and merged.

Actions #8

Updated by Lucas Di Pentima over 2 years ago

  • Status changed from In Progress to Resolved
Actions #9

Updated by Lucas Di Pentima over 2 years ago

  • Status changed from Resolved to In Progress
Actions #10

Updated by Lucas Di Pentima over 2 years ago

Once Jenkins deployed this on ce8i5, I started to see the following:

$ arv user list
Error: error updating local user records: //railsapi.internal/arvados/v1/users/batch_update: 422 Unprocessable Entity: #<ActiveRecord::RecordNotSaved: Failed to save the record> (req-vr2et8tnawgndjcxocwb)

...and in ce8i5's logs:

[req-vr2et8tnawgndjcxocwb] cached username 'ward2' collision with user 'ce8i5-tpzed-249xu4osx4us5lg' - resetting
#<ActiveRecord::RecordNotSaved: Failed to save the record>
Error 1630934673+e96dc935: 422
{"method":"PATCH","path":"/arvados/v1/users/batch_update","format":"html","controller":"Arvados::V1::UsersController","action":"batch_update","status":422,"duration":74.54,"view":0.32,"db":22.33,"request_id":null,"client_ipaddr":"127.0.0.1","client_auth":"tordo-gj3su-000000000000000","exception":"#\u003cActiveRecord::RecordNotSaved: Failed to save the record\u003e","exception_backtrace":"/var/www/arvados-api/shared/vendor_bundle/ruby/2.5.0/gems/activerecord-5.2.6/lib/active_record/persistence.rb:308:in `save!'\n/var/www/arvados-api/shared/vendor_bundle/ruby/2.5.0/gems/activerecord-5.2.6/lib/active_record/validations.rb:52:in `save!'\n/var/www/arvados-api/shared/vendor_bundle/ruby/2.5.0/gems/activereco[...]

From reading the Rails docs, the ActiveRecord::RecordNotSaved exception is raised when a before_* callback fails when calling save!(), so I thought there was some issue with resetting the user's username... and when trying to do so manually, I got:

$ arv user update -b --uuid ce8i5-tpzed-249xu4osx4us5lg --user='{"username": null}'
Error: //railsapi.internal/arvados/v1/users/ce8i5-tpzed-249xu4osx4us5lg: 422 Unprocessable Entity: Username can't be unset when the user owns repositories (req-wt8kut96pkj91l187pjn)

So I guess this is what's triggering the problem on the users model:

  before_update :verify_repositories_empty, :if => Proc.new {
    username.nil? and username_changed?
  }
Actions #11

Updated by Lucas Di Pentima over 2 years ago

My approach would be to set the conflicting username to something random (maybe a random suffix?) instead of nil, as it seems that this wouldn't cause any issues with preexisting repositories.

Actions #12

Updated by Tom Clegg over 2 years ago

Ugh! Yeah, a random string/suffix sounds like a decent workaround.

Actions #13

Updated by Lucas Di Pentima over 2 years ago

Fix at 7edaf0605 - branch 18076-user-cache-with-repository
Test run: developer-run-tests: #2676
WB1 integration test re-run: developer-run-tests-apps-workbench-integration: #2837

  • Sets a "conflictNNNNNNN" random suffix to the conflicting user's username instead of nil.
  • Expands the previous test.
Actions #14

Updated by Tom Clegg over 2 years ago

7edaf0605 LGTM, thanks!

Actions #15

Updated by Lucas Di Pentima over 2 years ago

  • % Done changed from 50 to 100
  • Status changed from In Progress to Resolved

Applied in changeset arvados-private:commit:arvados|f860ff683634d73edd63df32c284aba42f1bbf0c.

Actions #16

Updated by Peter Amstutz over 2 years ago

  • Release set to 42
Actions

Also available in: Atom PDF