Bug #18076

Properly handle cached user records that don't exist anymore on federation scenarios

Added by Lucas Di Pentima about 2 months ago. Updated about 2 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
API
Target version:
Start date:
09/02/2021
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
-

Description

If for some reason a user's UUID on the LoginCluster changes (or the user record is removed), the cached copies on the federated clusters get stale and probably cause username validation issues when trying to cache the new record.
The federated controllers use the batch_update internal API to refresh the user list cache but it doesn't take into account that some users may not exist anymore so cache records aren't properly cleaned/handled.


Subtasks

Task #18083: Review 18076-stale-cached-users-handlingResolvedLucas Di Pentima

Task #18110: Review 18076-user-cache-with-repositoryResolvedLucas Di Pentima


Related issues

Related to Arvados - Bug #18094: Remove unused Users#update_uuid endpointResolved09/03/2021

Associated revisions

Revision 726f6515
Added by Lucas Di Pentima about 2 months ago

Merge branch '18076-stale-cached-users-handling' into main. Closes #18076

Arvados-DCO-1.1-Signed-off-by: Lucas Di Pentima <>

History

#1 Updated by Peter Amstutz about 2 months ago

  • Target version set to 2021-09-15 sprint

#2 Updated by Peter Amstutz about 2 months ago

  • Assigned To set to Lucas Di Pentima

#3 Updated by Lucas Di Pentima about 2 months ago

  • Status changed from New to In Progress

#4 Updated by Lucas Di Pentima about 2 months ago

  • Related to Bug #18094: Remove unused Users#update_uuid endpoint added

#5 Updated by Lucas Di Pentima about 2 months ago

Updates at 7aa1380 - branch 18076-stale-cached-users-handling
Test run: https://ci.arvados.org/job/developer-run-tests/2670/

If there's a stale cached user record creating the username field collision, set it to nil before retrying the update. There're two scenarios this is covering:

  1. The stale user record belongs to an existing user on LoginCluster. Another user took its username, so the new username is coming in the /arvados/v1/users/batch_update operation -- it's ok to have it set to nil temporarily.
  2. The stale user record doesn't exist on LoginCluster anymore, so resetting it to nil isn't harmful and avoids future collisions.

#6 Updated by Tom Clegg about 2 months ago

The fix looks great.

Just some nits about the test:
  • Limit: math.MaxInt64 can also be spelled Limit: -1 (probably easier to read in logs)
  • The test first checks whether all clusters use z1111 as their login cluster, but all it really needs to be effective is for z1111 and z3333 to use z1111 as login cluster. (I figure eventually we'll add another cluster to our integration test suite with a different config, and at that point it will be nice to have fewer test failures to fix.)
  • for userNr := range []int{0, 1} is sort of misleading in that userNr is the array index, not the array element -- []int{9,3} would have the same result. Idiomatic Go is to say for userNr := 0; userNr < 2; userNr++ if you want a sequence of numbers, or for _, userNr := range ... if you want specific int values. This is a very nitpicky nitpick though!

LGTM, thanks!

#7 Updated by Lucas Di Pentima about 2 months ago

Thanks for the suggestions! Updated at c534e74c5 and merged.

#8 Updated by Lucas Di Pentima about 2 months ago

  • Status changed from In Progress to Resolved

#9 Updated by Lucas Di Pentima about 2 months ago

  • Status changed from Resolved to In Progress

#10 Updated by Lucas Di Pentima about 2 months ago

Once Jenkins deployed this on ce8i5, I started to see the following:

$ arv user list
Error: error updating local user records: //railsapi.internal/arvados/v1/users/batch_update: 422 Unprocessable Entity: #<ActiveRecord::RecordNotSaved: Failed to save the record> (req-vr2et8tnawgndjcxocwb)

...and in ce8i5's logs:

[req-vr2et8tnawgndjcxocwb] cached username 'ward2' collision with user 'ce8i5-tpzed-249xu4osx4us5lg' - resetting
#<ActiveRecord::RecordNotSaved: Failed to save the record>
Error 1630934673+e96dc935: 422
{"method":"PATCH","path":"/arvados/v1/users/batch_update","format":"html","controller":"Arvados::V1::UsersController","action":"batch_update","status":422,"duration":74.54,"view":0.32,"db":22.33,"request_id":null,"client_ipaddr":"127.0.0.1","client_auth":"tordo-gj3su-000000000000000","exception":"#\u003cActiveRecord::RecordNotSaved: Failed to save the record\u003e","exception_backtrace":"/var/www/arvados-api/shared/vendor_bundle/ruby/2.5.0/gems/activerecord-5.2.6/lib/active_record/persistence.rb:308:in `save!'\n/var/www/arvados-api/shared/vendor_bundle/ruby/2.5.0/gems/activerecord-5.2.6/lib/active_record/validations.rb:52:in `save!'\n/var/www/arvados-api/shared/vendor_bundle/ruby/2.5.0/gems/activereco[...]

From reading the Rails docs, the ActiveRecord::RecordNotSaved exception is raised when a before_* callback fails when calling save!(), so I thought there was some issue with resetting the user's username... and when trying to do so manually, I got:

$ arv user update -b --uuid ce8i5-tpzed-249xu4osx4us5lg --user='{"username": null}'
Error: //railsapi.internal/arvados/v1/users/ce8i5-tpzed-249xu4osx4us5lg: 422 Unprocessable Entity: Username can't be unset when the user owns repositories (req-wt8kut96pkj91l187pjn)

So I guess this is what's triggering the problem on the users model:

  before_update :verify_repositories_empty, :if => Proc.new {
    username.nil? and username_changed?
  }

#11 Updated by Lucas Di Pentima about 2 months ago

My approach would be to set the conflicting username to something random (maybe a random suffix?) instead of nil, as it seems that this wouldn't cause any issues with preexisting repositories.

#12 Updated by Tom Clegg about 2 months ago

Ugh! Yeah, a random string/suffix sounds like a decent workaround.

#13 Updated by Lucas Di Pentima about 2 months ago

Fix at 7edaf0605 - branch 18076-user-cache-with-repository
Test run: https://ci.arvados.org/job/developer-run-tests/2676/
WB1 integration test re-run: https://ci.arvados.org/job/developer-run-tests-apps-workbench-integration/2837/

  • Sets a "conflictNNNNNNN" random suffix to the conflicting user's username instead of nil.
  • Expands the previous test.

#14 Updated by Tom Clegg about 2 months ago

7edaf0605 LGTM, thanks!

#15 Updated by Lucas Di Pentima about 2 months ago

  • % Done changed from 50 to 100
  • Status changed from In Progress to Resolved

Applied in changeset arvados-private:commit:arvados|f860ff683634d73edd63df32c284aba42f1bbf0c.

Also available in: Atom PDF