Project

General

Profile

Actions

Bug #18004

closed

Cached federated token creation race condition on RailsAPI

Added by Lucas Di Pentima over 2 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
API
Target version:
Story points:
-
Release relationship:
Auto

Description

We got a report that some users were sometimes seeing 502 errors on wb2 when accessing collections, nginx error log show a primary key violation on insert:

App 18334 stderr: [ 2021-08-06 09:54:55.4629 18396/0x00000000022f0380(Worker 1) utils.rb:87 ]: *** Exception ActiveRecord::RecordNotUnique in Rack application object (PG::UniqueViolation: ERROR:  duplicate key value violates unique constraint "index_api_client_authorizations_on_api_token" 
App 18334 stderr: DETAIL:  Key (api_token)=(TOKEN) already exists.
App 18334 stderr: : INSERT INTO "api_client_authorizations" ("api_token", "api_client_id", "user_id", "expires_at", "created_at", "updated_at", "uuid") VALUES ($1, $2, $3, $4, $5, $6, $7) RETURNING "id") (process 18396, thread 0x00000000022f0380(Worker 1)):
App 18334 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/activerecord-5.2.6/lib/active_record/connection_adapters/postgresql_adapter.rb:611:in `async_exec_params'
[...]
App 18334 stderr: from /var/www/arvados-api/current/app/models/arvados_model.rb:145:in `create'
App 18334 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/activerecord-5.2.6/lib/active_record/relation.rb:85:in `block in create'
App 18334 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/activerecord-5.2.6/lib/active_record/relation.rb:281:in `scoping'
App 18334 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/activerecord-5.2.6/lib/active_record/relation.rb:85:in `create'
App 18334 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/activerecord-5.2.6/lib/active_record/relation.rb:164:in `find_or_create_by'
App 18334 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/activerecord-5.2.6/lib/active_record/querying.rb:8:in `find_or_create_by'
App 18334 stderr: from /var/www/arvados-api/current/app/models/api_client_authorization.rb:367:in `block in validate'
App 18334 stderr: from /var/www/arvados-api/current/lib/current_api_client.rb:138:in `block in act_as_system_user'
App 18334 stderr: from /var/www/arvados-api/current/lib/current_api_client.rb:149:in `act_as_user'
App 18334 stderr: from /var/www/arvados-api/current/lib/current_api_client.rb:137:in `act_as_system_user'
App 18334 stderr: from /var/www/arvados-api/current/app/models/api_client_authorization.rb:312:in `validate'
App 18334 stderr: from /var/www/arvados-api/current/app/middlewares/arvados_api_token.rb:52:in `block in call'
App 18334 stderr: from /var/www/arvados-api/current/app/middlewares/arvados_api_token.rb:49:in `each'
App 18334 stderr: from /var/www/arvados-api/current/app/middlewares/arvados_api_token.rb:49:in `call'
App 18334 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/railties-5.2.6/lib/rails/engine.rb:524:in `call'
App 18334 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/railties-5.2.6/lib/rails/railtie.rb:190:in `public_send'
App 18334 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/railties-5.2.6/lib/rails/railtie.rb:190:in `method_missing'
App 18334 stderr: from /usr/share/ruby/vendor_ruby/phusion_passenger/rack/thread_handler_extension.rb:97:in `process_request'
App 18334 stderr: from /usr/share/ruby/vendor_ruby/phusion_passenger/request_handler/thread_handler.rb:152:in `accept_and_process_next_request'
App 18334 stderr: from /usr/share/ruby/vendor_ruby/phusion_passenger/request_handler/thread_handler.rb:113:in `main_loop'
App 18334 stderr: from /usr/share/ruby/vendor_ruby/phusion_passenger/request_handler.rb:416:in `block (3 levels) in start_threads'
App 18334 stderr: from /usr/share/ruby/vendor_ruby/phusion_passenger/utils.rb:113:in `block in create_thread_and_abort_on_exception'
[ 2021-08-06 09:54:55.4633 18288/7f15cbfff700 age/Cor/Con/InternalUtils.cpp:112 ]: [Client 2-787] Sending 502 response: application did not send a complete response

It seems the problem is related to a race when the remote token isn't cached on the cluster and many request for the same token are in flight, some of them will try to create the same cached token producing a collision.


Subtasks 1 (0 open1 closed)

Task #18025: Review 18004-cached-token-race-conditionResolvedTom Clegg08/17/2021Actions
Actions #1

Updated by Lucas Di Pentima over 2 years ago

Test exposing the bug at 0be7e79 - branch 18004-cached-token-race-condition
Test run with failure: developer-run-tests-remainder: #2730

Actions #2

Updated by Tom Clegg over 2 years ago

Maybe we just need something like this (from users_controller.rb):

      begin
        u = User.find_or_create_by(uuid: uuid)
      rescue ActiveRecord::RecordNotUnique
        retry
      end

...but we should make sure to only retry once, not infinity times, in case RecordNotUnique is caused by a constraint other than uuid.

Actions #3

Updated by Lucas Di Pentima over 2 years ago

  • Status changed from New to In Progress
Actions #4

Updated by Lucas Di Pentima over 2 years ago

  • % Done changed from 0 to 100
  • Status changed from In Progress to Resolved
Actions #5

Updated by Lucas Di Pentima over 2 years ago

  • Status changed from Resolved to In Progress
Actions #6

Updated by Lucas Di Pentima over 2 years ago

Updates at 516669845
Test run: developer-run-tests: #2643

  • Found (and fixed) an additional race condition bug (both detected by the same test): the user record creation code suffered from the same issue as the token code.
  • Locally ran the test 10 times in a row without failure (takes ~70s per run), every run creates 100 concurrent requests so I'm pretty confident the problem is fixed.
Actions #7

Updated by Lucas Di Pentima over 2 years ago

  • Target version changed from 2021-08-18 sprint to 2021-09-01 sprint
Actions #8

Updated by Tom Clegg over 2 years ago

LGTM, thanks!

Actions #9

Updated by Lucas Di Pentima over 2 years ago

  • % Done changed from 0 to 100
  • Status changed from In Progress to Resolved
Actions

Also available in: Atom PDF