Bug #18004
closedCached federated token creation race condition on RailsAPI
Description
We got a report that some users were sometimes seeing 502 errors on wb2 when accessing collections, nginx error log show a primary key violation on insert:
App 18334 stderr: [ 2021-08-06 09:54:55.4629 18396/0x00000000022f0380(Worker 1) utils.rb:87 ]: *** Exception ActiveRecord::RecordNotUnique in Rack application object (PG::UniqueViolation: ERROR: duplicate key value violates unique constraint "index_api_client_authorizations_on_api_token" App 18334 stderr: DETAIL: Key (api_token)=(TOKEN) already exists. App 18334 stderr: : INSERT INTO "api_client_authorizations" ("api_token", "api_client_id", "user_id", "expires_at", "created_at", "updated_at", "uuid") VALUES ($1, $2, $3, $4, $5, $6, $7) RETURNING "id") (process 18396, thread 0x00000000022f0380(Worker 1)): App 18334 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/activerecord-5.2.6/lib/active_record/connection_adapters/postgresql_adapter.rb:611:in `async_exec_params' [...] App 18334 stderr: from /var/www/arvados-api/current/app/models/arvados_model.rb:145:in `create' App 18334 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/activerecord-5.2.6/lib/active_record/relation.rb:85:in `block in create' App 18334 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/activerecord-5.2.6/lib/active_record/relation.rb:281:in `scoping' App 18334 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/activerecord-5.2.6/lib/active_record/relation.rb:85:in `create' App 18334 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/activerecord-5.2.6/lib/active_record/relation.rb:164:in `find_or_create_by' App 18334 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/activerecord-5.2.6/lib/active_record/querying.rb:8:in `find_or_create_by' App 18334 stderr: from /var/www/arvados-api/current/app/models/api_client_authorization.rb:367:in `block in validate' App 18334 stderr: from /var/www/arvados-api/current/lib/current_api_client.rb:138:in `block in act_as_system_user' App 18334 stderr: from /var/www/arvados-api/current/lib/current_api_client.rb:149:in `act_as_user' App 18334 stderr: from /var/www/arvados-api/current/lib/current_api_client.rb:137:in `act_as_system_user' App 18334 stderr: from /var/www/arvados-api/current/app/models/api_client_authorization.rb:312:in `validate' App 18334 stderr: from /var/www/arvados-api/current/app/middlewares/arvados_api_token.rb:52:in `block in call' App 18334 stderr: from /var/www/arvados-api/current/app/middlewares/arvados_api_token.rb:49:in `each' App 18334 stderr: from /var/www/arvados-api/current/app/middlewares/arvados_api_token.rb:49:in `call' App 18334 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/railties-5.2.6/lib/rails/engine.rb:524:in `call' App 18334 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/railties-5.2.6/lib/rails/railtie.rb:190:in `public_send' App 18334 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/railties-5.2.6/lib/rails/railtie.rb:190:in `method_missing' App 18334 stderr: from /usr/share/ruby/vendor_ruby/phusion_passenger/rack/thread_handler_extension.rb:97:in `process_request' App 18334 stderr: from /usr/share/ruby/vendor_ruby/phusion_passenger/request_handler/thread_handler.rb:152:in `accept_and_process_next_request' App 18334 stderr: from /usr/share/ruby/vendor_ruby/phusion_passenger/request_handler/thread_handler.rb:113:in `main_loop' App 18334 stderr: from /usr/share/ruby/vendor_ruby/phusion_passenger/request_handler.rb:416:in `block (3 levels) in start_threads' App 18334 stderr: from /usr/share/ruby/vendor_ruby/phusion_passenger/utils.rb:113:in `block in create_thread_and_abort_on_exception' [ 2021-08-06 09:54:55.4633 18288/7f15cbfff700 age/Cor/Con/InternalUtils.cpp:112 ]: [Client 2-787] Sending 502 response: application did not send a complete response
It seems the problem is related to a race when the remote token isn't cached on the cluster and many request for the same token are in flight, some of them will try to create the same cached token producing a collision.
Updated by Lucas Di Pentima over 3 years ago
Test exposing the bug at 0be7e79 - branch 18004-cached-token-race-condition
Test run with failure: developer-run-tests-remainder: #2730
Updated by Tom Clegg over 3 years ago
Maybe we just need something like this (from users_controller.rb):
begin
u = User.find_or_create_by(uuid: uuid)
rescue ActiveRecord::RecordNotUnique
retry
end
...but we should make sure to only retry once, not infinity times, in case RecordNotUnique is caused by a constraint other than uuid.
Updated by Lucas Di Pentima over 3 years ago
- Status changed from New to In Progress
Updated by Lucas Di Pentima over 3 years ago
- % Done changed from 0 to 100
- Status changed from In Progress to Resolved
Applied in changeset arvados|9214382fc7083d0310675ce3760144ad7b6cb070.
Updated by Lucas Di Pentima over 3 years ago
- Status changed from Resolved to In Progress
Updated by Lucas Di Pentima over 3 years ago
Updates at 516669845
Test run: developer-run-tests: #2643
- Found (and fixed) an additional race condition bug (both detected by the same test): the user record creation code suffered from the same issue as the token code.
- Locally ran the test 10 times in a row without failure (takes ~70s per run), every run creates 100 concurrent requests so I'm pretty confident the problem is fixed.
Updated by Lucas Di Pentima over 3 years ago
- Target version changed from 2021-08-18 sprint to 2021-09-01 sprint
Updated by Lucas Di Pentima about 3 years ago
- % Done changed from 0 to 100
- Status changed from In Progress to Resolved
Applied in changeset arvados|22fcbed20b2691bcdfa5854004dcb95aa1f2e40d.