Bug #13593

[API] Sequence of "create group" requests runs slowly, and can crash API server

Added by Peter Amstutz over 1 year ago. Updated 7 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
03/12/2019
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
3.0
Release relationship:
Auto

Description

A user reported that creating many projects at once resulted in extreme slowdown and eventually crashing the API server unless the create requests were throttled.

Creating a project (or any other group) also refreshes the permissions graph in the same transaction.

This makes the API call slow, but we intend/expect that other API calls can proceed while the update is in progress, so other (non-permission-modifying) API calls aren't expected to be slowed down.

The approach would be:

  • Accept an async param to permission-changing calls that would delay the permission graph update for a server configurable number of seconds in the future.
  • When updating the permission graph asynchronously, use the CONCURRENTLY parameter to avoid blocking SELECT while it’s being updated. (PostgreSQL 9.4 feature, check API server’s dependency requirements).
  • Use status 202 (Accepted) on async call responses.
  • Make a performance analysis comparison to make sure the issue is solved.
  • Update documentation.

Subtasks

Task #14895: Review 13593-async-perm-graph-updateResolvedPeter Amstutz


Related issues

Related to Arvados - Bug #14966: [API] Fix hanging test - suspect permission changesResolved03/13/2019

Blocks Arvados - Feature #13688: [API] optional asynchronous flag for create/update operationsNew

Associated revisions

Revision 5b096288
Added by Lucas Di Pentima 7 months ago

Merge branch '13593-async-perm-graph-update'
Closes #13593

Arvados-DCO-1.1-Signed-off-by: Lucas Di Pentima <>

History

#1 Updated by Peter Amstutz over 1 year ago

  • Status changed from New to In Progress

#2 Updated by Peter Amstutz over 1 year ago

  • Description updated (diff)

#3 Updated by Tom Morris over 1 year ago

  • Status changed from In Progress to New
  • Target version set to To Be Groomed

#4 Updated by Peter Amstutz over 1 year ago

  • Subject changed from Batch create groups to Creating many groups is slow

#5 Updated by Peter Amstutz over 1 year ago

  • Tracker changed from Feature to Bug
  • Description updated (diff)

#6 Updated by Tom Clegg over 1 year ago

  • Subject changed from Creating many groups is slow to [API] Sequence of "create group" requests runs slowly, and can crash API server
  • Description updated (diff)

#7 Updated by Tom Clegg over 1 year ago

  • Description updated (diff)

#8 Updated by Tom Clegg over 1 year ago

  • Blocks Feature #13688: [API] optional asynchronous flag for create/update operations added

#9 Updated by Tom Clegg over 1 year ago

  • Description updated (diff)

#10 Updated by Peter Amstutz 8 months ago

At Ward's suggestion, I'm bumping this since it's still an outstanding user request.

#11 Updated by Lucas Di Pentima 8 months ago

  • Description updated (diff)

#12 Updated by Lucas Di Pentima 8 months ago

  • Description updated (diff)
  • Story points set to 3.0

#13 Updated by Lucas Di Pentima 8 months ago

  • Target version changed from To Be Groomed to Arvados Future Sprints

#14 Updated by Tom Morris 8 months ago

  • Target version changed from Arvados Future Sprints to 2019-03-13 Sprint

#15 Updated by Lucas Di Pentima 8 months ago

  • Assigned To set to Lucas Di Pentima

#16 Updated by Lucas Di Pentima 8 months ago

  • Status changed from New to In Progress

#17 Updated by Tom Morris 8 months ago

  • Release set to 15

#18 Updated by Lucas Di Pentima 7 months ago

Updates at 653f7d5a9 - branch 13593-async-perm-graph-update
Test run: https://ci.curoverse.com/job/developer-run-tests/1113/

Updates

  • New configuration knob: async_permissions_update_interval to hold the number of seconds a permission update will be deferred (default: 60)
  • When passing async: true to a group's update or create call, a new thread is created that waits the number of seconds before doing a permission graph update.
  • This deferred graph update is de-duplicated, so in case many async calls are made within the configure time, only one thread is created.
  • The async calls return 202 Accepted on success
  • API docs updated

The CONCURRENTLY parameter wasn't added because the permission view needs an additional index.

Performance analysis

The test environment consisted on a newly created arvbox instance on a 4GB RAM Virtualbox VM.
The measurements were done by subsequently creating new groups (as a non-admin user) in batchs of 1000 groups.
  • First, group create calls using the current API server:
    • 1st try: 0.158 secs/group
    • 2nd try: 0.171 secs/group
    • 3rd try: 0.191 secs/group
    • 4th try: 0.218 secs/group
    • 5th try: 0.225 secs/group (timings continue to rise as more groups are created)
  • Then, group create calls using the new async parameter (without removing previous groups):
    • 1st try: 0.065 secs/group
    • 2nd try: 0.059 secs/group
    • 3rd try: 0.06 secs/group
    • 4th try: 0.063 secs/group

This difference (almost 4x) may appear not so significant, but we should take into account the fact that is an almost empty instance.

The script used to do the benchmarks:

require 'arvados'

group_name = (0...8).map { (65 + rand(26)).chr }.join
group_qty = 1000
api = Arvados.new

puts "Starting creation of #{group_qty} groups..." 

start_t = Time.now.to_i
(0..group_qty).each do |group_num|
    api.group.create(group: {name: "#{group_name} #{group_num}"}, async: true)
end
end_t = Time.now.to_i

puts "Finished! Avg group creation time: #{1.0*(end_t-start_t)/group_qty} secs" 

#19 Updated by Peter Amstutz 7 months ago

    exp = Rails.configuration.async_permissions_update_interval.seconds
    Rails.cache.fetch('AsyncRefreshPermissionView', expires_in: exp) do
      # Schedule a new permission update and return immediately
      Thread.new do
        Thread.current.abort_on_exception = false
        begin
          sleep(exp)
          do_refresh_permission_view
        rescue => e
          Rails.logger.error "Updating permission view: #{e}\n#{e.backtrace.join("\n\t")}" 
        ensure
          ActiveRecord::Base.connection.close
        end
      end
      true
    end

I feel like there's a potential race condition here if AsyncRefreshPermissionView does not expire before the do_refresh_permission_view starts.

I can kind of convince myself that shouldn't happen, that AsyncRefreshPermissionView should always expire a few microseconds before sleep() returns, but with timing it is always hard to be sure (for example, what if the system clock is changed?). Maybe add Rails.cache.delete('AsyncRefreshPermissionView') between sleep(exp) and do_refresh_permission_view?

I feel like 60 seconds is a long time for the refresh period. A default of 15 or 20 seconds feels better. The ideal delay is probably be slightly longer than it takes to complete one permission table update, since the real goal here is to avoid delaying API calls and loading the database by stacking up a bunch of mostly redundant permission updates.

I noticed this:

  def self.invalidate_permissions_cache(timestamp=nil)
    if Rails.configuration.async_permissions_update
      timestamp = DbCurrentTime::db_current_time.to_i if timestamp.nil?
      connection.execute "NOTIFY invalidate_permissions_cache, '#{timestamp}'" 
    else
      refresh_permission_view
    end
  end

I checked and async_permissions_update and the invalidate_permissions_cache message are from #8886 which was a previous attempt at asynchronous permission updates that predates both the materialized permission view and (obviously) this branch. I think that code can be removed entirely so that invalidate_permissions_cache just calls refresh_permission_view.

#20 Updated by Peter Amstutz 7 months ago

From chat:

It would be good to have a test that confirms the permission update actually happened. Suggest a test that:

  1. sets permission update period to 5s
  2. does an async update
  3. tests permission (shouldn't be updated yet)
  4. waits 5 seconds
  5. tests permission again (should be updated now)

#21 Updated by Lucas Di Pentima 7 months ago

Updates at 7618631a4
Test run: https://ci.curoverse.com/job/developer-run-tests/1115/

  • As suggested, changed from 60 to 20 seconds the default interval config
  • Removed old code related to experimental async permissions update
  • Enhanced test to not only check for status 202 Accepted, but also check that permission updates are actually being deferred.

#22 Updated by Lucas Di Pentima 7 months ago

  • Target version changed from 2019-03-13 Sprint to 2019-03-27 Sprint
  • Story points changed from 3.0 to 1.0

#23 Updated by Peter Amstutz 7 months ago

Lucas Di Pentima wrote:

Updates at 7618631a4
Test run: https://ci.curoverse.com/job/developer-run-tests/1115/

  • As suggested, changed from 60 to 20 seconds the default interval config
  • Removed old code related to experimental async permissions update
  • Enhanced test to not only check for status 202 Accepted, but also check that permission updates are actually being deferred.

This LGTM

#24 Updated by Lucas Di Pentima 7 months ago

  • Target version changed from 2019-03-27 Sprint to 2019-03-13 Sprint
  • Story points changed from 1.0 to 3.0

#25 Updated by Lucas Di Pentima 7 months ago

  • Status changed from In Progress to Resolved

#26 Updated by Tom Clegg 7 months ago

  • Related to Bug #14966: [API] Fix hanging test - suspect permission changes added

Also available in: Atom PDF