Idea #13484

closed

Support multiple load-balanced API server nodes

Added by Tom Clegg over 6 years ago. Updated 24 days ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: API
Target version: -
Start date:
Due date:
Story points: 2.0

Description

Background: the Rails API server application becomes a performance bottleneck during heavy load, e.g., when hundreds of containers/nodes are running. There are some ways to respond to this -- use a bigger/faster machine, adjust logging configs, move postgresql to a different machine -- but it would be much better if the operator could add API server nodes to increase capacity. However, there are some parts of the code base that assume there's only one API server.

In this issue, we remove those barriers, so a site admin can safely add and remove additional API servers and route traffic to them with a load balancer.

(However, multi-API-server installations are not expected to support crunch1 jobs.)

Known/suspected issues:
  • Job validation code assumes git repositories are stored in the local filesystem (todo: confirm this only affects crunch1)
  • Audit log cleanup code uses flock() to avoid wastefully running concurrent cleanup threads (todo: confirm concurrent cleanup threads are harmless, and/or use a database lock instead)
  • Sample DNS update scripts (triggered by "node ping") assume the API host is the DNS server (todo: offer a sample DNS update strategy suitable for multiple nodes).
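For reference, the flock() guard mentioned in the audit log item follows roughly the pattern below. This is a minimal Python sketch of the general technique, not the Rails code; the function names and the lock path are hypothetical. Assuming the lock file is not on a shared filesystem, the lock only excludes other threads and processes on the same host, which is why it does not help once there are several API servers.

```python
import fcntl
import os
import tempfile


def try_lock(path):
    """Return an open file holding an exclusive flock on path, or None if busy."""
    f = open(path, "a")
    try:
        # LOCK_NB makes this fail immediately instead of waiting for the holder.
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        f.close()
        return None
    return f


def run_exclusively(path, work):
    """Run work() only if nobody else holds the lock; return True if it ran."""
    f = try_lock(path)
    if f is None:
        return False  # another cleanup is already running on this host
    try:
        work()
        return True
    finally:
        fcntl.flock(f, fcntl.LOCK_UN)
        f.close()


if __name__ == "__main__":
    # Hypothetical lock path, created in a scratch directory for the demo.
    lock_path = os.path.join(tempfile.mkdtemp(), "audit_logs.lock")
    print(run_exclusively(lock_path, lambda: print("cleaning up old audit logs")))
```

A database lock would replace try_lock() with something all API servers can see; that part is not sketched here.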

Related issues

Related to Arvados Epics - Idea #15941: arvados-boot (New)
Blocked by Arvados - Idea #14873: [API] Update to Rails 5 (Resolved, Lucas Di Pentima, 03/20/2019)
Blocked by Arvados - Idea #13908: [Epic] Replace SLURM for cloud job scheduling/dispatching (Resolved)
Actions #1

Updated by Lucas Di Pentima over 6 years ago

Some observations/questions:

Potential issues:
  • RequestIDs: Is this solved by a smart load balancer?
  • Multiple "sweep trashed objects" processes, one on every API server
    • Is this a configuration thing?
    • Should we split that code into a separate service?
  • How about the logs table being read from several api instances?
Actions #2

Updated by Tom Clegg over 6 years ago

Simplifying assumptions
  • OK to waste some work due to every apiserver running a trash/log sweeping thread
  • Load balancer must be configured to route all node ping requests to a single API server which is also the DNS server
  • All API servers are shut down while any API server is being upgraded
  • API servers are not aware of anything like "my ID" or what other API servers are running
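The load balancer assumption above amounts to a routing rule like the one sketched here. This is only an illustration in Python, not a real load balancer config. The ping path pattern and the backend names are assumptions made for the example; check the actual node ping route before relying on it.

```python
import itertools
import re

# Assumed shape of a node ping request path (hypothetical, for illustration).
NODE_PING = re.compile(r"^/arvados/v1/nodes/[^/]+/ping$")


def make_router(backends, ping_backend):
    """Return route(path) -> backend name.

    Node pings always go to ping_backend (the API server that is also the
    DNS server); everything else is spread round-robin over all backends.
    """
    pool = itertools.cycle(backends)

    def route(path):
        if NODE_PING.match(path):
            return ping_backend
        return next(pool)

    return route


# Hypothetical backend names.
route = make_router(["api1", "api2"], ping_backend="api1")
print(route("/arvados/v1/nodes/some-node-uuid/ping"))
print(route("/arvados/v1/collections"))
```

Since the API servers have no notion of "my ID", the pinning has to live entirely in the load balancer, as in this sketch.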
Actions #3

Updated by Tom Clegg over 6 years ago

  • Story points set to 1.0
Actions #4

Updated by Tom Morris over 6 years ago

  • Target version changed from To Be Groomed to Arvados Future Sprints
Actions #5

Updated by Tom Morris over 5 years ago

  • Target version changed from Arvados Future Sprints to 2019-03-27 Sprint
Actions #6

Updated by Tom Morris over 5 years ago

  • Story points changed from 1.0 to 2.0
Actions #7

Updated by Tom Morris over 5 years ago

  • Assigned To set to Peter Amstutz
Actions #8

Updated by Peter Amstutz over 5 years ago

audit_logs.rb#delete_old creates tmp/audit_logs.lock

update_priority.rb#update_priority creates tmp/update_priority.lock

refresh_permission_view runs in a transaction and takes a table lock

sweep_trashed_objects#sweep_now does not set up an explicit transaction, but individual statements should be transactional. Seems like delete_project_and_contents should be in an explicit transaction at least.

crunch_dispatch.rb interacts with local slurm and git repo (specific to crunchv1)

commit_ancestor.rb and commit.rb read a local git repo (specific to crunchv1)

job.rb touches the file at crunch_refresh_trigger (specific to crunchv1)

The nodes table updates local DNS configuration. Used by node manager, could technically be used by on-prem configuration but I don't think anyone does.

Websocket events are triggered by database NOTIFY.

Async permission updates use local cache freshness to suppress updates. However, if another node performs an update it is not harmful.

The discovery document is stored in local cache. There is a generatedAt field which is db_current_time, which means otherwise identical servers on different hosts are likely to report different generatedAt times.

Login process uses a session. I'm not sure if this is necessary, or what the implications are if the user starts a session on one host and the next request goes to a different host.

Actions #9

Updated by Peter Amstutz over 5 years ago

Peter Amstutz wrote:

audit_logs.rb#delete_old creates tmp/audit_logs.lock

update_priority.rb#update_priority creates tmp/update_priority.lock

refresh_permission_view runs in a transaction and takes a table lock

I think these would benefit from being wrapped in explicit transactions, but there doesn't appear to be much benefit from the file lock, except in preventing overlapping threads if the operation takes longer than the quiet period based on Rails.cache expiration. If that is important, we could use a table lock instead.
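To make the "quiet period" point concrete, here is a minimal Python sketch of throttling a periodic task with a locally cached timestamp. It is not the Rails code; the class name and the 300 second interval are made up for illustration, and the in-memory timestamp stands in for the Rails.cache entry.

```python
import time


class QuietPeriod:
    """Run a task at most once per `interval` seconds within this process."""

    def __init__(self, interval, clock=time.monotonic):
        self.interval = interval
        self.clock = clock
        self.last_run = None  # stands in for the locally cached timestamp

    def maybe_run(self, task):
        """Run task() unless it already ran within the quiet period."""
        now = self.clock()
        if self.last_run is not None and now - self.last_run < self.interval:
            return False  # still inside the quiet period, skip
        self.last_run = now
        task()
        return True


# Each API server has its own local state, so two servers both do the work.
server_a = QuietPeriod(300)
server_b = QuietPeriod(300)
print(server_a.maybe_run(lambda: None), server_b.maybe_run(lambda: None))
print(server_a.maybe_run(lambda: None))
```

With N API servers, the task can run up to N times per quiet period, which matches the "OK to waste some work" assumption in note 2. The throttle does not stop two runs from overlapping if one run takes longer than the interval; that is the only case where a lock would matter.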

sweep_trashed_objects#sweep_now does not set up an explicit transaction, but individual statements should be transactional. Seems like delete_project_and_contents should be in an explicit transaction at least.

Async permission updates use local cache freshness to suppress updates. However if another node triggers a permission update it is not harmful.

These don't have the file lock, but otherwise the same comment as above applies.

crunch_dispatch.rb interacts with local slurm and git repo (specific to crunchv1)

commit_ancestor.rb and commit.rb read a local git repo (specific to crunchv1)

job.rb touches the file at crunch_refresh_trigger (specific to crunchv1)

Multiple API hosts will not support crunchv1.

The nodes table updates local DNS configuration. Used by node manager, could technically be used by on-prem configuration but I don't think anyone does.

Multiple API hosts are incompatible with "classic" node management (node manager), where ping scripts and the nodes table trigger DNS updates.

Should be fine using either crunch-dispatch-slurm with an externally managed / static slurm cluster, or using crunch-dispatch-cloud.

Websocket events are triggered by database NOTIFY.

The discovery document is stored in local cache. There is a generatedAt field which is db_current_time, which means otherwise identical servers on different hosts are likely to report different generatedAt times.

I don't think there is anything to do here.
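If anything ever needs to check that two API servers serve the same discovery document (a monitoring check, say), it would have to ignore generatedAt. A minimal Python sketch, assuming the documents have already been fetched and parsed into dicts; the function name and sample values are hypothetical.

```python
def same_discovery_document(a, b, ignore=("generatedAt",)):
    """Compare two parsed discovery documents, skipping per-host fields."""

    def strip(doc):
        return {k: v for k, v in doc.items() if k not in ignore}

    return strip(a) == strip(b)


# Made-up sample documents from two hosts; only generatedAt differs.
doc_host1 = {"version": "v1", "generatedAt": "2019-03-20T10:00:00Z"}
doc_host2 = {"version": "v1", "generatedAt": "2019-03-20T10:00:07Z"}
print(same_discovery_document(doc_host1, doc_host2))
```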

Login process uses a session. I'm not sure if this is necessary, or what the implications are if the user starts a session on one host and the next request goes to a different host.

I need to research this some more.

Actions #10

Updated by Peter Amstutz over 5 years ago

There's some code from 2013 that sets values in the session, but I can't find anything that reads the session. So the session is probably irrelevant.

Actions #13

Updated by Peter Amstutz over 5 years ago

  • Blocked by Idea #14987: [API] Upgrade to Rails 5 added
Actions #14

Updated by Peter Amstutz over 5 years ago

  • Blocked by Idea #14873: [API] Update to Rails 5 added
Actions #15

Updated by Peter Amstutz over 5 years ago

  • Blocked by deleted (Idea #14987: [API] Upgrade to Rails 5)
Actions #16

Updated by Peter Amstutz over 5 years ago

  • Target version changed from 2019-03-27 Sprint to 2019-04-10 Sprint
Actions #17

Updated by Tom Morris over 5 years ago

  • Target version changed from 2019-04-10 Sprint to Arvados Future Sprints
Actions #18

Updated by Peter Amstutz over 5 years ago

  • Assigned To deleted (Peter Amstutz)
Actions #19

Updated by Ward Vandewege over 5 years ago

  • Blocked by Idea #13908: [Epic] Replace SLURM for cloud job scheduling/dispatching added
Actions #20

Updated by Peter Amstutz almost 5 years ago

  • Target version changed from Arvados Future Sprints to 2020-03-11 Sprint
Actions #21

Updated by Peter Amstutz almost 5 years ago

  • Subject changed from [API] Support multiple load-balanced API server nodes to Support multiple load-balanced API server nodes
Actions #23

Updated by Peter Amstutz over 4 years ago

  • Assigned To set to Ward Vandewege
Actions #24

Updated by Peter Amstutz over 4 years ago

  • Target version changed from 2020-03-11 Sprint to 2020-03-25 Sprint
Actions #25

Updated by Peter Amstutz over 4 years ago

  • Target version changed from 2020-03-25 Sprint to 2020-04-08 Sprint
Actions #26

Updated by Peter Amstutz over 4 years ago

  • Target version changed from 2020-04-08 Sprint to 2020-04-22
  • Assigned To deleted (Ward Vandewege)
Actions #27

Updated by Peter Amstutz over 4 years ago

  • Target version changed from 2020-04-22 to 2020-05-06 Sprint
Actions #28

Updated by Peter Amstutz over 4 years ago

  • Target version changed from 2020-05-06 Sprint to 2020-05-20 Sprint
Actions #29

Updated by Peter Amstutz over 4 years ago

  • Target version changed from 2020-05-20 Sprint to 2020-06-03 Sprint
Actions #30

Updated by Peter Amstutz over 4 years ago

  • Target version changed from 2020-06-03 Sprint to 2020-06-17 Sprint
Actions #31

Updated by Peter Amstutz over 4 years ago

Actions #33

Updated by Peter Amstutz over 4 years ago

  • Target version changed from 2020-06-17 Sprint to 2020-07-01 Sprint
  • Assigned To set to Ward Vandewege
Actions #34

Updated by Peter Amstutz over 4 years ago

  • Target version changed from 2020-07-01 Sprint to 2020-07-15
Actions #35

Updated by Ward Vandewege over 4 years ago

  • Target version changed from 2020-07-15 to Arvados Future Sprints
Actions #36

Updated by Peter Amstutz over 3 years ago

  • Target version deleted (Arvados Future Sprints)
Actions #37

Updated by Peter Amstutz almost 2 years ago

  • Release set to 60
Actions #38

Updated by Peter Amstutz 9 months ago

  • Target version set to Future
Actions #39

Updated by Peter Amstutz 24 days ago

  • Release deleted (60)
  • Target version deleted (Future)
  • Status changed from New to Resolved

This configuration has been used in production for a while now.
