Idea #13484 (closed)
Support multiple load-balanced API server nodes
Description
Background: the Rails API server application becomes a performance bottleneck during heavy load, e.g., when hundreds of containers/nodes are running. There are some ways to respond to this -- use a bigger/faster machine, adjust logging configs, move postgresql to a different machine -- but it would be much better if the operator could add API server nodes to increase capacity. However, there are some parts of the code base that assume there's only one API server.
In this issue, we remove those barriers, so a site admin can safely add and remove additional API servers and route traffic to them with a load balancer.
(However, multi-API-server installations are not expected to support crunch1 jobs.)
Known/suspected issues:
- Job validation code assumes git repositories are stored in the local filesystem (todo: confirm this only affects crunch1)
- Audit log cleanup code uses flock() to avoid wastefully running concurrent cleanup threads (todo: confirm concurrent cleanup threads are harmless, and/or use a database lock instead)
- Sample DNS update scripts (triggered by "node ping") assume the API host is the DNS server (todo: offer a sample DNS update strategy suitable for multiple nodes).
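To make the flock() concern concrete, here is a minimal Ruby sketch of a non-blocking file lock guard. It is not the actual audit log cleanup code: the helper name `with_local_lock` and the temporary lock path are hypothetical. The point it illustrates is that the lock file lives on one host's filesystem, so it can only coordinate processes on that host, not API servers on different machines.

```ruby
require 'tmpdir'

# Hypothetical helper: run the block only if nobody else holds the lock file.
# Returns the block's value, or :skipped if the lock is already held.
def with_local_lock(path)
  File.open(path, File::RDWR | File::CREAT, 0o644) do |f|
    # With LOCK_NB, flock returns false instead of blocking when the lock is taken.
    return :skipped unless f.flock(File::LOCK_EX | File::LOCK_NB)
    begin
      yield
    ensure
      f.flock(File::LOCK_UN)
    end
  end
end

# Usage: a second attempt made while the first still holds the lock is skipped.
lock_path = File.join(Dir.mktmpdir, 'audit_logs.lock')
outer = with_local_lock(lock_path) do
  with_local_lock(lock_path) { :cleanup_ran }
end
puts outer.inspect
```

A database-level lock would play the same role across hosts, since every API server talks to the same database.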
Updated by Lucas Di Pentima over 6 years ago
Some observations/questions:
Potential issues:
- RequestIDs: Is this solved by a smart load balancer?
- Multiple "sweep trashed objects" processes, one on every API server
  - Is this a configuration thing?
  - Should that code be split into a separate service?
- How about the logs table being read from several API instances?
Updated by Tom Clegg over 6 years ago
- OK to waste some work due to every apiserver running a trash/log sweeping thread
- Load balancer must be configured to route all node ping requests to a single API server which is also the DNS server
- All API servers are shut down while any API server is being upgraded
- API servers are not aware of anything like "my ID" or what other API servers are running
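The routing constraint in the list above can be sketched as a small Ruby model. This is only an illustration of the rule, not a load balancer configuration: the `Router` class, the backend host names, and the URL pattern assumed for node ping requests are all hypothetical.

```ruby
# Hypothetical backend names; a real site would configure these in its load balancer.
BACKENDS = %w[api1.example api2.example api3.example].freeze

# Assumed URL shape for node ping requests (an assumption, not taken from this issue).
PING_PATH = %r{\A/arvados/v1/nodes/[^/]+/ping\z}

class Router
  def initialize(backends, ping_backend)
    @backends = backends
    @ping_backend = ping_backend # the one API server that is also the DNS server
    @next = 0
  end

  # Node pings always go to the designated backend; everything else is round-robin.
  def route(path)
    return @ping_backend if PING_PATH.match?(path)
    backend = @backends[@next % @backends.size]
    @next += 1
    backend
  end
end

router = Router.new(BACKENDS, BACKENDS.first)
puts router.route('/arvados/v1/nodes/some-node-uuid/ping')
puts router.route('/arvados/v1/collections')
```

Note that the router needs no knowledge of server identity beyond the ping target, which matches the assumption that API servers are unaware of each other.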
Updated by Tom Morris over 6 years ago
- Target version changed from To Be Groomed to Arvados Future Sprints
Updated by Tom Morris over 5 years ago
- Target version changed from Arvados Future Sprints to 2019-03-27 Sprint
Updated by Peter Amstutz over 5 years ago
audit_logs.rb#delete_old creates tmp/audit_logs.lock
update_priority.rb#update_priority creates tmp/update_priority.lock
refresh_permission_view runs in a transaction and takes a table lock
sweep_trashed_objects#sweep_now does not set up an explicit transaction, but individual statements should be transactional. Seems like delete_project_and_contents should be in an explicit transaction at least.
crunch_dispatch.rb interacts with local slurm and git repo (specific to crunchv1)
commit_ancestor.rb and commit.rb read a local git repo (specific to crunchv1)
job.rb touches the file at crunch_refresh_trigger (specific to crunchv1)
The nodes table updates local DNS configuration. Used by node manager, could technically be used by on-prem configuration but I don't think anyone does.
Websocket events are triggered by database NOTIFY.
Async permission updates use local cache freshness to suppress updates. However, if another node performs an update it is not harmful.
The discovery document is stored in local cache. There is a generatedAt field which is db_current_time, which means otherwise identical servers on different hosts are likely to report different generatedAt times.
Login process uses a session. I'm not sure if this is necessary, or what the implications are if the user starts a session on one host and the next request goes to a different host.
Updated by Peter Amstutz over 5 years ago
Peter Amstutz wrote:
> audit_logs.rb#delete_old creates tmp/audit_logs.lock
> update_priority.rb#update_priority creates tmp/update_priority.lock
> refresh_permission_view runs in a transaction and takes a table lock
I think these would benefit from being wrapped in explicit transactions, but there doesn't appear to be much benefit from the file lock, except in preventing overlapping threads if the operation takes longer than the quiet period based on Rails.cache expiration. If that is important, we could use a table lock instead.
> sweep_trashed_objects#sweep_now does not set up an explicit transaction, but individual statements should be transactional. Seems like delete_project_and_contents should be in an explicit transaction at least.
> Async permission updates use local cache freshness to suppress updates. However, if another node triggers a permission update it is not harmful.
These don't have the file lock, but otherwise the same comment as above applies.
> crunch_dispatch.rb interacts with local slurm and git repo (specific to crunchv1)
> commit_ancestor.rb and commit.rb read a local git repo (specific to crunchv1)
> job.rb touches the file at crunch_refresh_trigger (specific to crunchv1)
Multiple API hosts will not support crunchv1.
> The nodes table updates local DNS configuration. Used by node manager, could technically be used by on-prem configuration but I don't think anyone does.
Multiple API hosts are incompatible with "classic" node management (node manager), where ping scripts and the nodes table trigger DNS updates.
Should be fine using either crunch-dispatch-slurm with an externally managed / static slurm cluster, or using crunch-dispatch-cloud.
> Websocket events are triggered by database NOTIFY.
> The discovery document is stored in local cache. There is a generatedAt field which is db_current_time, which means otherwise identical servers on different hosts are likely to report different generatedAt times.
I don't think there is anything to do here.
> Login process uses a session. I'm not sure if this is necessary, or what the implications are if the user starts a session on one host and the next request goes to a different host.
I need to research this some more.
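As a rough illustration of the "quiet period based on cache expiration" behaviour discussed in this comment, the Ruby sketch below models each API host with its own cache. It is not the Arvados code: `LocalCache` is a hypothetical stand-in for a per-host Rails.cache, and `sweep_if_due` and the 60-second quiet period are made up for the example. Under the assumption that the cache is not shared between hosts, every host runs its own sweep once per quiet period, which is wasted work rather than a correctness problem.

```ruby
# Hypothetical stand-in for a per-host cache such as Rails.cache.
class LocalCache
  def initialize
    @expires_at = {}
  end

  # True while an entry written earlier has not yet expired.
  def fresh?(key, now)
    @expires_at.fetch(key, 0) > now
  end

  def write(key, now, ttl)
    @expires_at[key] = now + ttl
  end
end

# Run the sweep only if this host has not run it within the quiet period.
# Returns true if the sweep would run, false if it is suppressed.
def sweep_if_due(cache, now, quiet_period)
  return false if cache.fresh?(:sweep, now)
  cache.write(:sweep, now, quiet_period)
  true
end

# Three hosts, each with its own cache: all three sweep at t=0,
# and none sweep again until the quiet period has elapsed.
hosts = Array.new(3) { LocalCache.new }
ran_at_t0 = hosts.count { |cache| sweep_if_due(cache, 0, 60) }
ran_at_t30 = hosts.count { |cache| sweep_if_due(cache, 30, 60) }
puts "sweeps at t=0: #{ran_at_t0}, sweeps at t=30: #{ran_at_t30}"
```

The cache only suppresses repeat runs on the same host; it does not stop a run that outlasts the quiet period from overlapping with the next one, which is the case where a table lock would help.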
Updated by Peter Amstutz over 5 years ago
There's some code from 2013 that sets values in the session, but I can't find anything that reads the session. So the session is probably irrelevant.
Updated by Peter Amstutz over 5 years ago
- Blocked by Idea #14987: [API] Upgrade to Rails 5 added
Updated by Peter Amstutz over 5 years ago
- Blocked by Idea #14873: [API] Update to Rails 5 added
Updated by Peter Amstutz over 5 years ago
- Blocked by deleted (Idea #14987: [API] Upgrade to Rails 5)
Updated by Peter Amstutz over 5 years ago
- Target version changed from 2019-03-27 Sprint to 2019-04-10 Sprint
Updated by Tom Morris over 5 years ago
- Target version changed from 2019-04-10 Sprint to Arvados Future Sprints
Updated by Ward Vandewege over 5 years ago
- Blocked by Idea #13908: [Epic] Replace SLURM for cloud job scheduling/dispatching added
Updated by Peter Amstutz almost 5 years ago
- Target version changed from Arvados Future Sprints to 2020-03-11 Sprint
Updated by Peter Amstutz almost 5 years ago
- Subject changed from [API] Support multiple load-balanced API server nodes to Support multiple load-balanced API server nodes
Updated by Peter Amstutz over 4 years ago
- Target version changed from 2020-03-11 Sprint to 2020-03-25 Sprint
Updated by Peter Amstutz over 4 years ago
- Target version changed from 2020-03-25 Sprint to 2020-04-08 Sprint
Updated by Peter Amstutz over 4 years ago
- Target version changed from 2020-04-08 Sprint to 2020-04-22
- Assigned To deleted (Ward Vandewege)
Updated by Peter Amstutz over 4 years ago
- Target version changed from 2020-04-22 to 2020-05-06 Sprint
Updated by Peter Amstutz over 4 years ago
- Target version changed from 2020-05-06 Sprint to 2020-05-20 Sprint
Updated by Peter Amstutz over 4 years ago
- Target version changed from 2020-05-20 Sprint to 2020-06-03 Sprint
Updated by Peter Amstutz over 4 years ago
- Target version changed from 2020-06-03 Sprint to 2020-06-17 Sprint
Updated by Peter Amstutz over 4 years ago
- Related to Idea #15941: arvados-boot added
Updated by Peter Amstutz over 4 years ago
- Target version changed from 2020-06-17 Sprint to 2020-07-01 Sprint
- Assigned To set to Ward Vandewege
Updated by Peter Amstutz over 4 years ago
- Target version changed from 2020-07-01 Sprint to 2020-07-15
Updated by Ward Vandewege over 4 years ago
- Target version changed from 2020-07-15 to Arvados Future Sprints
Updated by Peter Amstutz over 3 years ago
- Target version deleted (Arvados Future Sprints)
Updated by Peter Amstutz 24 days ago
- Release deleted (60)
- Target version deleted (Future)
- Status changed from New to Resolved
This configuration has been used in production for a while now.