Scaling things » History » Version 1

Tom Clegg, 06/07/2023 03:02 PM

h1. Scaling things
In principle, an Arvados cluster with access to sufficient hardware/cloud resources should be able to handle arbitrarily large datasets, computations, and interactive usage. In practice, there are limitations. This wiki page aims to catalog those limitations and strategies for addressing them.
h2. Collection size
Collections with a large number of files can cause:
* Slowness due to the large manifest being sent over the network in order to load or update a single file
* High memory usage (in several components) due to the large manifest
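A rough back-of-envelope model shows why per-file cost matters. The byte figures below are assumptions for illustration, not measured Arvados values; the only facts carried over are that Keep blocks are at most 64 MiB and that a manifest encodes one token per file plus one locator per block.

```python
# Illustrative estimate of manifest text size for a large collection.
# The per-token byte counts are assumptions, not Arvados internals.
AVG_FILE_TOKEN_BYTES = 40      # assumed average "position:size:filename" token
AVG_LOCATOR_BYTES = 45         # assumed average block locator (hash + size hint)
BLOCK_SIZE = 64 * 1024**2      # Keep blocks are at most 64 MiB

def manifest_size_estimate(num_files, total_bytes):
    """Approximate manifest size in bytes for num_files files totalling total_bytes."""
    num_blocks = max(1, total_bytes // BLOCK_SIZE)
    return num_files * AVG_FILE_TOKEN_BYTES + num_blocks * AVG_LOCATOR_BYTES

# A million small files: the manifest alone is tens of MiB, and every
# single-file read or update ships (and parses) the whole thing.
print(manifest_size_estimate(1_000_000, 10 * 1024**3) / 1024**2)
```

Under these assumptions, file count (not data size) dominates the manifest, which is why many-small-file collections hurt even when total data is modest.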
h2. Total data size
A large number of blocks can cause:
* High memory usage in keep-balance
* High garbage collection / replication adjustment latency due to long keep-balance iterations
* High sensitivity to back-end errors (a back-end error during indexing can abort an entire keep-balance iteration)
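The keep-balance memory point can be made concrete with a simple model. The per-block overhead figure below is an assumption for illustration (keep-balance must track at least each block's locator and replica locations in memory), not a measured value.

```python
# Illustrative estimate of keep-balance memory as a function of stored data.
BLOCK_SIZE = 64 * 1024**2      # max Keep block size
BYTES_PER_BLOCK_ENTRY = 100    # assumed in-memory cost per tracked block

def keep_balance_memory_estimate(total_stored_bytes):
    """Approximate bytes of tracking state for total_stored_bytes of full-size blocks."""
    num_blocks = total_stored_bytes // BLOCK_SIZE
    return num_blocks * BYTES_PER_BLOCK_ENTRY

# 1 PiB stored in full-size blocks -> roughly 1.6 GiB of tracking state;
# small blocks multiply the block count and make this much worse.
print(keep_balance_memory_estimate(1024**5) / 1024**3)
```

The model also shows why block size distribution matters: the same total data stored in 1 MiB blocks needs 64x the tracking state of full-size blocks.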
h2. Container queue size
A large number of queued containers can cause:
* Higher dispatcher latency due to reloading the entire queue
* Excessive controller/rails/db load due to the dispatcher reloading the entire queue every N seconds
* Scheduling/prioritization side effects when cloud resources are limited (e.g., by instance quota)
* High dispatcher memory use (a function of the number of queued+running containers, not just running ones)
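The cost of full-queue reloads grows linearly with queue length, regardless of how much actually changed. A minimal sketch of that load, assuming a periodic full reload (the 10-second poll interval is an illustrative stand-in for "every N seconds" above):

```python
# Rough model of container records fetched from controller/rails per hour
# when the dispatcher reloads the whole queue on every poll.
def records_fetched_per_hour(queue_len, poll_interval_s=10):
    """Records transferred per hour for a full reload every poll_interval_s seconds."""
    polls_per_hour = 3600 // poll_interval_s
    return queue_len * polls_per_hour

# A 10,000-container queue polled every 10s -> 3.6M records/hour reloaded,
# even if nothing in the queue changed.
print(records_fetched_per_hour(10_000))  # 3600000
```

This is why incremental updates (fetching only changed records) scale better than periodic full reloads as queues grow.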
A large number of running containers can cause:
* Lock contention due to cascading container/container request updates
* A controller/rails bottleneck that makes container/log/output updates take longer
* Degraded interactive performance while controller/rails is busy servicing many containers
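One generic mitigation for the lock-contention point is retrying contended updates with exponential backoff plus jitter, so many writers hitting the same rows do not retry in lockstep. This is a standard technique sketched in plain Python, not Arvados code; the `RuntimeError` stands in for whatever lock/conflict error the real client would see.

```python
# Generic retry-with-backoff sketch (not Arvados code).
import random
import time

def update_with_backoff(update_fn, max_attempts=5, base_delay=0.05):
    """Call update_fn, retrying on conflict with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return update_fn()
        except RuntimeError:  # stand-in for a lock/conflict error
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2**attempt * random.uniform(0.5, 1.5))

# Example: a hypothetical update that fails twice before succeeding.
attempts = {"n": 0}
def flaky_update():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("row locked")
    return "ok"

print(update_with_backoff(flaky_update))  # ok
```

Jitter is the important part: without it, contending writers back off and collide again on the same schedule.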