Scaling things » History » Version 1
Tom Clegg, 06/07/2023 03:02 PM
h1. Scaling things

In principle, an Arvados cluster with access to sufficient hardware/cloud resources should be able to handle arbitrarily large datasets, computations, and interactive usage. In practice, there are limitations. This wiki aims to catalog limitations and strategies to address them.

h2. Collection size

Collections with a large number of files
* Slowness due to large manifest being sent over the network in order to load/update a single file
* High memory usage (in several components) due to large manifest
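
To see why manifest size matters, note that a Keep manifest names every file in the collection, so its size grows linearly with file count. The sketch below estimates manifest size from the manifest format (stream name, block locators, `position:size:filename` tokens); the average name length and files-per-block figures are assumptions for illustration, not measurements.

```python
# Rough estimate of Keep manifest size for a flat collection.
# Manifest line format: ". <locator> <locator> ... <pos>:<size>:<name> ...\n"
# avg_name_len and files_per_block are assumed values for illustration.

def estimate_manifest_bytes(num_files, avg_name_len=30, files_per_block=100):
    locator_bytes = 32 + len("+67108864")   # md5 hex digest + size hint
    num_blocks = max(1, num_files // files_per_block)
    # "pos:size:" digits plus the file name itself
    file_token_bytes = 10 + 4 + avg_name_len
    return (2                                # ". " stream name prefix
            + num_blocks * (locator_bytes + 1)
            + num_files * (file_token_bytes + 1))

for n in (1_000, 100_000, 1_000_000):
    print(f"{n:>9} files -> ~{estimate_manifest_bytes(n) / 1e6:.2f} MB manifest")
```

Even at these rough numbers, a million-file collection implies a manifest in the tens of megabytes, transferred and parsed in full to touch a single file.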

h2. Total data size

Large number of blocks
* High memory usage in keep-balance
* High garbage collection / replication adjustment latency due to long keep-balance iterations
* High sensitivity to back-end errors (a back-end error while indexing can abort an entire keep-balance iteration)
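
A back-of-envelope sketch of the keep-balance memory issue: keep-balance tracks an in-memory entry for every block on the cluster. The 64 MiB maximum block size is Keep's default; the per-entry bookkeeping cost below is an assumed figure for illustration only.

```python
# Assumed per-block bookkeeping (hash, desired/actual replica state, refs).
BYTES_PER_ENTRY = 100
BLOCK_SIZE = 64 << 20            # 64 MiB, Keep's default maximum block size

for total_pib in (0.1, 1, 10):
    total_bytes = total_pib * (1 << 50)          # PiB -> bytes
    blocks = total_bytes / BLOCK_SIZE
    mem_gb = blocks * BYTES_PER_ENTRY / 1e9
    print(f"{total_pib:>5} PiB stored -> ~{blocks / 1e6:.0f}M blocks, "
          f"~{mem_gb:.1f} GB keep-balance memory")
```

At ~16.8M blocks per PiB, memory (and iteration time) scales with total stored data, not with how much of it changes.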

h2. Container queue size

Large number of queued containers
* Higher dispatcher latency due to reloading entire queue
* Excessive controller/rails/db load due to dispatcher reloading entire queue every N seconds
* Scheduling/prioritization effects when cloud services are limited (e.g., instance quota)
* High dispatcher memory use (function of # queued+running, not just # running)
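
The full-reload cost can be sketched with simple arithmetic: if the dispatcher re-fetches the entire queue every poll interval, backend work grows linearly with queue length regardless of how little actually changed. The poll interval and per-record cost below are assumptions for illustration, not measured values.

```python
POLL_INTERVAL_S = 10       # assumed poll interval ("every N seconds")
COST_PER_RECORD_MS = 0.5   # assumed controller+db cost to list one container

for queued in (1_000, 50_000, 500_000):
    load_ms_per_s = queued * COST_PER_RECORD_MS / POLL_INTERVAL_S
    print(f"{queued:>7} queued -> ~{load_ms_per_s:.0f} ms of backend work "
          f"per wall-clock second just re-listing the queue")
```

Under these assumptions a 500k-container queue generates ~25 CPU-seconds of listing work per second, i.e. the reload alone can saturate the backend.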

Large number of running containers
* Lock contention due to cascading container/container request updates
* Controller/rails bottleneck causes container/log/output updates to take longer
* Interactive usage suffers when controller/rails is busy servicing many containers
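
The cascade effect multiplies write traffic: each container state update also touches its linked container request record(s), so write load grows with the product of running containers, update frequency, and cascade fan-out. The interval and fan-out below are assumed figures for illustration.

```python
UPDATE_INTERVAL_S = 5      # assumed: each container reports state/logs every 5s
CASCADE_FANOUT = 2         # assumed: container row + linked container request row

for running in (100, 2_000, 20_000):
    writes_per_s = running / UPDATE_INTERVAL_S * CASCADE_FANOUT
    print(f"{running:>6} running -> ~{writes_per_s:.0f} row updates/s "
          f"through controller/rails")
```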