Actions
Scaling things¶
In principle, an Arvados cluster with access to sufficient hardware/cloud resources should be able to handle arbitrarily large datasets, computations, and interactive usage. In practice, there are limitations. This wiki aims to catalog limitations and strategies to address them.
Collection size¶
Collections with a large number of files- Slowness due to large manifest being sent over the network in order to load/update a single file
- High memory usage (in several components) due to large manifest
Total data size¶
Large number of blocks- High memory usage in keep-balance
- High garbage collection / replication adjustment latency due to long keep-balance iterations
- High sensitivity to back-end errors (a back-end error while indexing can abort an entire keep-balance iteration)
Container queue size¶
Large number of queued containers- Higher dispatcher latency due to reloading entire queue
- Excessive controller/rails/db load due to dispatcher reloading entire queue every N seconds
- Scheduling/prioritization effects when cloud services are limited (e.g., instance quota)
- High dispatcher memory use (function of # queued+running, not just # running)
- Lock contention due to cascading container/container request updates
- Controller/rails bottleneck causes container/log/output updates to take longer
- Interactive usage suffers when controller/rails is busy servicing many containers
Updated by Tom Clegg over 1 year ago · 1 revisions