Scaling things

In principle, an Arvados cluster with access to sufficient hardware/cloud resources should be able to handle arbitrarily large datasets, computations, and interactive usage. In practice, there are limitations. This page catalogs those limitations and strategies for addressing them.

Collection size

Collections with a large number of files
  • Slowness: the entire (large) manifest must be sent over the network in order to load or update even a single file
  • High memory usage (in several components) due to large manifest
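Both bullets follow from the fact that a collection's contents are stored as a single manifest text: each stream line lists block locators followed by position:length:filename tokens, one token per file. A toy sketch (the locator hash and file names below are made up for illustration):

```python
# Illustrative sketch of why collection manifests grow with file count.
# A Keep manifest is one text string; each stream line lists block
# locators followed by position:length:filename tokens, so every file
# adds a token. Loading or updating one file still requires
# transferring and parsing the whole string.

def toy_manifest(n_files, block_size=2**26):
    # One fake 64 MiB block locator (this hash is a placeholder).
    locator = "d41d8cd98f00b204e9800998ecf8427e+%d" % block_size
    tokens = " ".join("0:%d:file%06d.dat" % (block_size, i)
                      for i in range(n_files))
    return ". %s %s\n" % (locator, tokens)

small = toy_manifest(10)
large = toy_manifest(100_000)
# Manifest size scales linearly with the number of files.
print(len(small), len(large))
```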

Total data size

Large number of blocks
  • High memory usage in keep-balance
  • High garbage collection / replication adjustment latency due to long keep-balance iterations
  • High sensitivity to back-end errors (a back-end error while indexing can abort an entire keep-balance iteration)
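The memory effect can be sketched with back-of-envelope arithmetic: keep-balance holds an in-memory entry per block while computing pull/trash lists, so memory scales with block count rather than data size. The per-block overhead figure below is an assumption for illustration, not a measurement:

```python
# Back-of-envelope sketch (assumed numbers, not measured) of why
# keep-balance memory scales with total block count.

BYTES_PER_BLOCK_ENTRY = 200   # assumption: rough per-block overhead
BLOCK_SIZE = 64 * 2**20       # Keep's maximum block size: 64 MiB

def keep_balance_mem_gib(total_data_tib):
    # Best case: all blocks are full-size. Many small blocks mean a
    # higher block count, and proportionally more memory, for the
    # same total data size.
    n_blocks = total_data_tib * 2**40 / BLOCK_SIZE
    return n_blocks * BYTES_PER_BLOCK_ENTRY / 2**30

for tib in (10, 100, 1000):
    print(tib, "TiB ->", round(keep_balance_mem_gib(tib), 2), "GiB")
```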

Container queue size

Large number of queued containers
  • Higher dispatcher latency due to reloading entire queue
  • Excessive controller/rails/db load due to dispatcher reloading entire queue every N seconds
  • Scheduling/prioritization effects when cloud services are limited (e.g., instance quota)
  • High dispatcher memory use (function of # queued+running, not just # running)
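The reload pattern behind the first two bullets can be sketched as a polling loop (hypothetical code, not the real dispatcher; function and field names here are made up):

```python
# Hypothetical sketch of the polling pattern described above: every
# interval the dispatcher re-fetches and re-sorts the entire container
# queue, so per-interval controller/database load is proportional to
# the number of queued+running containers, even when nothing changed.

def poll_once(list_containers):
    # One full-queue reload: O(queued + running) work per poll.
    queue = list_containers(states=("Queued", "Locked", "Running"))
    return sorted(queue, key=lambda c: -c["priority"])

# Fake API returning a toy queue, for illustration only.
toy_queue = [{"uuid": "ctr-%03d" % i, "state": "Queued",
              "priority": (i * 37) % 100}
             for i in range(200)]

def fake_list_containers(states):
    return [c for c in toy_queue if c["state"] in states]

top = poll_once(fake_list_containers)
print(top[0]["priority"])  # highest-priority container comes first
```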
Large number of running containers
  • Lock contention due to cascading container/container request updates
  • Controller/rails bottleneck causes container/log/output updates to take longer
  • Interactive usage suffers when controller/rails is busy servicing many containers
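The cascade in the first bullet can be illustrated with a toy model (not the actual Arvados implementation; the priority rule here is deliberately simplified):

```python
# Toy model of cascading priority updates: a container's priority
# depends on the requests that want it, and child containers inherit
# from their parent, so one container-request update touches (and, in
# a database, would row-lock) an entire subtree of records.

def propagate(requests, containers, cr_uuid, new_priority, touched=None):
    touched = set() if touched is None else touched
    requests[cr_uuid]["priority"] = new_priority
    touched.add(cr_uuid)
    c_uuid = requests[cr_uuid]["container_uuid"]
    container = containers[c_uuid]
    # Simplified rule: container priority = max over its requests.
    container["priority"] = max(
        r["priority"] for r in requests.values()
        if r["container_uuid"] == c_uuid)
    touched.add(c_uuid)
    for child_cr in container["child_requests"]:
        propagate(requests, containers, child_cr,
                  container["priority"], touched)
    return touched

# A parent container with two children: one update touches 6 records.
requests = {
    "cr-root": {"container_uuid": "c-root", "priority": 500},
    "cr-a": {"container_uuid": "c-a", "priority": 500},
    "cr-b": {"container_uuid": "c-b", "priority": 500},
}
containers = {
    "c-root": {"priority": 500, "child_requests": ["cr-a", "cr-b"]},
    "c-a": {"priority": 500, "child_requests": []},
    "c-b": {"priority": 500, "child_requests": []},
}
print(len(propagate(requests, containers, "cr-root", 1000)))
```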

Updated by Tom Clegg 6 months ago · 1 revision