Scaling things » History » Version 1
Tom Clegg, 06/07/2023 03:02 PM
h1. Scaling things

In principle, an Arvados cluster with access to sufficient hardware/cloud resources should be able to handle arbitrarily large datasets, computations, and interactive usage. In practice, there are limitations. This wiki aims to catalog limitations and strategies to address them.

h2. Collection size

Collections with a large number of files
* Slowness due to large manifest being sent over the network in order to load/update a single file
* High memory usage (in several components) due to large manifest
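
To see why manifest size matters, note that a Keep manifest names every file in the collection, so its size grows linearly with file count. The sketch below estimates manifest size from the manifest format (stream name, block locators, `position:size:filename` tokens); the average name length and files-per-block figures are assumptions for illustration, not measurements.

```python
# Rough estimate of Keep manifest size for a flat collection.
# Manifest line format: ". <locator> <locator> ... <pos>:<size>:<name> ...\n"
# avg_name_len and files_per_block are assumed values for illustration.

def estimate_manifest_bytes(num_files, avg_name_len=30, files_per_block=100):
    locator_bytes = 32 + len("+67108864")   # md5 hex digest + size hint
    num_blocks = max(1, num_files // files_per_block)
    # "pos:size:" digits plus the file name itself
    file_token_bytes = 10 + 4 + avg_name_len
    return (2                                # ". " stream name prefix
            + num_blocks * (locator_bytes + 1)
            + num_files * (file_token_bytes + 1))

for n in (1_000, 100_000, 1_000_000):
    print(f"{n:>9} files -> ~{estimate_manifest_bytes(n) / 1e6:.2f} MB manifest")
```

Even at these rough numbers, a million-file collection implies a manifest in the tens of megabytes, transferred and parsed in full to touch a single file.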

h2. Total data size

Large number of blocks
* High memory usage in keep-balance
* High garbage collection / replication adjustment latency due to long keep-balance iterations
* High sensitivity to back-end errors (a back-end error while indexing can abort an entire keep-balance iteration)
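
A back-of-envelope sketch of the keep-balance memory issue: keep-balance tracks an in-memory entry for every block on the cluster. The 64 MiB maximum block size is Keep's default; the per-entry bookkeeping cost below is an assumed figure for illustration only.

```python
# Assumed per-block bookkeeping (hash, desired/actual replica state, refs).
BYTES_PER_ENTRY = 100
BLOCK_SIZE = 64 << 20            # 64 MiB, Keep's default maximum block size

for total_pib in (0.1, 1, 10):
    total_bytes = total_pib * (1 << 50)          # PiB -> bytes
    blocks = total_bytes / BLOCK_SIZE
    mem_gb = blocks * BYTES_PER_ENTRY / 1e9
    print(f"{total_pib:>5} PiB stored -> ~{blocks / 1e6:.0f}M blocks, "
          f"~{mem_gb:.1f} GB keep-balance memory")
```

At ~16.8M blocks per PiB, memory (and iteration time) scales with total stored data, not with how much of it changes.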

h2. Container queue size

Large number of queued containers
* Higher dispatcher latency due to reloading entire queue
* Excessive controller/rails/db load due to dispatcher reloading entire queue every N seconds
* Scheduling/prioritization effects when cloud services are limited (e.g., instance quota)
* High dispatcher memory use (function of # queued+running, not just # running)
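
The full-reload cost can be sketched with simple arithmetic: if the dispatcher re-fetches the entire queue every poll interval, backend work grows linearly with queue length regardless of how little actually changed. The poll interval and per-record cost below are assumptions for illustration, not measured values.

```python
POLL_INTERVAL_S = 10       # assumed poll interval ("every N seconds")
COST_PER_RECORD_MS = 0.5   # assumed controller+db cost to list one container

for queued in (1_000, 50_000, 500_000):
    load_ms_per_s = queued * COST_PER_RECORD_MS / POLL_INTERVAL_S
    print(f"{queued:>7} queued -> ~{load_ms_per_s:.0f} ms of backend work "
          f"per wall-clock second just re-listing the queue")
```

Under these assumptions a 500k-container queue generates ~25 CPU-seconds of listing work per second, i.e. the reload alone can saturate the backend.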

Large number of running containers
* Lock contention due to cascading container/container request updates
* Controller/rails bottleneck causes container/log/output updates to take longer
* Interactive usage suffers when controller/rails is busy servicing many containers
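
The cascade effect multiplies write traffic: each container state update also touches its linked container request record(s), so write load grows with the product of running containers, update frequency, and cascade fan-out. The interval and fan-out below are assumed figures for illustration.

```python
UPDATE_INTERVAL_S = 5      # assumed: each container reports state/logs every 5s
CASCADE_FANOUT = 2         # assumed: container row + linked container request row

for running in (100, 2_000, 20_000):
    writes_per_s = running / UPDATE_INTERVAL_S * CASCADE_FANOUT
    print(f"{running:>6} running -> ~{writes_per_s:.0f} row updates/s "
          f"through controller/rails")
```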