Idea #20473
Automated scalability regression test (open)
Description
Write an automated test that
- brings up an Arvados cluster
- submits a large work queue
- lets it run for some short time: at least some containers should finish, but not all or even most of them
- checks logs and metrics of all services afterwards, and fails if any of the following appear:
  - 5xx responses from web services
  - containers being retried, or other signs of Crunch thrashing
  - Crunch not using the maximum number of compute nodes available to it
  - other signs of trouble in Prometheus (tbd: what?)
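To make the pass/fail criteria above concrete, here is a minimal sketch of the post-run check. It assumes the harness has already scraped the logs and metrics into a small summary object; `RunObservations`, its field names, and `find_failures` are all hypothetical and not part of Arvados. The Prometheus checks are left out since they are still tbd.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RunObservations:
    """Hypothetical summary of what the harness collected after the run."""
    http_statuses: List[int] = field(default_factory=list)  # status codes seen in web service logs
    container_attempts: Dict[str, int] = field(default_factory=dict)  # container ID -> times started
    peak_busy_nodes: int = 0  # most compute nodes running containers at once
    max_nodes: int = 0        # compute nodes Crunch was allowed to use
    submitted: int = 0        # containers in the work queue
    finished: int = 0         # containers that completed before the cutoff


def find_failures(obs: RunObservations) -> List[str]:
    """Return one message per failed criterion; an empty list means the run passed."""
    failures = []

    server_errors = [s for s in obs.http_statuses if 500 <= s <= 599]
    if server_errors:
        failures.append(f"{len(server_errors)} 5xx responses from web services")

    retried = sorted(c for c, n in obs.container_attempts.items() if n > 1)
    if retried:
        failures.append(f"{len(retried)} containers were retried")

    if obs.peak_busy_nodes < obs.max_nodes:
        failures.append(
            f"Crunch peaked at {obs.peak_busy_nodes} of {obs.max_nodes} compute nodes"
        )

    # The run should stop while the queue is still mostly unfinished.
    if obs.finished == 0:
        failures.append("no containers finished")
    elif 2 * obs.finished > obs.submitted:
        failures.append("most containers finished; the queue was too small")

    return failures
```

The test would then fail whenever `find_failures(...)` returns a non-empty list, and print the messages so it's obvious which criterion tripped.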
This test is not expected to run on every branch or even commit to main. Instead we run it when we're testing a branch that could have significant scalability consequences, or when we're preparing a major release.
Implementation details (we're less wedded to these): The basic idea is to spin up a mid-sized cloud node, deploy a single-node Arvados cluster onto it, and run the tests there. We can submit large workflows to generate large-record-size container requests, but all workflows and workflow steps should have tiny resource requirements, so we can run a lot of them on the same node. For example, a step might download a multi-GiB collection to a temporary directory, and then confirm its portable data hash.
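As a rough illustration of "tiny resource requirements", the sketch below builds a queue of container request bodies that each ask for one vCPU and 64 MiB of RAM. The field names follow my understanding of the container request schema and should be checked against the API reference. The helper names, the `scale-test-` naming scheme, the default sizes, and the image and command passed in are all hypothetical. Submitting the bodies to the cluster is deliberately left out.

```python
from typing import Dict, List, Sequence

MiB = 2 ** 20


def tiny_container_request(index: int, image: str, command: Sequence[str],
                           ram_bytes: int = 64 * MiB,
                           tmp_bytes: int = 64 * MiB) -> Dict:
    """Build one request body with minimal CPU and RAM (sizes are assumptions)."""
    return {
        "name": f"scale-test-{index:05d}",   # hypothetical naming scheme
        "state": "Committed",                # ask for it to be scheduled right away
        "priority": 1,
        "container_image": image,
        "command": list(command),
        "cwd": "/tmp",
        "output_path": "/tmp",
        "mounts": {"/tmp": {"kind": "tmp", "capacity": tmp_bytes}},
        "runtime_constraints": {"vcpus": 1, "ram": ram_bytes},
    }


def build_queue(count: int, image: str, command: Sequence[str]) -> List[Dict]:
    """Build `count` request bodies that differ only in name."""
    return [tiny_container_request(i, image, command) for i in range(count)]
```

A step that downloads a multi-GiB collection would need a larger `tmp_bytes` than the default here, so scratch space may end up limiting how many containers fit on one node.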
The cluster should use the default configuration as much as possible. The only configuration values that should change are the ones that are necessarily tied to the capabilities of the underlying hardware, like MaxComputeVMs.
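One way to enforce the "default configuration as much as possible" rule is to have the test diff the deployed configuration against the defaults and complain about anything outside an allowlist. The sketch below assumes both configurations have already been loaded into nested dicts. The function names are hypothetical, and the key paths in the usage example (including where `MaxComputeVMs` sits in the tree) are illustrative only.

```python
from typing import Dict, Iterator, List, Set


def changed_keys(default: Dict, actual: Dict, prefix: str = "") -> Iterator[str]:
    """Yield dotted paths of every setting where `actual` differs from `default`."""
    for key in sorted(set(default) | set(actual)):
        path = f"{prefix}.{key}" if prefix else key
        d, a = default.get(key), actual.get(key)
        if isinstance(d, dict) and isinstance(a, dict):
            # Recurse into sections present on both sides.
            yield from changed_keys(d, a, path)
        elif d != a:
            yield path


def unexpected_overrides(default: Dict, actual: Dict, allowed: Set[str]) -> List[str]:
    """Return changed settings that are not on the hardware-dependent allowlist."""
    return [p for p in changed_keys(default, actual) if p not in allowed]
```

For example, with illustrative key paths:

```python
default = {"Containers": {"MaxComputeVMs": 64}, "API": {"MaxConcurrentRequests": 8}}
actual = {"Containers": {"MaxComputeVMs": 4}, "API": {"MaxConcurrentRequests": 64}}
print(unexpected_overrides(default, actual, {"Containers.MaxComputeVMs"}))
```

In practice the allowlist would also need to cover values every deployment has to set, like hostnames and secrets.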
Updated by Brett Smith over 1 year ago
- Category set to Tests
- Description updated
Updated by Brett Smith over 1 year ago
- Related to Feature #14922: Run multiple containers concurrently on a single cloud VM added