Changing architecture for dynamic node spinup
Put SLURM in control of deciding when nodes should be allocated or destroyed on the cloud. This is necessary to avoid second-guessing SLURM when multiple jobs can share a node. This can be implemented in a way that is compatible with both current crunch and crunch v2, and is likely to improve the stability of current crunch.
1) The Arvados nodes table has a static list of entries (0..N) for each available compute node size. Each entry lists the node's resources (CPUs, memory), its state from sinfo (idle/alloc/down), its state from node manager (booting/running/shutdown), and whether we want the node to be up or down.
2) Generate partial SLURM configuration from nodes table with nodes marked as "cloud" type; see https://dev.arvados.org/issues/6520#note-5 for details. ResumeProgram and SuspendProgram contact the API server and adjust the "desired state up/down" flag.
3) Change the architecture of node manager to get rid of the "wishlist" and of monitoring the Arvados job queue. Instead, node manager compares the Arvados node list with the cloud node list and decides which nodes to start and stop based on the "up/down" flag in the node record.
4) Remove the code from crunch-dispatch that explicitly selects nodes (#nodes_available_for_job). Instead, run salloc with the runtime constraints translated into the salloc parameters --nodes, --mincpus, --mem. Remove the --immediate flag from salloc so that the request is queued (which will cause SLURM to request more nodes if no idle nodes are available). One added benefit is that salloc will fail immediately if a job requests resources that cannot possibly be fulfilled in the current configuration, and this can be communicated to the user.
5) crunch-dispatch-slurm (crunch v2) will put jobs in the queue using sbatch and use similar --mincpus, --mem settings.
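The comparison described in step 3 can be sketched roughly as follows. This is a minimal illustration in Python, not the node manager's actual code: the `NodeRecord` type, its `hostname` and `desired_up` fields, and the `reconcile` function are all hypothetical names, and the sketch assumes cloud instances can be matched to node records by hostname.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class NodeRecord:
    """One static row of the nodes table (field names are hypothetical)."""
    hostname: str     # assumed to match the cloud instance's hostname
    desired_up: bool  # the "up/down" flag set via ResumeProgram/SuspendProgram


def reconcile(records: List[NodeRecord],
              cloud_nodes: Set[str]) -> Tuple[List[str], List[str]]:
    """Return (to_start, to_stop) hostnames.

    cloud_nodes holds the hostnames of instances that currently exist
    in the cloud. No job queue or wishlist is consulted.
    """
    # Wanted up, but no matching cloud instance exists: boot it.
    to_start = [r.hostname for r in records
                if r.desired_up and r.hostname not in cloud_nodes]
    # Wanted down, but a cloud instance still exists: shut it down.
    to_stop = [r.hostname for r in records
               if not r.desired_up and r.hostname in cloud_nodes]
    return to_start, to_stop


records = [
    NodeRecord("compute0", desired_up=True),
    NodeRecord("compute1", desired_up=True),
    NodeRecord("compute2", desired_up=False),
    NodeRecord("compute3", desired_up=False),
]
cloud_nodes = {"compute0", "compute2"}

to_start, to_stop = reconcile(records, cloud_nodes)
print(to_start)  # ['compute1']
print(to_stop)   # ['compute2']
```

The sketch ignores cloud instances that have no row in the nodes table; how to treat those is left open here.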
#3 Updated by Peter Amstutz over 1 year ago
Because changes will be required to several interacting components (SLURM config, nodes table on the API server, crunch-dispatch, and node manager), it may be impossible to change each component separately in a way that is simultaneously compatible with both the current way of doing things and the proposed approach. We could do this in a big-bang feature branch that touches everything at once. Alternatively, we can implement the changes in a default-deactivated state and then flip a configuration switch to enable them all once everything is deployed.
(The second option is more incremental and arguably lower risk, but also likely to take twice as long and be more work overall: adding the configuration knobs, using them to switch between two sets of code, and additional stories for integration testing and cleaning up the old code paths post-migration.)
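To make the configuration-switch option concrete, here is a rough Python sketch of how a single flag might select between the old and new salloc invocations. It is illustrative only: crunch-dispatch is not written in Python, the flag name `use_slurm_scheduling` and the function `salloc_command` are hypothetical, the runtime constraint keys (`min_nodes`, `min_cores_per_node`, `min_ram_mb_per_node`) are assumed, and the legacy branch is a simplified stand-in for the existing explicit node selection rather than a copy of it.

```python
from typing import Dict, List, Optional


def salloc_command(runtime_constraints: Dict[str, int],
                   use_slurm_scheduling: bool,
                   selected_nodes: Optional[List[str]] = None) -> List[str]:
    """Build the salloc argument list for one job."""
    if not use_slurm_scheduling:
        # Legacy path (simplified stand-in): the dispatcher has already
        # picked the nodes, and --immediate makes salloc fail rather
        # than wait if they are not available.
        if not selected_nodes:
            raise ValueError("legacy path requires an explicit node list")
        return ["salloc", "--immediate",
                "--nodelist=" + ",".join(selected_nodes)]

    # New path: no --immediate, so the request is queued and SLURM
    # decides which nodes to use (and whether to resume more).
    cmd = ["salloc",
           "--nodes=%d" % runtime_constraints.get("min_nodes", 1)]
    if "min_cores_per_node" in runtime_constraints:
        cmd.append("--mincpus=%d" % runtime_constraints["min_cores_per_node"])
    if "min_ram_mb_per_node" in runtime_constraints:
        # salloc interprets --mem in megabytes by default.
        cmd.append("--mem=%d" % runtime_constraints["min_ram_mb_per_node"])
    return cmd


constraints = {"min_nodes": 2, "min_cores_per_node": 4,
               "min_ram_mb_per_node": 8192}
print(salloc_command(constraints, use_slurm_scheduling=True))
print(salloc_command(constraints, use_slurm_scheduling=False,
                     selected_nodes=["compute0", "compute1"]))
```

Keeping both branches behind one function is what makes the switch cheap to flip, but it is also the extra code that has to be tested and later removed.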