[Node Manager] Behave smarter in environments where scratch space can be arbitrarily sized, like GCP and AWS
Node manager's current handling of scratch space is very inflexible in those scenarios where node types do not have any default scratch space, like AWS and GCP.
The status quo on AWS is that node manager can create arbitrary node type/scratch space combinations, but only for those combinations defined in its config file.
From an operational perspective, this really sucks.
In #12793, on AWS, we had a situation where a container requested 389 GiB of scratch space on a 16 core node with 49 GiB of ram. Node manager had a matching node type for cores and ram (m4.4xlarge), but that node type was only defined with 310GB of scratch space in the config file. That number was chosen fairly arbitrarily. I worked around the problem by increasing the scratch space definition for that node type to 512GB.
Obviously, we could have picked higher numbers in the config to start out, but that adds to the running cost of every node of that size - irrespective of how much scratch is requested by the job.
This situation is pretty annoying.
1) It's completely opaque to the user. All they see in the container log is the rather cryptic error message: "Requirements for a single node exceed the available cloud node size". At the very least can we please get error messages that humans can parse? Say, "no node types available with this much scratch space"? Better yet, spit out the table of available node types with ram/cpu/scratch, and add a line for the container so that it is obvious what is going wrong.
2) But the situation is also pretty unworkable for ops - how are we supposed to determine the optimal scratch space size definitions for a particular installation? It depends on what the workflows need, which means it suddenly becomes highly site-specific, and highly specific to the particular load of that particular week...
3) It also adds unnecessary cost to the runtime - by definition, we are overspecifying scratch space for every run on every node. That's wasting money.
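To make the request in point 1 concrete, here is a rough sketch of what such a report could look like. The `Spec` record, the `explain_unsatisfiable` function and the column layout are all made up for illustration and are not node manager's actual code; the sketch also assumes all sizes are expressed in the same unit.

```python
from collections import namedtuple

# Hypothetical record; field names are illustrative, not node manager's own.
# All sizes are assumed to be in the same unit.
Spec = namedtuple("Spec", ["name", "cores", "ram", "scratch"])

ROW = "{:<12} {:>5} {:>6} {:>8}  {}"

def explain_unsatisfiable(request, node_types):
    """Build a readable report showing why no node type fits the request."""
    lines = ["No configured node type satisfies this container:",
             ROW.format("node type", "cores", "ram", "scratch", "too small in")]
    for nt in node_types:
        # Collect every resource where this node type falls short.
        short = [f for f in ("cores", "ram", "scratch")
                 if getattr(nt, f) < getattr(request, f)]
        lines.append(ROW.format(nt.name, nt.cores, nt.ram, nt.scratch,
                                ", ".join(short) or "(fits)"))
    # Last row: the container itself, so the mismatch is visible at a glance.
    lines.append(ROW.format(request.name, request.cores, request.ram,
                            request.scratch, "(requested)"))
    return "\n".join(lines)

# Figures loosely based on the incident above; the node's ram value is illustrative.
report = explain_unsatisfiable(
    Spec("container", 16, 49, 389),
    [Spec("m4.4xlarge", 16, 64, 310)],
)
print(report)
```

With a report like this in the container log, the user would see right away that scratch space, not cores or ram, is what rules out the node type.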
I would like scratch space to be handled differently.
Here's one idea.
On GCP and AWS scratch space is effectively a variable that can be set irrespective of the node type. I would like to see it handled that way by our code.
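As a minimal sketch of what "handled that way" might mean for node selection: pick the node type on cores and ram only, and carry the requested scratch size along as a separate value. The `NodeType` and `Request` records, the `plan_node` function, the `price` field and the `max_scratch` cap are all hypothetical names for illustration, and the numbers in the example are placeholders.

```python
from collections import namedtuple

# Hypothetical records; note that NodeType has no scratch field at all.
NodeType = namedtuple("NodeType", ["name", "cores", "ram", "price"])
Request = namedtuple("Request", ["cores", "ram", "scratch"])

def plan_node(request, node_types, max_scratch):
    """Choose a node type by cores and ram; scratch is sized per request."""
    if request.scratch > max_scratch:
        raise ValueError("requested scratch exceeds the cloud volume limit")
    candidates = [nt for nt in node_types
                  if nt.cores >= request.cores and nt.ram >= request.ram]
    if not candidates:
        raise ValueError("no node type has enough cores and ram")
    # Cheapest node type that fits; scratch does not influence the choice.
    cheapest = min(candidates, key=lambda nt: nt.price)
    return cheapest.name, request.scratch

# Placeholder prices and limit, for illustration only.
types = [NodeType("small", 4, 16, 1.0), NodeType("m4.4xlarge", 16, 64, 4.0)]
plan = plan_node(Request(16, 49, 389), types, max_scratch=1000)
print(plan)
```

The point of the design is that the config file would no longer need a scratch size per node type, so there would be nothing to guess at per site.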
Ideally, an appropriately sized scratch volume (or volumes?) is attached to the node when the container is about to start, and removed when the container is done - or when the node shuts down, whichever comes first.
That way the user always gets what they request (up to the cloud limit, obviously!) and the scheduling decision also gets easier since scratch space is basically not a factor in the choice of which node to schedule a job on.
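The attach/remove lifecycle described above could be sketched roughly as follows. The cloud interface here (`create`, `attach`, `detach`, `destroy`) is a made-up stand-in, not the API of any real driver, and `FakeCloud` only records calls so that the sketch can run on its own. Cleanup on node shutdown is not covered and would need separate handling.

```python
from contextlib import contextmanager

class FakeCloud:
    """Stand-in for a cloud driver; records calls instead of making them."""
    def __init__(self):
        self.calls = []
    def create(self, size):
        self.calls.append(("create", size))
        return "vol-1"
    def attach(self, volume_id, node_id):
        self.calls.append(("attach", volume_id, node_id))
    def detach(self, volume_id):
        self.calls.append(("detach", volume_id))
    def destroy(self, volume_id):
        self.calls.append(("destroy", volume_id))

@contextmanager
def scratch_volume(cloud, node_id, size):
    """Provide a scratch volume for the duration of one container run."""
    volume_id = cloud.create(size)
    attached = False
    try:
        cloud.attach(volume_id, node_id)
        attached = True
        yield volume_id
    finally:
        # Runs whether the container succeeds or fails.
        if attached:
            cloud.detach(volume_id)
        cloud.destroy(volume_id)

cloud = FakeCloud()
with scratch_volume(cloud, "node-1", 389) as vol:
    pass  # the container would run here
print(cloud.calls)
```

Tying the volume to the container run like this means the scratch cost is only paid while a job that asked for it is actually running.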