Idea #2880 (Closed)
Component/job can specify minimum memory and scratch space for worker nodes, and Crunch enforces these requirements at runtime
Description
Implement runtime constraints
The runtime constraints are currently specced as:
min_ram_mb_per_node
- The amount of RAM (MiB) available on each Node.
min_scratch_mb_per_node
- The amount of disk space (MiB) available for local caching on each Node.
For now, crunch-dispatch should start the biggest job that it can given the Nodes that are currently available.
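For illustration only, a Job might express these constraints in its runtime_constraints attribute roughly as follows; the two key names come from the spec above, while the values and the job variable are invented:

    # Hypothetical example: key names are from the spec above, values are
    # made up for illustration.
    job.runtime_constraints = {
      "min_ram_mb_per_node" => 7168,      # each worker node needs >= 7 GiB of RAM
      "min_scratch_mb_per_node" => 20480, # and >= 20 GiB of local scratch space
    }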
Updated by Brett Smith over 10 years ago
- Description updated (diff)
Need to have a conversation with Ward to flesh out this story better and figure out the exact lines of where the Node Manager ends and crunch-dispatch begins.
How does crunch-dispatch figure out which resources are available on the compute nodes? Is this available from SLURM, or should we record it somewhere? If the latter, does it make sense to have the Node include this information when it pings the API server?
Updated by Brett Smith over 10 years ago
Ward and I discussed this, and we're agreed that for this sprint, we're going to do the simplest thing that can possibly work:
- There will be no communication between the Node Manager and Crunch at this time. They will both ask the API server for the Job queue, and only use the information in there to make resource allocation decisions.
- Because we currently don't have a concept of Job priority, and it's not in this sprint, it seems best to stick pretty closely to Crunch's current FIFO strategy for working the Job queue. However, we need to take precautions so that a Job with unreasonably large resource requirements at the front of the queue doesn't prevent us from making progress on the rest of it.
- Planned implementation: when the Job at the front of the queue can't be started because its resource requirements aren't met, Crunch will wait a few minutes to see if the Node Manager makes those resources available. If it does, great; proceed as normal. If not, continue through the queue and start the first Job that can be run with the available resources. Make sure this wait only happens every so often, so lots of queue activity doesn't cause lots of waiting. (A sketch of this loop follows this list.)
- If there's not currently a good way to get resource information, add it to the Node pings.
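A rough Ruby sketch of the queue walk described above; nodes_satisfying, start_job, and the timing constants are hypothetical stand-ins, not the actual crunch-dispatch code:

    # Sketch only.  nodes_satisfying(job) returns a usable node set or nil;
    # start_job(job, nodes) hands the Job to Crunch.  Both are stand-ins.
    WAIT_SECONDS  = 5 * 60    # how long to wait on a starved front-of-queue Job
    WAIT_INTERVAL = 60 * 60   # how often we are willing to wait at all

    def dispatch_pass(queued_jobs)
      @last_wait ||= Time.at(0)
      queued_jobs.each_with_index do |job, i|
        if (nodes = nodes_satisfying(job))
          start_job(job, nodes)                  # plain FIFO case
          return
        elsif i.zero? && Time.now - @last_wait > WAIT_INTERVAL
          # The front of the queue is starved and we haven't paused lately:
          # give the Node Manager a few minutes to bring suitable nodes up,
          # then re-walk the queue.  The interval check keeps queue churn
          # from causing repeated waits.
          @last_wait = Time.now
          sleep WAIT_SECONDS
          return dispatch_pass(queued_jobs)
        end
        # Otherwise skip this Job for now and consider the next one in line.
      end
    end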
Updated by Ward Vandewege over 10 years ago
Reviewed 2880-compute-ping-stats.
The only thought I had was that it probably will make sense to report scratch space per partition at some point. But I think for now it's good enough. We need to come up with scratch space conventions at some point (mount points, etc), and when we do that, we can adjust the reporting. So, LGTM.
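For reference, a ping carrying resource stats might look roughly like this; the key names are a plausible sketch rather than a statement of what the 2880-compute-ping-stats branch actually sends, and scratch space is a single total rather than a per-partition breakdown:

    # Sketch of the resource fields a compute node could include when it
    # pings the API server.  Key names and values are illustrative only.
    ping_data = {
      "ping_secret"      => "xxxxxxxx",  # placeholder for the node's real secret
      "total_cpu_cores"  => 16,
      "total_ram_mb"     => 96 * 1024,   # MiB of RAM on this node
      "total_scratch_mb" => 900 * 1024,  # MiB of local scratch, summed across partitions
    }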
Updated by Peter Amstutz over 10 years ago
- nodes_available_for_job_now() looks like it will always return a list of size min_node_count (or nil) even if there are more than min_node_count nodes available. Based on the description "crunch-dispatch should start the biggest job that it can", you should move the if usable_nodes.count >= min_node_count outside of the find_each loop.
- Using sleep() to block nodes_available_for_job also blocks anything else that crunch-dispatch should be doing, such as handling already running jobs or pipelines.
- As I understand it, it will only go into a wait once every hour, and once the wait happens, it automatically skips stalled jobs, regardless of what has happened in the queue in the meantime (so job A gets to wait for resources, but job B will be skipped immediately in favor of job C if B was queued within an hour of A). This seems counterintuitive and unfair to job B.
Updated by Brett Smith over 10 years ago
Peter Amstutz wrote:
nodes_available_for_job_now() looks like it will always return a list of size min_node_count (or nil) even if there are more than min_node_count nodes available. Based on the description "crunch-dispatch should start the biggest job that it can", you should move the if usable_nodes.count >= min_node_count outside of the find_each loop.
The "run the biggest job we can" idea was to launch the Job with the heaviest resource requirements, not to give Jobs more resources than they need. We don't want to do the latter right now because we don't have any infrastructure to revoke excess resources from running Jobs in order to start others—I think that's more of a Crunch v2 thing. Plus, this idea was generally obsoleted by the decision in note 4 to stick closer to the existing FIFO scheduling.
- As I understand it, it will only go into a wait once every hour, and once the wait happens, it automatically skips stalled jobs, regardless of what has happened in the queue in the meantime (so job A gets to wait for resources, but job B will be skipped immediately in favor of job C if B was queued within an hour of A). This seems counterintuitive and unfair to job B.
You understand correctly, and you're right that it's unfair to B. It's a trade-off that we're making in order to keep the Job queue generally flowing. If we allowed a wait for every Job, then the sudden arrival of a bunch of large Jobs could stall progress through the queue for an extended period of time. ("I'm waiting for resources for Job A. There still aren't resources? Okay, I'm waiting for resources for Job B. There still aren't resources? Okay…") This seems like a worse problem generally.
If it helps, note that crunch-dispatch will still check to see if Nodes are available for Jobs at the front of the queue before skipping them. Once Job A finishes, Job B gets the first chance to claim the resources that it freed.
I'll get to work on fixing the sleep issue.
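One non-blocking shape the fix could take, sketched with hypothetical names, return values, and grace period (record when the head of the queue was first seen starved instead of sleeping, so the rest of the dispatch loop keeps running):

    # Hypothetical sketch of a sleep-free nodes_available_for_job.  The
    # three symbolic return values and the grace period are assumptions.
    GRACE_PERIOD = 5 * 60  # seconds to hold a starved Job at the head of the queue

    def nodes_available_for_job(job)
      @stalled_since ||= {}
      return :now if nodes_available_for_job_now(job)
      @stalled_since[job.uuid] ||= Time.now
      if Time.now - @stalled_since[job.uuid] < GRACE_PERIOD
        :maybe_soon   # caller keeps the Job at the head without blocking
      else
        :give_up      # caller may move on and start a smaller Job
      end
    end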
Updated by Brett Smith over 10 years ago
- Status changed from New to Resolved
- % Done changed from 41 to 100
Applied in changeset arvados|commit:82c4697bf24b10f3fb66d303ae73499095b5742a.