Story #2880

Component/job can specify minimum memory and scratch space for worker nodes, and Crunch enforces these requirements at runtime

Added by Tom Clegg over 6 years ago. Updated over 6 years ago.

Status: Resolved
Priority: Normal
Assigned To: Brett Smith
Category: -
Start date: 06/05/2014
Due date: -
% Done: 100%
Estimated time: (Total: 17.00 h)
Story points: 3.0

Description

Implement runtime constraints

The runtime constraints are currently specced as:

  • min_ram_mb_per_node - The minimum amount of RAM (MiB) that must be available on each Node.
  • min_scratch_mb_per_node - The minimum amount of disk space (MiB) that must be available for local caching on each Node.

For now, crunch-dispatch should start the biggest job that it can given the Nodes that are currently available.
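For illustration, a Job requesting these constraints might carry a hash like the following. This is a hypothetical sketch of the shape implied by the spec above; the surrounding `job_spec` name and the specific values are invented for the example:

```ruby
# Hypothetical example: a Job asking for nodes with at least 7 GiB of
# RAM and 50 GiB of local scratch space, using the constraint names
# specced above.
job_spec = {
  "runtime_constraints" => {
    "min_ram_mb_per_node" => 7168,      # 7 GiB minimum RAM per node
    "min_scratch_mb_per_node" => 51200, # 50 GiB minimum scratch per node
  },
}
```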


Subtasks

Task #2967: Compute nodes report resource information in pings (Resolved, Brett Smith)

Task #2975: Review 2880-compute-ping-stats (Resolved, Brett Smith)

Task #2993: Review 2880-crunch-dispatch-node-constraints-wip (Resolved, Brett Smith)

Task #2976: Crunch only starts jobs when hardware constraints are satisfied (Resolved, Brett Smith)

Associated revisions

Revision 6e873d99 (diff)
Added by Brett Smith over 6 years ago

2880: Don't dispatch Jobs until runtime constraints are met.

This retains the same FIFO approach to the Job queue that
crunch-dispatch currently uses, but now when it encounters a Job whose
constraints are not met:

  • it may wait for a while to see if the Node Manager makes Nodes
    available, if it hasn't done that this hour; and
  • it leaves that Job in the queue and tries to process the next one.

See #2880 for further background. The exact parameters of "waiting
for Nodes" will probably need tuning, but that will be easier to do
after it's been in production for a while.

Revision 139728cc (diff)
Added by Brett Smith over 6 years ago

2880: Avoid long sleeps in crunch-dispatch.

From feedback in refs #2880. Now instead of sleeping, we set a
deadline that decides whether to break or continue through start_jobs'
main loop.
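The deadline approach described in this commit message can be illustrated schematically. The class and method names below are hypothetical stand-ins, not the actual crunch-dispatch implementation:

```ruby
# Sketch: instead of sleeping inline (which would block everything else
# crunch-dispatch should be doing), record a deadline once, and let the
# main loop decide on each iteration whether to keep waiting for Nodes
# or move on through the queue.
class WaitState
  def initialize(wait_seconds)
    @wait_seconds = wait_seconds
    @deadline = nil
  end

  # Called when a Job's constraints are unmet: start the clock once.
  def note_unsatisfied(now = Time.now)
    @deadline ||= now + @wait_seconds
  end

  # While the deadline has not passed, the loop holds the Job's place;
  # after it passes, the loop may skip ahead to the next queued Job.
  def still_waiting?(now = Time.now)
    @deadline ? now < @deadline : false
  end
end
```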

Revision 505f5c37 (diff)
Added by Brett Smith over 6 years ago

2880: Don't dispatch Jobs until runtime constraints are met.

This retains the same FIFO approach to the Job queue that
crunch-dispatch currently uses, but now when it encounters a Job whose
constraints are not met:

  • it may wait for a while to see if the Node Manager makes Nodes
    available, if it hasn't done that this hour; and
  • it leaves that Job in the queue and tries to process the next one.

See #2880 for further background. The exact parameters of "waiting
for Nodes" will probably need tuning, but that will be easier to do
after it's been in production for a while.

Revision 82c4697b
Added by Brett Smith over 6 years ago

Merge branch '2880-crunch-dispatch-node-constraints'

Closes #2880, #2976, #2993.

Revision fd4efd53 (diff)
Added by Brett Smith over 6 years ago

Make Job runtime constraints documentation up-to-date.

Refs #2879, #2880.

History

#1 Updated by Brett Smith over 6 years ago

  • Project changed from Umbrella Project to Arvados

#2 Updated by Brett Smith over 6 years ago

  • Assigned To set to Brett Smith

#3 Updated by Brett Smith over 6 years ago

  • Description updated (diff)

Need to have a conversation with Ward to flesh out this story and figure out exactly where the Node Manager ends and crunch-dispatch begins.

How does crunch-dispatch figure out which resources are available on the compute nodes? Is this available from SLURM, or should we record it somewhere? If the latter, does it make sense to have the Node include this information when it pings the API server?

#4 Updated by Brett Smith over 6 years ago

Ward and I discussed this, and we're agreed that for this sprint, we're going to do the simplest thing that can possibly work:

  • There will be no communication between the Node Manager and Crunch at this time. They will both ask the API server for the Job queue, and only use the information in there to make resource allocation decisions.
  • Because we currently don't have a concept of Job priority, and it's not in this sprint, it seems best to stick pretty closely to Crunch's current FIFO strategy for working the Job queue. However, we need to take precautions to make sure that a Job with unreasonably large resource requirements at the front of the queue doesn't prevent us from making progress on the rest of it.
  • Planned implementation: When the Job at the front of the queue can't be started because resource requirements aren't met, Crunch will wait for a few minutes to see if the Node Manager makes those resources available. If it does, great; proceed as normal. If not, continue through the queue and start the first job that can be run with available resources. Make sure that this wait only happens every so often, so lots of queue activity doesn't cause lots of waiting.
  • If there's not currently a good way to get resource information, add it to the Node pings.
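The FIFO-with-skip pass planned above can be sketched roughly as follows. This is a simplified simulation; `job_fits?`, `dispatch_pass`, the hash keys, and the queue representation are all hypothetical stand-ins for illustration, not the actual crunch-dispatch code (which would also handle the hourly wait window):

```ruby
# Simplified sketch of the planned dispatch pass: walk the queue in
# FIFO order, start every Job whose resource constraints are currently
# satisfiable, and leave the rest queued for a later pass.

# A Job "fits" if enough Nodes satisfy its minimum RAM and scratch
# requirements.
def job_fits?(job, nodes)
  usable = nodes.count do |n|
    n[:ram_mb] >= job[:min_ram_mb_per_node] &&
      n[:scratch_mb] >= job[:min_scratch_mb_per_node]
  end
  usable >= (job[:min_nodes] || 1)
end

def dispatch_pass(queue, nodes)
  started, skipped = [], []
  queue.each do |job|
    (job_fits?(job, nodes) ? started : skipped) << job
  end
  [started, skipped]
end
```

With one 8 GiB node available, a 64 GiB Job at the front of the queue is left queued while a smaller Job behind it still starts, which is the progress guarantee the plan is after.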

#5 Updated by Ward Vandewege over 6 years ago

Reviewed 2880-compute-ping-stats.

The only thought I had was that it probably will make sense to report scratch space per partition at some point. But I think for now it's good enough. We need to come up with scratch space conventions at some point (mount points, etc), and when we do that, we can adjust the reporting. So, LGTM.
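As a rough sketch of the kind of resource reporting reviewed here, a compute node on Linux could derive its total RAM from /proc/meminfo before including the figure in its ping. The helper name is hypothetical and this is not the actual 2880-compute-ping-stats code; per-partition scratch reporting, as suggested above, would be a later refinement:

```ruby
# Sketch: parse total RAM (MiB) out of /proc/meminfo-style text.
# /proc/meminfo reports MemTotal in kB, so divide by 1024 for MiB.
def total_ram_mb(meminfo_text)
  if meminfo_text =~ /^MemTotal:\s+(\d+)\s+kB/
    $1.to_i / 1024
  end
end
```

Scratch space could be gathered analogously from the free space on the scratch mount point at ping time.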

#6 Updated by Peter Amstutz over 6 years ago

  • nodes_available_for_job_now() looks like it will always return a list of size min_node_count (or nil) even if there are more than min_node_count nodes available. Based on the description "crunch-dispatch should start the biggest job that it can", you should move the if usable_nodes.count >= min_node_count outside of the find_each loop.
  • Using sleep() to block nodes_available_for_job also blocks anything else that crunch-dispatching should be doing, such as handling already running jobs or pipelines.
  • As I understand it, it will only go into a wait once every hour, and once the wait happens, it automatically skips stalled jobs, regardless of what has happened in the queue in the mean time (so job A gets to wait for resources, but job B will be skipped immediately in favor of job C if B was queued within an hour of A). This seems counterintuitive and unfair to job B.

#7 Updated by Brett Smith over 6 years ago

Peter Amstutz wrote:

  • nodes_available_for_job_now() looks like it will always return a list of size min_node_count (or nil) even if there are more than min_node_count nodes available. Based on the description "crunch-dispatch should start the biggest job that it can", you should move the if usable_nodes.count >= min_node_count outside of the find_each loop.

The "run the biggest job we can" idea was to launch the Job with the heaviest resource requirements, not to give Jobs more resources than they need. We don't want to do the latter right now because we don't have any infrastructure to revoke excess resources from running Jobs in order to start others—I think that's more of a Crunch v2 thing. Plus, this idea was generally obsoleted by the decision in note 4 to stick closer to the existing FIFO scheduling.

  • As I understand it, it will only go into a wait once every hour, and once the wait happens, it automatically skips stalled jobs, regardless of what has happened in the queue in the mean time (so job A gets to wait for resources, but job B will be skipped immediately in favor of job C if B was queued within an hour of A). This seems counterintuitive and unfair to job B.

You understand correctly, and you're right that it's unfair to B. It's a trade-off that we're making in order to keep the Job queue generally flowing. If we allowed a wait for every Job, then the sudden arrival of a bunch of large Jobs could stall progress through the queue for an extended period of time. ("I'm waiting for resources for Job A. There still aren't resources? Okay, I'm waiting for resources for Job B. There still aren't resources? Okay…") This seems like a worse problem generally.

If it helps, note that crunch-dispatch will still check to see if Nodes are available for Jobs at the front of the queue before skipping them. Once Job A finishes, Job B gets the first chance to claim the resources that it freed.

I'll get to work on fixing the sleep issue.

#8 Updated by Brett Smith over 6 years ago

Brett Smith wrote:

I'll get to work on fixing the sleep issue.

Done in 139728cc. Merged with master; the branch at 3d371ed1 is ready for another look.

#9 Updated by Peter Amstutz over 6 years ago

Looks good to me

#10 Updated by Brett Smith over 6 years ago

  • Status changed from New to Resolved
  • % Done changed from 41 to 100

Applied in changeset arvados|commit:82c4697bf24b10f3fb66d303ae73499095b5742a.
