Idea #2880

closed

Component/job can specify minimum memory and scratch space for worker nodes, and Crunch enforces these requirements at runtime

Added by Tom Clegg almost 10 years ago. Updated almost 10 years ago.

Status: Resolved
Priority: Normal
Assigned To: Brett Smith
Category: -
Story points: 3.0

Description

Implement runtime constraints

The runtime constraints are currently specced as:

  • min_ram_mb_per_node - The minimum amount of RAM (MiB) that must be available on each Node.
  • min_scratch_mb_per_node - The minimum amount of disk space (MiB) that must be available for local caching on each Node.

For now, crunch-dispatch should start the biggest job that it can given the Nodes that are currently available.
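
A job submission that uses these constraints might look roughly like this, expressed as a Ruby hash (only the two constraint names come from the spec above; everything else is illustrative):

    # Hypothetical job attributes; values are examples only.
    job_attrs = {
      script: "my-analysis",
      runtime_constraints: {
        min_ram_mb_per_node: 8192,       # need at least 8 GiB of RAM per node
        min_scratch_mb_per_node: 20480,  # and 20 GiB of local scratch space
      },
    }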


Subtasks 4 (0 open, 4 closed)

  • Task #2967: Compute nodes report resource information in pings (Resolved, Brett Smith, 06/05/2014)
  • Task #2975: Review 2880-compute-ping-stats (Resolved, Brett Smith, 06/05/2014)
  • Task #2993: Review 2880-crunch-dispatch-node-constraints-wip (Resolved, Brett Smith, 06/06/2014)
  • Task #2976: Crunch only starts jobs when hardware constraints are satisfied (Resolved, Brett Smith, 06/05/2014)
#1

Updated by Brett Smith almost 10 years ago

  • Project changed from 37 to Arvados
#2

Updated by Brett Smith almost 10 years ago

  • Assigned To set to Brett Smith
#3

Updated by Brett Smith almost 10 years ago

  • Description updated (diff)

Need to have a conversation with Ward to flesh out this story better and figure out exactly where the Node Manager ends and crunch-dispatch begins.

How does crunch-dispatch figure out which resources are available on the compute nodes? Is this available from SLURM, or should we record it somewhere? If the latter, does it make sense to have the Node include this information when it pings the API server?

#4

Updated by Brett Smith almost 10 years ago

Ward and I discussed this, and we're agreed that for this sprint, we're going to do the simplest thing that can possibly work:

  • There will be no communication between the Node Manager and Crunch at this time. They will both ask the API server for the Job queue, and only use the information in there to make resource allocation decisions.
  • Because we currently don't have a concept of Job priority, and it's not in this sprint, it seems best to stick pretty closely to Crunch's current FIFO strategy for working the Job queue. However, we need to take precautions so that a Job with unreasonably large resource requirements at the front of the queue doesn't prevent us from making progress on the rest of it.
  • Planned implementation (sketched after this list): When the Job at the front of the queue can't be started because its resource requirements aren't met, Crunch will wait a few minutes to see if the Node Manager makes those resources available. If it does, great; proceed as normal. If not, continue through the queue and start the first Job that can run with the available resources. Make sure this wait only happens every so often, so lots of queue activity doesn't cause lots of waiting.
  • If there's not currently a good way to get resource information, add it to the Node pings.
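
A minimal sketch of that wait-then-skip behavior (helper names such as job_queue, nodes_available_for? and start_job are hypothetical stand-ins, not the real crunch-dispatch code):

    WAIT_FOR_NODES    = 5 * 60   # seconds to give the Node Manager
    MIN_WAIT_INTERVAL = 60 * 60  # wait at most once per hour

    def dispatch_queue
      @last_wait ||= Time.at(0)
      queue = job_queue
      first = queue.first
      return if first.nil?

      if !nodes_available_for?(first) && Time.now - @last_wait > MIN_WAIT_INTERVAL
        # Give the Node Manager a chance to bring up suitable nodes.
        @last_wait = Time.now
        sleep WAIT_FOR_NODES
      end

      # Start the first Job (FIFO order) whose constraints can be met now.
      runnable = queue.detect { |job| nodes_available_for?(job) }
      start_job(runnable) if runnable
    end
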
#5

Updated by Ward Vandewege almost 10 years ago

Reviewed 2880-compute-ping-stats.

The only thought I had was that it will probably make sense to report scratch space per partition at some point. But I think it's good enough for now. We need to come up with scratch space conventions (mount points, etc.) eventually, and when we do, we can adjust the reporting. So, LGTM.
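
For context, the kind of per-node resource information being discussed might look something like this when reported in a ping (field names here are illustrative, not necessarily the ones the branch uses):

    # Illustrative ping payload fragment; the branch's actual field
    # names and structure may differ.
    ping_info = {
      total_cpu_cores: 16,
      total_ram_mb: 64415,
      total_scratch_mb: 426740,  # one figure for now, not broken out per partition
    }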

#6

Updated by Peter Amstutz almost 10 years ago

  • nodes_available_for_job_now() looks like it will always return a list of size min_node_count (or nil) even if there are more than min_node_count nodes available. Based on the description "crunch-dispatch should start the biggest job that it can", you should move the if usable_nodes.count >= min_node_count outside of the find_each loop. (See the sketch after this list.)
  • Using sleep() to block nodes_available_for_job also blocks anything else that crunch-dispatch should be doing, such as handling already running jobs or pipelines.
  • As I understand it, it will only go into a wait once every hour, and once the wait happens, it automatically skips stalled jobs, regardless of what has happened in the queue in the meantime (so job A gets to wait for resources, but job B will get skipped immediately in favor of job C if B was queued within an hour of A). This seems counterintuitive and unfair to job B.
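
A rough sketch of the structure described in the first point, with the size check inside the find_each loop (names reconstructed from the comment; not the actual source):

    def nodes_available_for_job_now(job)
      usable_nodes = []
      Node.find_each do |node|
        next unless satisfies_constraints?(node, job)  # hypothetical helper
        usable_nodes << node
        if usable_nodes.count >= min_node_count
          return usable_nodes   # returns exactly min_node_count nodes
        end
      end
      nil
    end
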
#7

Updated by Brett Smith almost 10 years ago

Peter Amstutz wrote:

  • nodes_available_for_job_now() looks like it will always return a list of size min_node_count (or nil) even if there are more than min_node_count nodes available. Based on the description "crunch-dispatch should start the biggest job that it can", you should move the if usable_nodes.count >= min_node_count outside of the find_each loop.

The "run the biggest job we can" idea was to launch the Job with the heaviest resource requirements, not to give Jobs more resources than they need. We don't want to do the latter right now because we don't have any infrastructure to revoke excess resources from running Jobs in order to start others—I think that's more of a Crunch v2 thing. Plus, this idea was generally obsoleted by the decision in note 4 to stick closer to the existing FIFO scheduling.

  • As I understand it, it will only go into a wait once every hour, and once the wait happens, it automatically skips stalled jobs, regardless of what has happened in the queue in the meantime (so job A gets to wait for resources, but job B will get skipped immediately in favor of job C if B was queued within an hour of A). This seems counterintuitive and unfair to job B.

You understand correctly, and you're right that it's unfair to B. It's a trade-off that we're making in order to keep the Job queue generally flowing. If we allowed a wait for every Job, then the sudden arrival of a bunch of large Jobs could stall progress through the queue for an extended period of time. ("I'm waiting for resources for Job A. There still aren't resources? Okay, I'm waiting for resources for Job B. There still aren't resources? Okay…") This seems like a worse problem generally.

If it helps, note that crunch-dispatch will still check to see if Nodes are available for Jobs at the front of the queue before skipping them. Once Job A finishes, Job B gets the first chance to claim the resources that it freed.

I'll get to work on fixing the sleep issue.
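
One way to avoid the blocking sleep is to track a deadline and let the dispatch loop keep running; a minimal sketch of that idea (hypothetical names, and not necessarily how the eventual fix works):

    WAIT_FOR_NODES = 5 * 60  # seconds to keep waiting for suitable nodes

    def nodes_available_for_job(job)
      nodes = nodes_available_for_job_now(job)
      return nodes if nodes

      @wait_started_at ||= Time.now
      if Time.now - @wait_started_at < WAIT_FOR_NODES
        :wait   # caller re-checks on its next pass instead of blocking
      else
        @wait_started_at = nil
        nil     # deadline passed; caller moves on down the queue
      end
    end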

#8

Updated by Brett Smith almost 10 years ago

Brett Smith wrote:

I'll get to work on fixing the sleep issue.

Done in 139728cc. Merged with master; the branch at 3d371ed1 is ready for another look.

#9

Updated by Peter Amstutz almost 10 years ago

Looks good to me

#10

Updated by Brett Smith almost 10 years ago

  • Status changed from New to Resolved
  • % Done changed from 41 to 100

Applied in changeset arvados|commit:82c4697bf24b10f3fb66d303ae73499095b5742a.
