Bug #12199

Don't schedule jobs on nodes which are too much bigger than requested

Added by Tom Morris almost 3 years ago. Updated about 2 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
01/29/2018
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
3.0
Release relationship:
Auto

Description

Add cloud node size and price info to site configuration object:
  • cloud instance id
  • price
  • cores
  • ram
  • disk
Add Go package (server/dispatch_cloud) that exports a node size calculator.
  • Load cluster config
  • Given an arvados.Container record, return the appropriate instance type (not just a string, but the whole type record from the config)
  • Return a predictable error (ErrInstanceTypesNotConfigured?) if none are listed
  • Return a predictable error (ErrConstraintsNotSatisfiable?) if no instance type is big enough
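A minimal sketch of the calculator's selection logic (in Python for illustration; the real package is Go, and the dict keys and exception names here are assumptions, not the actual arvados.Container fields):

```python
class InstanceTypesNotConfigured(Exception):
    """No instance types are listed in the cluster config."""

class ConstraintsNotSatisfiable(Exception):
    """No configured instance type is big enough for the container."""

def choose_instance_type(instance_types, container):
    """Return the cheapest configured instance type that satisfies the
    container's resource requests.  `instance_types` entries are dicts
    with "price", "cores", "ram", and "disk" keys; `container` holds the
    requested minimums (field names here are illustrative)."""
    if not instance_types:
        raise InstanceTypesNotConfigured()
    candidates = [
        t for t in instance_types
        if t["cores"] >= container["vcpus"]
        and t["ram"] >= container["ram"]
        and t["disk"] >= container["disk"]
    ]
    if not candidates:
        raise ConstraintsNotSatisfiable()
    # Ties on price resolve arbitrarily in this sketch; return the whole
    # type record, not just its name, as the description requires.
    return min(candidates, key=lambda t: t["price"])
```

The two predictable errors let the caller distinguish "this site has no cloud config" (omit constraints) from "this container can never run here" (cancel it).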

In crunch-dispatch-slurm, before submitting a slurm job, use the dispatch_cloud calculator to determine the appropriate instance type. Add it to the sbatch command line in "--constraints". Omit constraints if the calculator returns ErrInstanceTypesNotConfigured. Cancel the container if the calculator returns ErrConstraintsNotSatisfiable.

In nodemanager, for each slurm job with constraints, use the indicated instance type for the wishlist.

In the node boot script, use scontrol to set the slurm feature/constraint to match whatever in the Arvados node record.

Optional (or not?): In nodemanager, load the site configuration file, and use the instance types found there instead of the ones in the "sizes" section of nodemanager's own config file.


Subtasks

Task #12971: Review 12199-dispatch-to-node-type (Resolved, Peter Amstutz)


Related issues

Related to Arvados - Bug #12908: install docs for slurm don't allow multiple jobs per node (Resolved)

Related to Arvados - Bug #12794: [Node Manager] Behave smarter in environments where scratch space can be arbitrarily sized, like GCP and AWS (New)

Related to Arvados - Bug #13166: [node manager] wishlist should consist of top priority containers (Resolved, 03/26/2018)

Associated revisions

Revision 8dc4be85 (diff)
Added by Ward Vandewege almost 3 years ago

This is a crude first cut of a feature to effectively turn
min_cores_per_node into exact_cores_per_node. Azure specific.

Definitely not merge ready.

refs #12199

Arvados-DCO-1.1-Signed-off-by: Ward Vandewege <>

Revision 924c37b6 (diff)
Added by Ward Vandewege almost 3 years ago

Take 2

refs #12199

Arvados-DCO-1.1-Signed-off-by: Ward Vandewege <>

Revision ba1cbe58
Added by Tom Clegg over 2 years ago

Merge branch '12199-dispatch-to-node-type'

refs #12199

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

Revision de9a366c (diff)
Added by Tom Clegg over 2 years ago

Add missing parameter to update call.

fixes #12199

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

Revision d9c9b4b6
Added by Tom Clegg over 2 years ago

Merge branch '13098-assign-hostname'

refs #13098
refs #12199

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

History

#1 Updated by Tom Clegg almost 3 years ago

The definition of "bigger than requested" here should take into account that
  • container requirements specify what the container is expected to make use of, and are deliberately isolated from the node sizes that happen to be available at runtime
  • the goal is to control cost, not resource availability
For example:
  • Define "minprice" as the price of the smallest node type that can satisfy a container's requirements.
  • A container must not run on a node whose price exceeds minprice * some_ratio + some_constant where some_ratio and some_constant are configs.

This would let the site admin set a policy like "try to pack containers onto bigger nodes, but don't risk occupying a node that's twice as expensive" or "...wasting more than $1/hr per container". It would work well even if the available node types have prices like $1.00, $1.01, $1.02, $2, $5, where a rule like "no more than one size up" would fail.
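The proposed rule is simple enough to sketch directly (illustrative Python; the function and parameter names are made up, not the eventual config keys):

```python
def node_is_allowed(node_price, min_price, some_ratio, some_constant):
    """Note-1 policy: a container may only run on a node whose price does
    not exceed minprice * some_ratio + some_constant, where minprice is
    the price of the smallest node type satisfying the container."""
    return node_price <= min_price * some_ratio + some_constant

# With node types priced $1.00, $1.01, $1.02, $2, $5 and a container whose
# cheapest viable type costs $1.00, ratio=2 and constant=0 allow anything
# up to $2 but reject the $5 node -- where "no more than one size up"
# would have stopped at $1.01.
```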

(What should some_ratio and some_constant be called in the config file?)

#2 Updated by Peter Amstutz almost 3 years ago

  1. When dispatching, need to prevent jobs from being scheduled on too-big nodes. We can use the sbatch option --extra-node-info to set the upper limit.
  2. When determining if the set of cloud nodes satisfies the job queue, Node Manager needs to discard too-big nodes from consideration (which may imply booting a smaller node)

Need to define a policy for what is "too big". For example:

  1. Find the closest node size that accommodates the resource request (requires that the scheduler is aware of available node sizes)
  2. Restrict scheduling to nodes with up to 2x the core count of the preferred node size to ensure that nodes are at least 50% utilized.

#3 Updated by Ward Vandewege almost 3 years ago

After brainstorming with Peter, I made a very crude, simple version of this in branch 12199-cost-control at 8dc4be851bd9a896c377e19cfbb06eeb70d5269a

Not merge ready for sure. Maybe we could drive this from a few config parameters?

#4 Updated by Peter Amstutz almost 3 years ago

The core counts could be an array [1, 2, 4, 8, 16, 20] for rounding the requested core count up to the nearest size without hardcoding any special cases. This could be a configuration parameter.
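That rounding rule can be sketched as (illustrative Python; the default array mirrors the example above):

```python
def round_up_cores(requested, sizes=(1, 2, 4, 8, 16, 20)):
    """Round the requested core count up to the nearest configured size,
    with no hardcoded special cases."""
    for size in sorted(sizes):
        if size >= requested:
            return size
    raise ValueError("no configured size accommodates %d cores" % requested)
```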

#5 Updated by Ward Vandewege almost 3 years ago

Peter Amstutz wrote:

The core counts could be an array [1, 2, 4, 8, 16, 20] for rounding the requested core count up to the nearest size without hardcoding any special cases. This could be a configuration parameter.

Good call.

New take at 924c37b65cdf08bfb46bb8ee206d3c6604990eaa

#6 Updated by Nico César almost 3 years ago

I think we need a configuration that matches arvados-node-manager's available sizes. This should be a centralized configuration; as it stands, it's a source of possible misconfigurations.

I know we want to get this out the door as soon as possible, but is it possible to have a more detailed configuration in the API server (that is, all the node types and sizes, etc.) that arvados-node-manager can pull at startup? That would mean a change in the available node types requires an API server restart. Better still would be to keep it somewhere more dynamic, like the database, or in a centralized configuration store like consul.

#7 Updated by Tom Morris almost 3 years ago

  • Target version changed from Arvados Future Sprints to To Be Groomed

#8 Updated by Ward Vandewege almost 3 years ago

This approach isn't going to be as simple as originally thought. The problem is that jobs can have other constraints that require bigger nodes: for example, a cores constraint of 1 core plus a memory constraint that effectively requires a 2-core node.

To keep those jobs from becoming unstartable, the c-d patch would basically need to duplicate node manager's logic for deciding whether a job fits on a node. That seems like a silly path to go down.

#9 Updated by Tom Clegg over 2 years ago

See note-1 above. This is about reducing price, not cpus and memory. The config knobs and code should compare prices, not cpus and memory.

Node manager is the scheduler here. Currently NM only considers the minimum resources to accommodate a job. Here it needs to consider the maximum price, too.

"Features" might be the easiest way to communicate this decision to slurm: at a site where node manager is running, when submitting jobs to the slurm queue the dispatcher sets Features=UUID (where UUID is the job/container UUID). It will become runnable when nodemanager adds the UUID to a node's Features list. (This assumes slurm will accept a job submission whose features are not currently offered by any node. Generic resources with name=UUID count=1 might also work.)

#11 Updated by Tom Clegg over 2 years ago

  • Subject changed from Don't schedule jobs on nodes which are bigger than requested to Don't schedule jobs on nodes which are too much bigger than requested

#12 Updated by Tom Clegg over 2 years ago

Tom Clegg wrote:

For example:
  • Define "minprice" as the price of the smallest node type that can satisfy a container's requirements.
  • A container must not run on a node whose price exceeds minprice * some_ratio + some_constant where some_ratio and some_constant are configs.

(What should some_ratio and some_constant be called in the config file?)

maybe max_alloc_price_margin_ratio and max_alloc_price_margin?

Alternative formula: max(minprice * some_ratio, minprice + some_constant) < actualprice
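For comparison, the two formulas differ only in whether the ratio and constant margins add up or the larger one wins (illustrative Python; names follow the note above):

```python
def too_expensive_additive(actual_price, min_price, some_ratio, some_constant):
    """Original note-1 formula: cap = minprice * some_ratio + some_constant."""
    return actual_price > min_price * some_ratio + some_constant

def too_expensive_max(actual_price, min_price, some_ratio, some_constant):
    """Alternative formula: max(minprice * some_ratio, minprice + some_constant)
    < actualprice -- the larger of the two allowances wins, not their sum."""
    return max(min_price * some_ratio, min_price + some_constant) < actual_price
```

With minprice = $1, ratio = 2, constant = 1, the additive cap is $3 while the max-based cap is $2.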

#13 Updated by Tom Morris over 2 years ago

  • Story points set to 3.0

#14 Updated by Tom Clegg over 2 years ago

(from meeting) we might not even need some_ratio>0 or some_constant>0 so we can start with the simpler version, "jobs/containers can only ever run on the cheapest compatible node type".

#15 Updated by Peter Amstutz over 2 years ago

From 3 Jan 2018 discussion:

  • API server gains a node_sizes table with fields:
    • uuid
    • cloud instance id
    • price
    • cores
    • ram
    • disk
  • When creating a container record from a container request, determine the appropriate node size and put the uuid in "node_size" field under "scheduling_parameters"
  • In crunch-dispatch-slurm, look at "node_size" under "scheduling_parameters" and add it to the sbatch command line in "--constraints"
  • In nodemanager, for each slurm job with constraints, consult the API server's "node_types" table to get the corresponding cloud node type for the wishlist.
  • The "Size" sections of the node manager configuration may be ignored in favor of the information from the API server's node_types.

#16 Updated by Peter Amstutz over 2 years ago

  • Description updated (diff)

#17 Updated by Peter Amstutz over 2 years ago

  • Description updated (diff)

#18 Updated by Tom Clegg over 2 years ago

more background from meeting: The approach in note-9 won't work because nodemanager is too brittle/confusing to accommodate the necessary change, i.e., keep track of which containers made us decide to bring up new nodes / keep idle nodes alive, and propagate that information to slurm. So we're looking for a solution that involves touching nodemanager less than that.

I'm not keen on using Postgres for the node size list. That seems like site configuration, which goes in /etc/ until we start using consul/etcd/whatever.

Determining the right node size for a container is something that should be done by the cloud-VM dispatcher (which doesn't exist yet -- "crunch-dispatch-cloud-node"?) when trying to dispatch, rather than API server.

If we can't get all the way there in one go, and the cloud case still relies on slurm to execute crunch-run on the workers, then using scheduling_parameters to communicate the node-size information to crunch-dispatch-slurm seems like a good hack.

Perhaps there is a way we can start implementing crunch-dispatch-cloud-node, instead of writing new throwaway code in API server?

#19 Updated by Tom Clegg over 2 years ago

  • Related to Bug #12908: install docs for slurm don't allow multiple jobs per node added

#20 Updated by Tom Clegg over 2 years ago

We're reminded in #12908 that the slurm config in our install docs prevents slurm from scheduling multiple small jobs on a big node. IIRC we recommended this non-sharing config (knowing it would be wasteful) only because it allowed crunch1 and crunch2 to coexist on the same slurm cluster. The idea was that once crunch1 was no longer used, the config should be updated to "select/cons_res", but we missed mentioning that in the docs.

There are scenarios where packing small containers onto a large node is faster and/or cheaper than bringing up multiple small nodes (less idle node waste and overall latency). The advantages are most likely when the cluster is very busy with a diverse workload. But with select/linear, I'd expect the waste ratio to get even higher under those same conditions.

Just to clarify: are we pursuing the "prevent node-sharing" approach here because in practice we've found slurm's node-sharing to be so much more wasteful than churning nodes? Or because we haven't tried it (e.g., because we forgot it's turned off)? Or because we need to address efficiency on clusters that use both crunch1 and crunch2?

#21 Updated by Tom Clegg over 2 years ago

...it turns out that the answer is that select/cons_res removes slurm's ability to reconfigure node sizes on the fly, so it's incompatible with our slurm-on-cloud approach.

#22 Updated by Tom Clegg over 2 years ago

  • Description updated (diff)

#23 Updated by Tom Morris over 2 years ago

  • Target version changed from To Be Groomed to 2018-01-31 Sprint

#24 Updated by Tom Clegg over 2 years ago

  • Assigned To set to Tom Clegg

#25 Updated by Tom Clegg over 2 years ago

  • Description updated (diff)

#26 Updated by Tom Clegg over 2 years ago

  • Status changed from New to In Progress

#27 Updated by Tom Clegg over 2 years ago

Slight complication: a slurm job submission (via srun or sbatch) is immediately rejected if it refers to features that haven't been advertised by any node.

Example cluster:
node      features  state
compute0  m1.large  idle
compute1  (none)    idle
compute2  m2.large  down
Test cases:
  • srun --constraint='m1.large' → runs on compute0
  • srun --constraint='m2.large' → queued; runs on compute1 if someone does scontrol update NodeName=compute1 Features=m2.large
  • srun --constraint='m3.large' → srun: error: Unable to allocate resources: Invalid feature specification
  • srun --constraint='m4.large|m1.large' → srun: error: Unable to allocate resources: Invalid feature specification (even though compute0 qualifies to run the job regardless of what m4.large means, this is treated as a syntax error)

(Aside: Generic resources work the same way. Also, once a feature has been added to a node, slurm does appear to remember that the feature name is valid, even after it's been removed from all nodes.)

So it seems that the desired "wait until an idle node appears with this feature" behavior can be achieved by configuring a "dummy" node (a hostname that will never be up) which advertises all of the node-type features at once.

scontrol update NodeName=nullnode0 Features=m1.large,m2.large,m4.large State=down

#28 Updated by Tom Clegg over 2 years ago

12199-dispatch-to-node-type @ bad6499910388c17c4a54b3128f361aa36a670e1

crunch-dispatch-slurm now tries to load /etc/arvados/config.yml. If /etc/arvados/config.yml exists and InstanceTypes has at least one entry:
  • At startup and periodically (~1m), crunch-dispatch-slurm tests whether all of the configured node types are accepted by slurm as valid feature flags. If not, it adds them all to compute0, then clears them from compute0. This has the side effect of making slurm recognize them as valid features.
  • When submitting a slurm job to run a container, crunch-dispatch-slurm chooses the cheapest node type with enough capacity for the container.
    • If no node type has enough capacity, it logs an error and cancels the container.
    • Otherwise, it adds a --constraint=X argument to the sbatch command line, where X is the node type.

This means that -- if there are any InstanceTypes configured -- each container will sit in the slurm queue until there is an idle slurm node that's tagged with the desired instance type. Currently, nodemanager doesn't do this. So it's premature to add any InstanceTypes to your /etc/arvados/config.yml.
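The per-container decision described above might look roughly like this (illustrative Python; the actual dispatcher is Go and these names are invented):

```python
def slurm_submit_args(container, instance_types):
    """Return extra sbatch arguments for a container, following the rules
    above: no constraint when no instance types are configured; a
    --constraint naming the cheapest adequate type otherwise; None to
    signal "log an error and cancel the container" when no type is big
    enough.  `instance_types` entries are dicts with "name", "price",
    "cores", and "ram" keys (all illustrative)."""
    if not instance_types:
        return []  # InstanceTypes not configured: omit constraints entirely
    adequate = [t for t in instance_types
                if t["cores"] >= container["vcpus"]
                and t["ram"] >= container["ram"]]
    if not adequate:
        return None  # constraints not satisfiable: caller cancels container
    cheapest = min(adequate, key=lambda t: t["price"])
    return ["--constraint=" + cheapest["name"]]
```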

This could be used as is on a static slurm cluster. However:
  • A static slurm cluster probably doesn't need this feature -- it can just use select/cons_res and let slurm do the scheduling.
  • If a static slurm cluster uses this feature anyway, make sure either (a) all instance types are accepted by slurm as feature names, or (b) there is a dummy node called "compute0" and crunch-dispatch-slurm runs as root, so crunch-dispatch-slurm can use the "add all features to compute0 when needed" kludge described above.

This branch includes some refactoring in crunch-dispatch-slurm to eliminate mutable globals, with a view to [a] making testing a bit more sane and [b] making progress towards offering it as a module so it can also be invoked as "arvados-server dispatch-slurm".

#29 Updated by Tom Clegg over 2 years ago

12199-dispatch-to-node-type @ 3ccc70f4bb06bab6c0b3c71f555cba24cc5c6a47

api server adds an "assign_slot" flag to node#create and node#update APIs, which assigns a slot number and hostname. Nodemanager uses this flag when preparing to create a new node.
  • this makes it possible for node manager to know the hostname before the node joins SLURM. This is needed to avoid a race condition where a new node of size A is already tagged with size=B when it comes up (because the last node that used the same hostname was size B) and SLURM immediately uses it to schedule a job requiring size=B.
  • incidentally, this improves behavior in a misconfiguration/bug scenario we have run into before, where node manager decides to bring up a node but apiserver doesn't have any available slots (aborting the "create" process is better than bringing up a node that fails ping, timing out, shutting it down, and repeating ad nauseam).
nodemanager calls "scontrol update NodeName=foo Weight=1234 Features=instancetype=bar"
  • after allocating the Arvados node record and hostname (before creating the cloud instance), and
  • on each paired node each time it receives the latest node list from Arvados

#30 Updated by Peter Amstutz over 2 years ago

Some observations as I go through:

  1. I assume it evaluates from left to right and/or multiplication has higher precedence, but it would be better to make the intent explicit by adding parens:
needRAM = needRAM * 100 / int64(100-discountConfiguredRAMPercent)
  • The Arvados node list is refreshed every 1-2 minutes (depending on configuration). Running scontrol update hundreds of times a minute increases the risk of falling over under load. It would be more efficient to parse the output of sinfo for all nodes and only update mismatches.

My recommendation: ArvadosNodeListMonitorActor already runs "sinfo", it can additionally parse out the features assigned to each slurm node and attach that to the arvados record (which is just a dict). The sync_node method can then check that before running scontrol.
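The mismatch check Peter suggests amounts to something like this (illustrative Python; the output format assumed here, one "hostname feat1,feat2" line per node as produced by sinfo's %n and %f format specifiers, is a guess at how it would be invoked):

```python
def nodes_needing_update(sinfo_output, wanted_features):
    """Parse sinfo output of the form "hostname feat1,feat2" and return
    the hostnames whose current slurm features differ from the features
    we want, so scontrol update is only run for actual mismatches
    instead of on every node list refresh."""
    stale = []
    for line in sinfo_output.splitlines():
        if not line.strip():
            continue
        host, _, feats = line.strip().partition(" ")
        # slurm reports "(null)" for nodes with no features assigned.
        current = set(feats.split(",")) if feats and feats != "(null)" else set()
        if current != set(wanted_features.get(host, [])):
            stale.append(host)
    return stale
```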

#31 Updated by Tom Morris over 2 years ago

  • Target version changed from 2018-01-31 Sprint to 2018-02-14 Sprint

#32 Updated by Tom Clegg over 2 years ago

12199-dispatch-to-node-type @ 3c5378e79de03e312268e31a10fc93650cafbb10
  • add parens to make precedence more obvious.
  • track slurm node features using monitor actor, update in sync_node only when needed.

#33 Updated by Peter Amstutz over 2 years ago

A thought about testing, this is the sort of thing the node manager integration test framework is useful for. I suggest an integration test that boots a node and checks that scontrol is being called as expected on node creation and on an inconsistent sinfo result. (Alternately, you could piggyback on an existing test by adding additional checks.)

#34 Updated by Tom Clegg over 2 years ago

thoughts on integration tests
  • I would say the nodemanager integration test framework is broken. Each test case takes 1.5 minutes. Sometimes the whole test suite aborts with an inscrutable error message and then works on retry. I'm not prepared to add to that. (Any ideas about how to fix it?)
  • Testing that the pieces get wired up the way we expect, sure. But testing "feature A1 of unit A works even when invoked by unit B, not only when invoked by unit tests" seems wrong.
  • An explicit goal of our implementation strategy was to minimize changes to nodemanager, the premise being that it's too brittle to work on. IOW, nodemanager is not a good investment.

#35 Updated by Peter Amstutz over 2 years ago

  • Related to Bug #12794: [Node Manager] Behave smarter in environments where scratch space can be arbitrarily sized, like GCP and AWS added

#36 Updated by Tom Clegg over 2 years ago

12199-dispatch-to-node-type @ 20c51bfacc584f54083da0b3317116089e3af6b4

#37 Updated by Peter Amstutz over 2 years ago

  • Since "dispatchcloud" is a Go package and not technically a "service" it should probably go in sdk/go or lib/ ?
  • ChooseInstanceType doesn't take scratch disk requirement into account.
  • The comment on SlurmNodeTypeFeatureKludge says "to make a feature name valid, we can add it to a dummy node ("compute0"), then remove it". However I don't see why we can't simply assign all the constraints to compute0 once and then leave them there (seems less likely to break in the future).
  • You want to use size.id here (the libcloud ec2 driver has non-unique names such as "Large Instance" for every generation m1.large through m5.large).
    def _update_slurm_size_attrs(self, nodename, size):
        self._update_slurm_node(nodename, [
            'Weight=%i' % int(size.price * 1000),
            'Features=instancetype=' + size.name,
        ])

class ComputeNodeUpdateActor(SlurmMixin, UpdateActorBase):
    def sync_node(self, cloud_node, arvados_node):
        """Keep SLURM's node properties up to date.""" 
        hostname = arvados_node.get("hostname")
        features = arvados_node.get("slurm_node_features", "").split(",")
        sizefeature = "instancetype=" + cloud_node.size.name
...

I kicked off run-tests here: https://ci.curoverse.com/job/developer-run-tests/576/

There are test failures.

#38 Updated by Tom Clegg over 2 years ago

Peter Amstutz wrote:

  • Since "dispatchcloud" is a Go package and not technically a "service" it should probably go in sdk/go or lib/ ?

I was thinking dispatchcloud really should be a service, but for now we're only implementing the node selection code and kludging it into slurm+nodemanager. I'm also trying to move from "lots of code in package main where nobody can import it" to "services as packages". But yeah, lib might be less confusing.

  • The comment on SlurmNodeTypeFeatureKludge says "to make a feature name valid, we can add it to a dummy node ("compute0"), then remove it". However I don't see why we can't simply assign all the constraints to compute0 once and then leave them there (seems less likely to break in the future).

If we leave them there, and someone does actually have a node called compute0, we'll end up trying to run all containers on compute0. With add+remove there's still a race window where that could happen, but at least we don't get stuck in that state permanently.

"Simply leave them there" doesn't work because dynamically configured node features are forgotten during slurm controller restarts.

  • You want to use size.id here (the libcloud ec2 driver has non-unique names such as "Large Instance" for every generation m1.large through m5.large).

Ah, good catch. I updated testutil so it doesn't conveniently set name==id, which turned up one place in the GCE driver where it needlessly assumes name==id.

There are test failures.

12199-dispatch-to-node-type @ 03f277bf4b616f41ef7ed4b195d44dab83d16144

  • ChooseInstanceType doesn't take scratch disk requirement into account.

(todo)

#39 Updated by Tom Clegg over 2 years ago

12199-dispatch-to-node-type @ 2320020a96ae6438482d1c8bbdf43d9c4fd06482
  • merged master
  • account for scratch space ("tmp" mounts)

https://ci.curoverse.com/job/developer-run-tests/586/

#40 Updated by Tom Clegg over 2 years ago

Fixed services/api bug caught by test (job_uuid not returned in #update response).

12199-dispatch-to-node-type @ ce5f1a58b22bcf19bcedfb5d8602290d4a3025ff

https://ci.curoverse.com/job/developer-run-tests/587/

#41 Updated by Peter Amstutz over 2 years ago

This looks good to merge, but let's add subtasks to configure & test on 4xphq and c97qk.

#42 Updated by Tom Clegg over 2 years ago

  • Status changed from In Progress to Resolved

#44 Updated by Peter Amstutz over 2 years ago

  • Status changed from Resolved to Feedback
  • Target version changed from 2018-02-14 Sprint to 2018-02-28 Sprint

Node manager is crashing:

2018-02-15_19:56:05.48271 2018-02-15 19:56:05 pykka[120115] DEBUG: Starting ComputeNodeSetupActor (urn:uuid:08d8f03b-d789-4c33-a3fe-918f34a568cf)
2018-02-15_19:56:05.53990 2018-02-15 19:56:05 ComputeNodeSetupActor.918f34a568cf[120115] WARNING: Re-raising error (no retry): Missing required parameter "uuid" 
2018-02-15_19:56:05.53992 Traceback (most recent call last):
2018-02-15_19:56:05.53992   File "/usr/lib/python2.7/dist-packages/arvnodeman/computenode/__init__.py", line 80, in retry_wrapper
2018-02-15_19:56:05.53992     ret = orig_func(self, *args, **kwargs)
2018-02-15_19:56:05.53992   File "/usr/lib/python2.7/dist-packages/arvnodeman/computenode/dispatch/__init__.py", line 125, in prepare_arvados_node
2018-02-15_19:56:05.53993     body={}, assign_slot=True).execute()
2018-02-15_19:56:05.53993   File "/usr/lib/python2.7/dist-packages/googleapiclient/discovery.py", line 727, in method
2018-02-15_19:56:05.53993     raise TypeError('Missing required parameter "%s"' % name)
2018-02-15_19:56:05.53993 TypeError: Missing required parameter "uuid" 
2018-02-15_19:56:05.54003 2018-02-15 19:56:05 ComputeNodeSetupActor.918f34a568cf[120115] ERROR: Actor error Missing required parameter "uuid" 
2018-02-15_19:56:05.54036 2018-02-15 19:56:05 ComputeNodeSetupActor.918f34a568cf[120115] INFO: finished

#45 Updated by Anonymous over 2 years ago

  • Status changed from Feedback to Resolved

#46 Updated by Peter Amstutz over 2 years ago

Nope, something is still broken with node initialization on 4xphq.

#47 Updated by Peter Amstutz over 2 years ago

  • Status changed from Resolved to Feedback

#48 Updated by Peter Amstutz over 2 years ago

Possibly a false alarm. Nico is working on #13078. Going to leave this open until we have a successful run (in either configuration).

#50 Updated by Tom Clegg over 2 years ago

  • Status changed from Feedback to Resolved

#51 Updated by Peter Amstutz over 2 years ago

  • Related to Bug #13166: [node manager] wishlist should consist of top priority containers added

#52 Updated by Tom Morris about 2 years ago

  • Release set to 17
