Story #7478

[Node Manager] Creates compute nodes using AWS spot instances

Added by Brett Smith over 5 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Node Manager
Target version:
Start date: 05/25/2018
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: 3.0
Release:
Release relationship: Auto

Description

Functional requirements:

  • Requests spot instances, waits for those requests to be fulfilled (minutes?) and launches the instances as compute nodes.
  • For the initial implementation, just bid the standard price rather than trying to design a fancy bidding strategy. We'll still get the cost benefit as long as the spot price is lower.
  • When the bid price is exceeded (hopefully rarely/never), we're likely to lose our entire fleet of compute instances and, perhaps, not be able to start any until demand subsides enough to cause the spot prices to go down. In this scenario, we'll need some configuration knobs to control whether to fall back to on-demand instances, wait for spot instances to become available again, etc.

Implementation details:

  • Enhance libcloud to support AWS spot instances. (Done)
  • API server will have a config option which specifies whether spot instances are enabled or not. If they are enabled, child containers will get created with the spot instances scheduling parameter set.
  • Spot instances will be their own instance type. Node manager needs to manage instance types separately from the libcloud-specified instance type, which is what it currently uses. Node manager will use the new libcloud support to request spot instances when needed. No arvados-cwl-runner required.
  • Nodemanager spot instance handling:
    • [Size <name>] sections in the config currently use the instance type as <name>. Decouple the two: add an instance_type attribute inside the section and leave <name> for description purposes only.
    • Each size section will have a boolean “preemptable” attribute, defaulting to False.
    • Update ServerCalculator & related code so that the instance type is not the unique id of a "nodesize"
    • Update ec2 driver to pass the ex_spot_market=True parameter on the libcloud create_node call
  • Update documentation explaining nodemanager config file format changes
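
The config change described above can be sketched as follows. This is a minimal illustration, not the node manager's actual parser: the sample section names, the `cores` option, and the fallback of `instance_type` to the section name are assumptions made for the example. Only the `[Size <name>]` layout and the `instance_type` and `preemptable` attributes come from the description.

```python
import configparser

# Hypothetical config: two Arvados sizes backed by the same cloud instance type.
SAMPLE_CONFIG = """
[Size m4.large]
instance_type = m4.large
cores = 2

[Size m4.large.spot]
instance_type = m4.large
preemptable = true
cores = 2
"""


def parse_size_sections(text):
    """Return {size_name: spec} for every [Size <name>] section."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    sizes = {}
    for section in parser.sections():
        words = section.split(None, 1)
        if len(words) != 2 or words[0] != "Size":
            continue
        name = words[1]
        sizes[name] = {
            # The section name, not the instance type, identifies the size.
            "id": name,
            # Assumed compatibility rule: older configs used the instance
            # type itself as <name>, so fall back to it when missing.
            "instance_type": parser.get(section, "instance_type", fallback=name),
            # Defaults to False, as stated in the description.
            "preemptable": parser.getboolean(section, "preemptable", fallback=False),
            "cores": parser.getint(section, "cores", fallback=1),
        }
    return sizes


sizes = parse_size_sections(SAMPLE_CONFIG)
```

With this layout, two sizes can share one `instance_type` while differing only in `preemptable`, which is why the instance type can no longer serve as the unique id of a node size.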

Subtasks

Task #13461: Review 7478-anm-spot-instances (Resolved, Peter Amstutz)


Related issues

Related to Arvados - Bug #13649: c-d-s doesn't request a preemptible instance when it should (Resolved, 06/21/2018)

Blocked by Arvados - Story #13051: Spike - Investigate/prototype AWS spot instance support in libcloud (Resolved, 04/18/2018)

Associated revisions

Revision 42a0609a
Added by Lucas Di Pentima over 2 years ago

Merge branch '7478-anm-spot-instances'
Closes #7478

Arvados-DCO-1.1-Signed-off-by: Lucas Di Pentima <>

Revision 2c68e941
Added by Lucas Di Pentima over 2 years ago

Merge branch '7478-invalid-size-not-defined'
Refs #7478

Arvados-DCO-1.1-Signed-off-by: Lucas Di Pentima <>

Revision e135f4e0
Added by Lucas Di Pentima over 2 years ago

Merge branch '7478-anm-libcloud-deps-fix'
Refs #7478

Arvados-DCO-1.1-Signed-off-by: Lucas Di Pentima <>

Revision f9e94997
Added by Lucas Di Pentima over 2 years ago

Merge branch '7478-s-preemptable-preemptible'
Refs #7478

Arvados-DCO-1.1-Signed-off-by: Lucas Di Pentima <>

History

#1 Updated by Tom Morris about 3 years ago

  • Subject changed from [Node Manager] Creates compute nodes from spot instances to [Node Manager] Creates compute nodes using AWS spot instances
  • Description updated (diff)
  • Target version set to To Be Groomed

#3 Updated by Tom Morris about 3 years ago

  • Tracker changed from Bug to Story

#4 Updated by Tom Morris almost 3 years ago

Although there's no support in libcloud, it is available in boto, which might be another option: http://boto.cloudhackers.com/en/latest/ref/ec2.html

#5 Updated by Lucas Di Pentima almost 3 years ago

#6 Updated by Tom Morris almost 3 years ago

We'll pursue the libcloud implementation option and implement spot instances using the default bid price (i.e., the on-demand price).

API server will have a config option which specifies whether spot instances are enabled or not. If they are enabled, child containers will get created with the spot instances scheduling parameter set.

Spot instances will be their own instance type. Node manager needs to manage instance types separately from the libcloud-specified instance type, which is what it currently uses. Node manager will use the new libcloud support to request spot instances when needed. No arvados-cwl-runner required.

#7 Updated by Tom Morris almost 3 years ago

  • Blocked by Story #13051: Spike - Investigate/prototype AWS spot instance support in libcloud added

#8 Updated by Tom Morris almost 3 years ago

  • Story points set to 5.0

#9 Updated by Lucas Di Pentima almost 3 years ago

  • Description updated (diff)

#10 Updated by Tom Morris over 2 years ago

  • Target version changed from To Be Groomed to Arvados Future Sprints

#11 Updated by Tom Morris over 2 years ago

  • Description updated (diff)

#12 Updated by Lucas Di Pentima over 2 years ago

Nodemanager refactoring/updates:

  • Nodemanager spot instance handling:
    • [Size <name>] sections in the config currently use the instance type as <name>. Decouple the two: add an instance_type attribute inside the section and leave <name> for description purposes only.
    • Each size section will have a boolean “preemptable” attribute, defaulting to False.
    • Update ServerCalculator & related code so that the instance type is not the unique id of a "nodesize"
    • Update ec2 driver to pass the ex_spot_market=True parameter on the libcloud create_node call
  • Update documentation explaining nodemanager config file format changes
  • Tests

#13 Updated by Lucas Di Pentima over 2 years ago

  • Description updated (diff)
  • Story points changed from 5.0 to 3.0

#14 Updated by Tom Morris over 2 years ago

  • Target version changed from Arvados Future Sprints to 2018-05-23 Sprint

#15 Updated by Lucas Di Pentima over 2 years ago

  • Assigned To set to Lucas Di Pentima

#16 Updated by Lucas Di Pentima over 2 years ago

  • Status changed from New to In Progress

#17 Updated by Lucas Di Pentima over 2 years ago

  • Target version changed from 2018-05-23 Sprint to 2018-06-06 Sprint

#18 Updated by Lucas Di Pentima over 2 years ago

Updates at 3950ffc94 - Branch 7478-anm-spot-instances

  • Updated libcloud version dependency to use our fork with AWS Spot Instances support
  • Added support for a preemptable scheduling parameter on the API server
  • Added support on Go SDK & dispatchcloud
  • Modified nodemanager to detach node size from instance types, adding the preemptable parameter.
  • Updated the EC2 driver to check for the preemptable parameter and ask for Spot instances when needed.
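
As a rough sketch of the last item, the driver only needs to add one keyword argument when the requested size is preemptable. The `Size` tuple, the `build_create_kwargs` helper, and the `size_id` key are hypothetical stand-ins for this example. `ex_spot_market` is assumed to be the intended spelling of the parameter named in the description, as provided by the libcloud fork mentioned above; it is not called here.

```python
from collections import namedtuple

# Hypothetical stand-in for the node manager's size object.
Size = namedtuple("Size", ["id", "instance_type", "preemptable"])


def build_create_kwargs(size, name):
    """Build keyword arguments for a create_node-style call."""
    kwargs = {"name": name, "size_id": size.instance_type}
    if size.preemptable:
        # Assumed parameter name from the forked libcloud EC2 driver.
        kwargs["ex_spot_market"] = True
    return kwargs


on_demand = build_create_kwargs(Size("m4.large", "m4.large", False), "compute-1")
spot = build_create_kwargs(Size("m4.large.spot", "m4.large", True), "compute-2")
```

On-demand requests are left untouched, so non-preemptable sizes behave as they did before the change.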

I'm hopeful that propagating node sizes metadata by passing the CloudSizeWrapper object is a good approach. Unit tests are failing because of this (I don't want to start correcting them before confirming that's a good approach), but integration tests are passing.

#19 Updated by Peter Amstutz over 2 years ago

  • Not your fault, but a method named validate_scheduling_parameters that runs as before_validation and is not part of validate is confusing. Validations shouldn't change parameter values (though it isn't technically a validation step...). Specifically, I'm not sure errors.add() does what you expect when it appears in a before_validation callback rather than in validate. Would you mind cleaning that up so the record adjustments are in before_validation and the value checks are in validate?
  • A brief comment about the intention of setting/checking the preemptable flag would be helpful because the logic is slightly convoluted.
  • Do we really want to totally disallow making top level containers preemptable, or just not assign them as preemptable by default? Seems like if it is explicitly set in the request, we should honor it.
  • It looks like CloudSizeWrapper will still use the value of "id" from the underlying NodeSize object rather than the name used in the "[Size foo]" section title. I think if you add something like size_spec['id'] = sec_words[1] in NodeManagerConfig.node_sizes() then it will use the user-supplied id.

#20 Updated by Peter Amstutz over 2 years ago

  • Is it necessary to set instance_type on CloudSizeWrapper? After using it to look up the corresponding libcloud NodeSize in NodeManagerConfig.node_sizes(), the instance_type field seems to be redundant with the real size object.
  • Additionally, the use of "instance_type" seems to be inconsistent, because when we get it from runtime constraints, it is the Arvados configuration-assigned name of the size, not the cloud provider size id.
  • In list_nodes() for ec2, azure and gce we map back from the reported instance size to our node size object (each does it in a slightly different way, of course). However, we need to start mapping back to our arvados-assigned instance type, not the cloud type. This means (a) ComputeNodeDriver.sizes should correspond to ServerCalculator.cloud_sizes (b) we need to store the arvados-assigned instance type on the node as a tag, and use that rather than the cloud's own response.

#21 Updated by Lucas Di Pentima over 2 years ago

Updates at 73872ccc5bb6b80a6049b44b0113085a9c2b6934
Test run: https://ci.curoverse.com/job/developer-run-tests/734/

Addressed comments above:
  • Cleaned up validation code on API server
  • Avoid redundant attribute instance_type on CloudSizeWrapper
  • Override CloudSizeWrapper id with config Size name
  • Set arvados_node_size tag on node creation to have a reference to the Arvados assigned node size
  • Use the newly added tag to get the Arvados assigned node size when receiving the node list

Tests are pending

#22 Updated by Peter Amstutz over 2 years ago

  • I think this is backwards, should be "child containers" or (to align more closely with the logic) "containers with parent containers".
      # If preemptable instances (eg: AWS Spot Instances) are allowed,
      # automatically ask them on non-child containers by default.
  • I don't think this is correct:
self.scheduling_parameters['preemptable'] ||= true

Because if 'preemptable' is 'false' it will be assigned 'true'. I think we want:

if Rails.configuration.preemptable_instances and !self.requesting_container_uuid.nil? and self.scheduling_parameters['preemptable'].nil?
  self.scheduling_parameters['preemptable'] = true
end
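
The difference between the two forms can be illustrated with a small Python analogue. This is a sketch only: the function names and the plain-dict representation of scheduling parameters are assumptions, and the API server itself is Ruby.

```python
def buggy_default(scheduling_parameters):
    """Analogue of `||=`: any falsy value, including False, is replaced."""
    params = dict(scheduling_parameters)
    params["preemptable"] = params.get("preemptable") or True
    return params


def apply_default_preemptable(scheduling_parameters,
                              requesting_container_uuid,
                              preemptable_instances_enabled):
    """Default to preemptable only for child containers with no explicit value."""
    params = dict(scheduling_parameters)
    is_child = requesting_container_uuid is not None
    if (preemptable_instances_enabled and is_child
            and params.get("preemptable") is None):
        params["preemptable"] = True
    return params


# An explicit False is overwritten by the buggy form but kept by the fixed one.
overwritten = buggy_default({"preemptable": False})
respected = apply_default_preemptable({"preemptable": False}, "parent-uuid", True)
defaulted = apply_default_preemptable({}, "parent-uuid", True)
top_level = apply_default_preemptable({}, None, True)
```

Checking for a missing value rather than a falsy one is what lets a request opt out of preemptable instances explicitly.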

This previous comment isn't addressed:

In list_nodes() for ec2, azure and gce we map back from the reported instance size to our node size object (each does it in a slightly different way, of course). However, we need to start mapping back to our arvados-assigned instance type, not the cloud type. This means (a) ComputeNodeDriver.sizes should correspond to ServerCalculator.cloud_sizes (b) we need to store the arvados-assigned instance type on the node as a tag, and use that rather than the cloud's own response.

I see you are setting arvados_node_size in tags, but not reading it back in list_nodes(). This is a problem because list_nodes() is used to determine whether to start or stop nodes. If we define two node types "m4.large.preemptable" and "m4.large.reserved" but list_nodes() only returns m4.large then it won't match either size.
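
The mismatch described above can be sketched with plain dictionaries. The size names are the ones from the comment; the dictionary layout and helper names are assumptions for illustration, not the node manager's data structures.

```python
# Two Arvados-assigned sizes that share one cloud instance type.
SIZES = {
    "m4.large.preemptable": {"instance_type": "m4.large", "preemptable": True},
    "m4.large.reserved": {"instance_type": "m4.large", "preemptable": False},
}


def size_from_cloud_type(cloud_type):
    """Look up by the type the cloud reports; misses because ids are size names."""
    return SIZES.get(cloud_type)


def size_from_tag(node_tags):
    """Look up by the arvados_node_size tag stored on the node at creation."""
    return SIZES.get(node_tags.get("arvados_node_size"))


by_type = size_from_cloud_type("m4.large")
by_tag = size_from_tag({"arvados_node_size": "m4.large.preemptable"})
```

Here the cloud-reported type matches neither configured size, while the tag identifies exactly one.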

#23 Updated by Peter Amstutz over 2 years ago

Followup to last comment: looking up the "arvados node size" happens in CloudNodeListMonitorActor, so that should work.

What happens if someone reconfigures the system and restarts node manager and you get back an arvados_node_size you don't recognize any more? The correct behavior in that case should be to shut the node down.

#24 Updated by Lucas Di Pentima over 2 years ago

  • Target version changed from 2018-06-06 Sprint to 2018-06-20 Sprint

#25 Updated by Peter Amstutz over 2 years ago

(04:10:32 PM) lucas: tetron: re:shutting down nodes that don't include a recognized arvados_node_size (last comment at https://dev.arvados.org/issues/7478#note-23), is it a correct approach to just call the destroy_node from CloudNodeListMonitorActor?
(04:11:35 PM) tetron: no
(04:12:24 PM) tetron: welll
(04:12:39 PM) lucas: tetron: Should I assign a proper status so that the pairing mechanism kills it or something like that?
(04:13:54 PM) tetron: if we can do that through the "I am eligible for shutdown" interaction between ComputeNodeMonitorActor and DaemonActor that would be best
(04:14:53 PM) tetron: given how much effort we've spent handling various cloud failure modes I am very hesitant to add another place where we make a cloud API call
(04:15:23 PM) tetron: because then we're back to "oops we got a weird error and now nodemanager is in a death spiral"
(04:16:08 PM) tetron: remember it does create a ComputeNodeMonitorActor for every node, paired or not
(04:16:54 PM) tetron: so it can go through the normal mechanism of discovering the node in the node list, creating a ComputeNodeMonitorActor, then have the MonitorActor decide the node shouldn't exist, and tell daemon "please shut me down"
(04:18:39 PM) lucas: ok, I was trying to kill it as soon as the size is confirmed that is not recognizable because find_size returns None and will create problems when other parts of the code try to access it, I'll look for that approach
(04:19:02 PM) tetron: that's understandable
(04:19:20 PM) tetron: maybe have an "invalid size" stand-in
(04:19:54 PM) lucas: Yes, that could work. Thanks

#26 Updated by Lucas Di Pentima over 2 years ago

Updates at 17f521d7f
Test run: https://ci.curoverse.com/job/developer-run-tests/747/

Since note-22, the updates are:

  • Updated api server CR's default preemptable setting logic as suggested
  • When a cloud node has an unrecognizable arvados_node_size tag, instead of assigning None as its .size, set an InvalidCloudSize instance, so that get_state() returns 'down' and the node gets properly shut down
  • Added tests
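
A minimal sketch of the stand-in approach follows. The names `InvalidCloudSize` and `find_size` and the id value 'invalid' appear in this thread; the class bodies, the `Size` tuple, and the `SizeCalculator` name are assumptions made for the example.

```python
from collections import namedtuple

# Hypothetical stand-in for a configured node size.
Size = namedtuple("Size", ["id", "instance_type", "preemptable"])


class InvalidCloudSize(object):
    """Placeholder for a size name that is no longer configured."""
    id = "invalid"
    name = "Invalid size"
    price = 0.0


class SizeCalculator(object):
    def __init__(self, sizes):
        self._by_id = {size.id: size for size in sizes}

    def find_size(self, size_id):
        """Return the configured size, or a stand-in instead of None."""
        return self._by_id.get(size_id, InvalidCloudSize())


calculator = SizeCalculator([Size("m4.large.spot", "m4.large", True)])
known = calculator.find_size("m4.large.spot")
unknown = calculator.find_size("size-removed-from-config")
```

Returning an object instead of None means callers that read `.id` or `.price` keep working while the node waits to be shut down.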

#27 Updated by Lucas Di Pentima over 2 years ago

Updates at b70f9ce54
Test run: https://ci.curoverse.com/job/developer-run-tests/748/

  • Fixed a GCE driver issue discovered when running integration tests.

#28 Updated by Peter Amstutz over 2 years ago

Reviewing 7478-anm-spot-instances @ b70f9ce54f1f672b423999e6c07b2f0127b76666

  • The check for "self.cloud_node.size.id == 'invalid'" should be in shutdown_eligible() instead of get_state().

Rest LGTM

#29 Updated by Lucas Di Pentima over 2 years ago

Updates at 71db70126
Test run: https://ci.curoverse.com/job/developer-run-tests/749/

Addressed the above suggestion by making shutdown_eligible() responsible for checking for an invalid cloud size. Updated test.
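
The resulting check might look roughly like the sketch below. The `MonitorSketch` class, its constructor arguments, and the (bool, reason) return shape are assumptions for illustration; only the `size.id == 'invalid'` test and the `shutdown_eligible()` name come from the review.

```python
class MonitorSketch(object):
    """Hypothetical, simplified stand-in for a per-node monitor."""

    def __init__(self, size_id, idle):
        self.size_id = size_id
        self.idle = idle

    def shutdown_eligible(self):
        """Return (eligible, reason); an invalid size is always eligible."""
        if self.size_id == "invalid":
            return (True, "node's size tag is not recognized")
        if not self.idle:
            return (False, "node is busy")
        return (True, "node is idle")


stale = MonitorSketch("invalid", idle=False).shutdown_eligible()
busy = MonitorSketch("m4.large.spot", idle=False).shutdown_eligible()
```

Keeping the decision in the eligibility check lets the node go through the normal shutdown path rather than adding another cloud API call site, as discussed in the chat above.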

#30 Updated by Lucas Di Pentima over 2 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

#31 Updated by Nico César over 2 years ago

deployed 1.1.4.20180612182441-2 and I see this error:

manage.4xphq:/etc/sv# systemctl restart arvados-node-manager  ; journalctl -u arvados-node-manager -f
-- Logs begin at Tue 2018-06-05 10:34:26 UTC. --
Jun 12 21:13:00 manage.4xphq.arvadosapi.com env[8606]: 2018-06-12 21:13:00 NodeManagerDaemonActor.a06ef1818b7a[8607] INFO: Compute Optimized Double Extra Large Instance: wishlist 0, up 0 (booting 0, unpaired 0, idle 0, busy 0), down 0, shutdown 0
Jun 12 21:13:00 manage.4xphq.arvadosapi.com env[8606]: 2018-06-12 21:13:00 NodeManagerDaemonActor.a06ef1818b7a[8607] INFO: Double Extra Large Instance: wishlist 0, up 0 (booting 0, unpaired 0, idle 0, busy 0), down 0, shutdown 0
Jun 12 21:13:00 manage.4xphq.arvadosapi.com env[8606]: 2018-06-12 21:13:00 NodeManagerDaemonActor.a06ef1818b7a[8607] INFO: Compute Optimized Extra Large Instance: wishlist 0, up 0 (booting 0, unpaired 0, idle 0, busy 0), down 0, shutdown 0
Jun 12 21:13:00 manage.4xphq.arvadosapi.com env[8606]: 2018-06-12 21:13:00 NodeManagerDaemonActor.a06ef1818b7a[8607] INFO: Extra Large Instance: wishlist 0, up 0 (booting 0, unpaired 0, idle 0, busy 0), down 0, shutdown 0
Jun 12 21:13:00 manage.4xphq.arvadosapi.com env[8606]: 2018-06-12 21:13:00 NodeManagerDaemonActor.a06ef1818b7a[8607] INFO: Extra Large Instance: wishlist 0, up 0 (booting 0, unpaired 0, idle 0, busy 0), down 0, shutdown 0
Jun 12 21:13:00 manage.4xphq.arvadosapi.com env[8606]: 2018-06-12 21:13:00 NodeManagerDaemonActor.a06ef1818b7a[8607] INFO: Compute Optimized Large Instance: wishlist 0, up 0 (booting 0, unpaired 0, idle 0, busy 0), down 0, shutdown 0
Jun 12 21:13:00 manage.4xphq.arvadosapi.com env[8606]: 2018-06-12 21:13:00 NodeManagerDaemonActor.a06ef1818b7a[8607] INFO: Large Instance: wishlist 0, up 0 (booting 0, unpaired 0, idle 0, busy 0), down 0, shutdown 0
Jun 12 21:13:00 manage.4xphq.arvadosapi.com env[8606]: 2018-06-12 21:13:00 NodeManagerDaemonActor.a06ef1818b7a[8607] INFO: Large Instance: wishlist 1, up 0 (booting 0, unpaired 0, idle 0, busy 0), down 0, shutdown 0
Jun 12 21:13:00 manage.4xphq.arvadosapi.com env[8606]: 2018-06-12 21:13:00 JobQueueMonitorActor.140274303566672[8607] INFO: got response with 1 items in 0.254546880722 seconds, next poll at 2018-06-12 21:13:10
Jun 12 21:13:00 manage.4xphq.arvadosapi.com systemd[1]: Stopping Arvados Node Manager Daemon...
Jun 12 21:13:12 manage.4xphq.arvadosapi.com systemd[1]: Stopped Arvados Node Manager Daemon.
Jun 12 21:13:12 manage.4xphq.arvadosapi.com systemd[1]: Started Arvados Node Manager Daemon.
Jun 12 21:13:12 manage.4xphq.arvadosapi.com env[11286]: No handlers could be found for logger "status.Handler" 
Jun 12 21:13:12 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:12 root[11289] INFO: /usr/bin/arvados-node-manager 1.1.4.20180612182441 started, libcloud 2.3.0
Jun 12 21:13:12 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:12 requests.packages.urllib3.connectionpool[11289] DEBUG: Starting new HTTPS connection (1): ec2.us-east-1.amazonaws.com
Jun 12 21:13:12 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:12 requests.packages.urllib3.connectionpool[11289] DEBUG: https://ec2.us-east-1.amazonaws.com:443 "GET /?SignatureVersion=2&AWSAccessKeyId=AKIAJCNUIVXKTYNJ5OSQ&Timestamp=2018-06-12T21%3A13%3A12Z&SignatureMethod=HmacSHA256&Version=2016-11-15&Signature=akHRIUej%2BbWx2kgKam9btOFiP3rhUxQ8JlYhrX4S9ZA%3D&Action=DescribeImages&Owner.1=self HTTP/1.1" 200 None
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 requests.packages.urllib3.connectionpool[11289] DEBUG: https://ec2.us-east-1.amazonaws.com:443 "GET /?SignatureVersion=2&AWSAccessKeyId=AKIAJCNUIVXKTYNJ5OSQ&Timestamp=2018-06-12T21%3A13%3A12Z&SignatureMethod=HmacSHA256&Version=2016-11-15&Signature=QeXzl46I%2BGeKbjpmHxj5ZAerIlYKol6Z3uID%2Frr864M%3D&Action=DescribeSecurityGroups HTTP/1.1" 200 None
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 requests.packages.urllib3.connectionpool[11289] DEBUG: https://ec2.us-east-1.amazonaws.com:443 "GET /?SignatureVersion=2&AWSAccessKeyId=AKIAJCNUIVXKTYNJ5OSQ&Timestamp=2018-06-12T21%3A13%3A13Z&SignatureMethod=HmacSHA256&Version=2016-11-15&Signature=uiUkBMKy%2FZB6IPuwnt1MGzbj4Od7YL4%2BZ%2FtKG9XU%2BT4%3D&Action=DescribeSubnets HTTP/1.1" 200 None
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: Using cloud node sizes:
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=m4.large, name=Large Instance, ram=8192 disk=0 bandwidth=None price=0.1 driver=Amazon EC2 ...>, 'preemptable': False, 'name': 'Large Instance', 'extra': {'cpu': 2}, 'scratch': 32000, 'price': 0.1, 'ram': 7782, 'bandwidth': None, 'cores': 2, 'disk': 0, 'id': 'm4.large'}
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=m4.large, name=Large Instance, ram=8192 disk=0 bandwidth=None price=0.1 driver=Amazon EC2 ...>, 'preemptable': True, 'name': 'Large Instance', 'extra': {'cpu': 2}, 'scratch': 32000, 'price': 0.1, 'ram': 7782, 'bandwidth': None, 'cores': 2, 'disk': 0, 'id': 'm4.large.spot'}
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=c3.large, name=Compute Optimized Large Instance, ram=3840 disk=32 bandwidth=None price=0.105 driver=Amazon EC2 ...>, 'preemptable': True, 'name': 'Compute Optimized Large Instance', 'extra': {'cpu': 2}, 'scratch': 32000, 'price': 0.105, 'ram': 3648, 'bandwidth': None, 'cores': 2, 'disk': 32, 'id': 'c3.large.spot'}
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=c3.large, name=Compute Optimized Large Instance, ram=3840 disk=32 bandwidth=None price=0.105 driver=Amazon EC2 ...>, 'preemptable': False, 'name': 'Compute Optimized Large Instance', 'extra': {'cpu': 2}, 'scratch': 32000, 'price': 0.105, 'ram': 3648, 'bandwidth': None, 'cores': 2, 'disk': 32, 'id': 'c3.large'}
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=m4.xlarge, name=Extra Large Instance, ram=16384 disk=0 bandwidth=None price=0.2 driver=Amazon EC2 ...>, 'preemptable': False, 'name': 'Extra Large Instance', 'extra': {'cpu': 4}, 'scratch': 80000, 'price': 0.2, 'ram': 15564, 'bandwidth': None, 'cores': 4, 'disk': 0, 'id': 'm4.xlarge'}
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=m4.xlarge, name=Extra Large Instance, ram=16384 disk=0 bandwidth=None price=0.2 driver=Amazon EC2 ...>, 'preemptable': True, 'name': 'Extra Large Instance', 'extra': {'cpu': 4}, 'scratch': 80000, 'price': 0.2, 'ram': 15564, 'bandwidth': None, 'cores': 4, 'disk': 0, 'id': 'm4.xlarge.spot'}
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=c3.xlarge, name=Compute Optimized Extra Large Instance, ram=7680 disk=80 bandwidth=None price=0.21 driver=Amazon EC2 ...>, 'preemptable': False, 'name': 'Compute Optimized Extra Large Instance', 'extra': {'cpu': 4}, 'scratch': 80000, 'price': 0.21, 'ram': 7296, 'bandwidth': None, 'cores': 4, 'disk': 80, 'id': 'c3.xlarge'}
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=c3.xlarge, name=Compute Optimized Extra Large Instance, ram=7680 disk=80 bandwidth=None price=0.21 driver=Amazon EC2 ...>, 'preemptable': True, 'name': 'Compute Optimized Extra Large Instance', 'extra': {'cpu': 4}, 'scratch': 80000, 'price': 0.21, 'ram': 7296, 'bandwidth': None, 'cores': 4, 'disk': 80, 'id': 'c3.xlarge.spot'}
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=m4.2xlarge, name=Double Extra Large Instance, ram=32768 disk=0 bandwidth=None price=0.4 driver=Amazon EC2 ...>, 'preemptable': False, 'name': 'Double Extra Large Instance', 'extra': {'cpu': 8}, 'scratch': 160000, 'price': 0.4, 'ram': 31129, 'bandwidth': None, 'cores': 8, 'disk': 0, 'id': 'm4.2xlarge'}
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=m4.2xlarge, name=Double Extra Large Instance, ram=32768 disk=0 bandwidth=None price=0.4 driver=Amazon EC2 ...>, 'preemptable': True, 'name': 'Double Extra Large Instance', 'extra': {'cpu': 8}, 'scratch': 160000, 'price': 0.4, 'ram': 31129, 'bandwidth': None, 'cores': 8, 'disk': 0, 'id': 'm4.2xlarge.spot'}
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=c3.2xlarge, name=Compute Optimized Double Extra Large Instance, ram=15360 disk=160 bandwidth=None price=0.42 driver=Amazon EC2 ...>, 'preemptable': True, 'name': 'Compute Optimized Double Extra Large Instance', 'extra': {'cpu': 8}, 'scratch': 160000, 'price': 0.42, 'ram': 14592, 'bandwidth': None, 'cores': 8, 'disk': 160, 'id': 'c3.2xlarge.spot'}
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=c3.2xlarge, name=Compute Optimized Double Extra Large Instance, ram=15360 disk=160 bandwidth=None price=0.42 driver=Amazon EC2 ...>, 'preemptable': False, 'name': 'Compute Optimized Double Extra Large Instance', 'extra': {'cpu': 8}, 'scratch': 160000, 'price': 0.42, 'ram': 14592, 'bandwidth': None, 'cores': 8, 'disk': 160, 'id': 'c3.2xlarge'}
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=c3.4xlarge, name=Compute Optimized Quadruple Extra Large Instance, ram=30720 disk=320 bandwidth=None price=0.84 driver=Amazon EC2 ...>, 'preemptable': True, 'name': 'Compute Optimized Quadruple Extra Large Instance', 'extra': {'cpu': 16}, 'scratch': 320000, 'price': 0.84, 'ram': 29184, 'bandwidth': None, 'cores': 16, 'disk': 320, 'id': 'c3.4xlarge.spot'}
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=c3.4xlarge, name=Compute Optimized Quadruple Extra Large Instance, ram=30720 disk=320 bandwidth=None price=0.84 driver=Amazon EC2 ...>, 'preemptable': False, 'name': 'Compute Optimized Quadruple Extra Large Instance', 'extra': {'cpu': 16}, 'scratch': 320000, 'price': 0.84, 'ram': 29184, 'bandwidth': None, 'cores': 16, 'disk': 320, 'id': 'c3.4xlarge'}
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=c3.8xlarge, name=Compute Optimized Eight Extra Large Instance, ram=61440 disk=640 bandwidth=None price=1.68 driver=Amazon EC2 ...>, 'preemptable': False, 'name': 'Compute Optimized Eight Extra Large Instance', 'extra': {'cpu': 32}, 'scratch': 640000, 'price': 1.68, 'ram': 58368, 'bandwidth': None, 'cores': 32, 'disk': 640, 'id': 'c3.8xlarge'}
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=c3.8xlarge, name=Compute Optimized Eight Extra Large Instance, ram=61440 disk=640 bandwidth=None price=1.68 driver=Amazon EC2 ...>, 'preemptable': True, 'name': 'Compute Optimized Eight Extra Large Instance', 'extra': {'cpu': 32}, 'scratch': 640000, 'price': 1.68, 'ram': 58368, 'bandwidth': None, 'cores': 32, 'disk': 640, 'id': 'c3.8xlarge.spot'}
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 pykka[11289] DEBUG: Registered TimedCallBackActor (urn:uuid:e79cfca2-e7db-4441-aaab-49fcbcee068e)
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 pykka[11289] DEBUG: Starting TimedCallBackActor (urn:uuid:e79cfca2-e7db-4441-aaab-49fcbcee068e)
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 pykka[11289] DEBUG: Registered CloudNodeListMonitorActor (urn:uuid:8a03c978-fa6e-442e-85f1-25a89ac98acb)
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 pykka[11289] DEBUG: Starting CloudNodeListMonitorActor (urn:uuid:8a03c978-fa6e-442e-85f1-25a89ac98acb)
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 pykka[11289] DEBUG: Registered ArvadosNodeListMonitorActor (urn:uuid:4e4f4b1b-add6-4a06-8439-0871117c6d41)
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 pykka[11289] DEBUG: Starting ArvadosNodeListMonitorActor (urn:uuid:4e4f4b1b-add6-4a06-8439-0871117c6d41)
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 pykka[11289] DEBUG: Registered JobQueueMonitorActor (urn:uuid:2a47f596-37a8-49d9-9e97-526f2e85e829)
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 pykka[11289] DEBUG: Starting JobQueueMonitorActor (urn:uuid:2a47f596-37a8-49d9-9e97-526f2e85e829)
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 pykka[11289] DEBUG: Registered ComputeNodeUpdateActor (urn:uuid:92794057-f151-4d7b-8366-a7928bd47f1c)
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 pykka[11289] DEBUG: Starting ComputeNodeUpdateActor (urn:uuid:92794057-f151-4d7b-8366-a7928bd47f1c)
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 JobQueueMonitorActor.140593208914768[11289] DEBUG: urn:uuid:e27ac108-d616-48d5-aef5-e1a8b77a0365 subscribed to all events
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 JobQueueMonitorActor.140593208914768[11289] DEBUG: sending request
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 ArvadosNodeListMonitorActor.140593211085648[11289] DEBUG: urn:uuid:e27ac108-d616-48d5-aef5-e1a8b77a0365 subscribed to all events
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 ArvadosNodeListMonitorActor.140593211085648[11289] DEBUG: sending request
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 CloudNodeListMonitorActor.140593232598720[11289] DEBUG: urn:uuid:e27ac108-d616-48d5-aef5-e1a8b77a0365 subscribed to all events
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 CloudNodeListMonitorActor.140593232598720[11289] DEBUG: sending request
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 requests.packages.urllib3.connectionpool[11289] DEBUG: Starting new HTTPS connection (1): ec2.us-east-1.amazonaws.com
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 pykka[11289] DEBUG: Registered NodeManagerDaemonActor (urn:uuid:e27ac108-d616-48d5-aef5-e1a8b77a0365)
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 pykka[11289] DEBUG: Starting NodeManagerDaemonActor (urn:uuid:e27ac108-d616-48d5-aef5-e1a8b77a0365)
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 pykka[11289] DEBUG: Registered WatchdogActor (urn:uuid:ca05efc5-db63-412f-b0e1-4f56bb11f6c6)
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 pykka[11289] DEBUG: Starting WatchdogActor (urn:uuid:ca05efc5-db63-412f-b0e1-4f56bb11f6c6)
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 NodeManagerDaemonActor.e1a8b77a0365[11289] DEBUG: Daemon started
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 requests.packages.urllib3.connectionpool[11289] DEBUG: https://ec2.us-east-1.amazonaws.com:443 "GET /?Filter.3.Value.1=4xphq&AWSAccessKeyId=AKIAJCNUIVXKTYNJ5OSQ&Filter.1.Name=instance-state-name&Filter.2.Value.1=dynamic-compute&SignatureMethod=HmacSHA256&Filter.3.Name=tag%3Acluster&Signature=aOZkPquswRZvn7Fx6xGIAWAxZNUhNMHho%2FqweBdq5hQ%3D&Action=DescribeInstances&Filter.1.Value.1=running&SignatureVersion=2&Timestamp=2018-06-12T21%3A13%3A13Z&Version=2016-11-15&Filter.2.Name=tag%3Aarvados-class HTTP/1.1" 200 None
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 requests.packages.urllib3.connectionpool[11289] DEBUG: https://ec2.us-east-1.amazonaws.com:443 "GET /?SignatureVersion=2&AWSAccessKeyId=AKIAJCNUIVXKTYNJ5OSQ&Timestamp=2018-06-12T21%3A13%3A13Z&SignatureMethod=HmacSHA256&Version=2016-11-15&Signature=v0y0o%2Fa%2FhUU9MvgQS75zLDv%2FUsQYHEsNJDj9zxsJpPc%3D&Action=DescribeAddresses HTTP/1.1" 200 None
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 CloudNodeListMonitorActor.140593232598720[11289] ERROR: got error: global name 'InvalidCloudSize' is not defined - will try again in 20.0 seconds
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: Traceback (most recent call last):
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]:   File "/usr/lib/python2.7/dist-packages/arvnodeman/clientactor.py", line 99, in poll
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]:     response = self._send_request()
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]:   File "/usr/lib/python2.7/dist-packages/arvnodeman/nodelist.py", line 86, in _send_request
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]:     n.size = self._calculator.find_size(n.extra['arvados_node_size'])
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]:   File "/usr/lib/python2.7/dist-packages/arvnodeman/jobqueue.py", line 142, in find_size
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]:     return InvalidCloudSize()
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: NameError: global name 'InvalidCloudSize' is not defined
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 ArvadosNodeListMonitorActor.140593211085648[11289] INFO: got response with 48 items in 0.229659795761 seconds, next poll at 2018-06-12 21:13:23
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 NodeManagerDaemonActor.e1a8b77a0365[11289] INFO: Registering new Arvados node 4xphq-7ekkf-bnkig53t8l0x1ci
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 NodeManagerDaemonActor.e1a8b77a0365[11289] INFO: Registering new Arvados node 4xphq-7ekkf-9ct5e14ouidq1x3
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 NodeManagerDaemonActor.e1a8b77a0365[11289] INFO: Registering new Arvados node 4xphq-7ekkf-k0es9pjugpjv7f0
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 NodeManagerDaemonActor.e1a8b77a0365[11289] INFO: Registering new Arvados node 4xphq-7ekkf-2fa53rvm0uaoxnl
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 NodeManagerDaemonActor.e1a8b77a0365[11289] INFO: Registering new Arvados node 4xphq-7ekkf-tbrex80emflesql
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 NodeManagerDaemonActor.e1a8b77a0365[11289] INFO: Registering new Arvados node 4xphq-7ekkf-xyhjrnam94g23h1
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 NodeManagerDaemonActor.e1a8b77a0365[11289] INFO: Registering new Arvados node 4xphq-7ekkf-jbgfjgqgefs6dzl
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 NodeManagerDaemonActor.e1a8b77a0365[11289] INFO: Registering new Arvados node 4xphq-7ekkf-lg95nmgds6bdb4d
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 NodeManagerDaemonActor.e1a8b77a0365[11289] INFO: Registering new Arvados node 4xphq-7ekkf-0446wy2b6ofp838
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 NodeManagerDaemonActor.e1a8b77a0365[11289] INFO: Registering new Arvados node 4xphq-7ekkf-spv6sghrbe05g5i
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 NodeManagerDaemonActor.e1a8b77a0365[11289] INFO: Registering new Arvados node 4xphq-7ekkf-gym16nks2abf1c2

#32 Updated by Nico César over 2 years ago

After monkeypatching:

Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: 2018-06-12 21:20:36 ComputeNodeMonitorActor.ba3fafcf2920.compute2.4xphq.arvadosapi.com[16677] DEBUG: Suggesting shutdown because node's size tag 'None' not recognizable
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: 2018-06-12 21:20:36 NodeManagerDaemonActor.3b78803a4fc8[16677] INFO: Cloud node compute2.4xphq.arvadosapi.com is now paired with Arvados node 4xphq-7ekkf-bnkig53t8l0x1ci with hostname compute2
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: 2018-06-12 21:20:36 ArvadosNodeListMonitorActor.139825145078608[16677] DEBUG: urn:uuid:c8b286db-e3b8-4c82-9dc8-ba3fafcf2920 subscribed to events for '4xphq-7ekkf-bnkig53t8l0x1ci'
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: 2018-06-12 21:20:36 ComputeNodeMonitorActor.e6ea226582fa.compute1.4xphq.arvadosapi.com[16677] DEBUG: Suggesting shutdown because node's size tag 'None' not recognizable
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: 2018-06-12 21:20:36 NodeManagerDaemonActor.3b78803a4fc8[16677] INFO: Cloud node compute1.4xphq.arvadosapi.com is now paired with Arvados node 4xphq-7ekkf-k0es9pjugpjv7f0 with hostname compute1
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: 2018-06-12 21:20:36 ArvadosNodeListMonitorActor.139825145078608[16677] DEBUG: urn:uuid:29b88fe4-700f-41f6-807e-e6ea226582fa subscribed to events for '4xphq-7ekkf-k0es9pjugpjv7f0'
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: 2018-06-12 21:20:36 ComputeNodeMonitorActor.ab0072e44ed9.compute3.4xphq.arvadosapi.com[16677] DEBUG: Suggesting shutdown because node's size tag 'None' not recognizable
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: 2018-06-12 21:20:36 NodeManagerDaemonActor.3b78803a4fc8[16677] INFO: Cloud node compute3.4xphq.arvadosapi.com is now paired with Arvados node 4xphq-7ekkf-9ct5e14ouidq1x3 with hostname compute3
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: 2018-06-12 21:20:36 ArvadosNodeListMonitorActor.139825145078608[16677] DEBUG: urn:uuid:a201ef36-87cc-4f9f-abb9-ab0072e44ed9 subscribed to events for '4xphq-7ekkf-9ct5e14ouidq1x3'
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: scontrol: error: Weight value (9999999000) is greater than 4294967280
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: No changes specified
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: 2018-06-12 21:20:36 ComputeNodeUpdateActor.bedafdd32e4c[16677] ERROR: SLURM update ['scontrol', 'update', u'NodeName=compute2', 'Weight=9999999000', 'Features=instancetype=invalid'] failed
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: Traceback (most recent call last):
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]:   File "/usr/lib/python2.7/dist-packages/arvnodeman/computenode/dispatch/slurm.py", line 26, in _update_slurm_node
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]:     subprocess.check_output(cmd)
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]:   File "/usr/lib/python2.7/subprocess.py", line 219, in check_output
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]:     raise CalledProcessError(retcode, cmd, output=output)
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: CalledProcessError: Command '['scontrol', 'update', u'NodeName=compute2', 'Weight=9999999000', 'Features=instancetype=invalid']' returned non-zero exit status 1
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: scontrol: error: Weight value (9999999000) is greater than 4294967280
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: No changes specified
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: 2018-06-12 21:20:36 ComputeNodeUpdateActor.bedafdd32e4c[16677] ERROR: SLURM update ['scontrol', 'update', u'NodeName=compute1', 'Weight=9999999000', 'Features=instancetype=invalid'] failed
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: Traceback (most recent call last):
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]:   File "/usr/lib/python2.7/dist-packages/arvnodeman/computenode/dispatch/slurm.py", line 26, in _update_slurm_node
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]:     subprocess.check_output(cmd)
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]:   File "/usr/lib/python2.7/subprocess.py", line 219, in check_output
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]:     raise CalledProcessError(retcode, cmd, output=output)
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: CalledProcessError: Command '['scontrol', 'update', u'NodeName=compute1', 'Weight=9999999000', 'Features=instancetype=invalid']' returned non-zero exit status 1
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: scontrol: error: Weight value (9999999000) is greater than 4294967280
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: No changes specified
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: 2018-06-12 21:20:36 ComputeNodeUpdateActor.bedafdd32e4c[16677] ERROR: SLURM update ['scontrol', 'update', u'NodeName=compute3', 'Weight=9999999000', 'Features=instancetype=invalid'] failed
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: Traceback (most recent call last):
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]:   File "/usr/lib/python2.7/dist-packages/arvnodeman/computenode/dispatch/slurm.py", line 26, in _update_slurm_node
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]:     subprocess.check_output(cmd)
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]:   File "/usr/lib/python2.7/subprocess.py", line 219, in check_output
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]:     raise CalledProcessError(retcode, cmd, output=output)
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: CalledProcessError: Command '['scontrol', 'update', u'NodeName=compute3', 'Weight=9999999000', 'Features=instancetype=invalid']' returned non-zero exit status 1

#33 Updated by Nico César over 2 years ago

I manually applied the puppet branch 7478-spot-instances-4xphq to 4xphq and disabled puppet.

We're testing this with Lucas.

#34 Updated by Lucas Di Pentima over 2 years ago

Updates at 115a5e886 - branch 7478-invalid-size-not-defined
Test run: https://ci.curoverse.com/job/developer-run-tests/750/

  • Fixes InvalidCloudSize instantiation
  • Fixes arvados_node_size tag retrieval
  • Adds node size information to the logs when referring to a size by name.
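
For reference, the NameError in the traceback above is what Python raises when a method refers to a nested class by its bare name. Below is a minimal sketch of that failure mode and one way around it. It assumes the placeholder is a class nested inside the size calculator, which the log does not confirm; `ServerCalculator`, its constructor and the `cloud_sizes` mapping are hypothetical stand-ins, not the code in branch 7478-invalid-size-not-defined.

```python
class ServerCalculator(object):
    """Hypothetical, trimmed-down stand-in for the size calculator."""

    class InvalidCloudSize(object):
        """Placeholder returned for nodes whose size tag is not recognized."""
        id = 'invalid'
        name = 'invalid'

    def __init__(self, sizes):
        # sizes: mapping of size id -> size object (assumed shape).
        self.cloud_sizes = dict(sizes)

    def find_size(self, size_id):
        try:
            return self.cloud_sizes[size_id]
        except KeyError:
            # Qualify the name with self. A bare InvalidCloudSize() inside a
            # method is looked up as a global and raises NameError, because
            # class bodies are not enclosing scopes for their methods.
            return self.InvalidCloudSize()


calc = ServerCalculator({'m4.large': 'm4.large size object'})
known = calc.find_size('m4.large')
unknown = calc.find_size(None)  # e.g. a node without an arvados_node_size tag
```

With this shape, an unrecognized tag yields a placeholder object instead of an exception, so the node list poll can continue.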

I believe the scontrol error message is related to stopping unrecognized nodes.
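
Whatever the cause, scontrol rejects the command because the weight exceeds the limit it prints (4294967280). A minimal sketch of clamping the weight before building the command follows. The limit is copied from the log message; the `slurm_update_cmd` helper and the clamping itself are assumptions for illustration, not something any branch on this ticket does.

```python
# Upper bound quoted by scontrol in the log above.
SLURM_MAX_WEIGHT = 4294967280


def slurm_update_cmd(nodename, weight, features):
    """Build an scontrol argument list with the weight clamped to the limit.

    Hypothetical helper: the argument layout mirrors the logged command.
    """
    safe_weight = max(0, min(int(weight), SLURM_MAX_WEIGHT))
    return ['scontrol', 'update',
            'NodeName=' + nodename,
            'Weight=%i' % safe_weight,
            'Features=' + features]


cmd = slurm_update_cmd('compute2', 9999999000, 'instancetype=invalid')
```

The resulting list can be passed to subprocess.check_output() as in the traceback. Clamping only avoids the rejected command; it does not address why the node ended up with an invalid size.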

I did some more testing running normal (not preemptable) CRs on 4xphq and it seems that it's working OK. Just in case, I left nodemanager stopped.

I also added spot sizes to the 4xphq c-d-s config to match those already added to nodemanager.

Pending: test spot instance creation. Before enabling spot instances for child containers on the API server, we can add preemptable = true to any "non-spot" cloud size on nodemanager, for example m4.large, and run something while keeping an eye on the AWS console. If that is successful, we can enable the API server's preemptable_instances = true configuration and check that child containers get their scheduling parameter as expected.
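
As a rough illustration of the size sections involved in that test, here is a minimal sketch that reads `[Size <name>]` sections with Python 3's configparser. The `instance_type` and `preemptable` attributes and the False default come from the story description; the `m4.large.spot` section name, the `read_sizes` helper and the fallback of `instance_type` to the section name are assumptions, and nodemanager's real config loader may differ.

```python
import configparser

# Illustrative config fragment; the section names are examples only.
EXAMPLE = """
[Size m4.large]

[Size m4.large.spot]
instance_type = m4.large
preemptable = true
"""


def read_sizes(text):
    """Return {size name: {'instance_type': ..., 'preemptable': ...}}."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    sizes = {}
    for section in parser.sections():
        if not section.startswith('Size '):
            continue
        name = section.split(None, 1)[1]
        sizes[name] = {
            # Assumption: fall back to the section name when unset.
            'instance_type': parser.get(section, 'instance_type', fallback=name),
            # Per the story description, preemptable defaults to False.
            'preemptable': parser.getboolean(section, 'preemptable', fallback=False),
        }
    return sizes


sizes = read_sizes(EXAMPLE)
```

For the test described above, the only change needed on an existing size is the `preemptable = true` line.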

#35 Updated by Nico César over 2 years ago

Review at 115a5e8861ef0a46224b2cd64568b30c884908fb: this looks like a good bugfix to me.

Ready to merge.

#36 Updated by Lucas Di Pentima over 2 years ago

Following tests with Nico, we discovered an error in how nodemanager's libcloud dependencies are set. I'll make a new branch for that.

#37 Updated by Lucas Di Pentima over 2 years ago

Updates at 089b68192 - branch 7478-anm-libcloud-deps-fix
Test run: https://ci.curoverse.com/job/developer-run-tests/751/

Updated nodemanager's install dependency to use the libcloud fork with spot instance support.

#38 Updated by Nico César over 2 years ago

Review at 089b68192 - branch 7478-anm-libcloud-deps-fix

LGTM

#39 Updated by Lucas Di Pentima over 2 years ago

Branch 7478-s-preemptable-preemptible - a8bfbac31
Test run: https://ci.curoverse.com/job/developer-run-tests/766/

As suggested by Tom, replaced the term 'preemptable' with 'preemptible'.
Also added spot instance configuration and documentation to nodemanager's EC2 example config file.

#40 Updated by Tom Clegg over 2 years ago

LGTM

#41 Updated by Lucas Di Pentima over 2 years ago

Branch 7478-auto-preemptible-cr-fix - 36da5d97f623f0c2c944829ca8410a3bea388b19
Test run: https://ci.curoverse.com/job/developer-run-tests/770/

API server wasn't automatically adding the preemptible scheduling parameter on child container requests when 'Rails.configuration.preemptible_instances = true' because of a callback ordering issue.

#42 Updated by Lucas Di Pentima over 2 years ago

Further testing on 4xphq shows that when the CR has the preemptible=true scheduling parameter, c-d-s isn't requesting the correct instance type, seemingly ignoring this parameter.

#43 Updated by Lucas Di Pentima over 2 years ago

  • Related to Bug #13649: c-d-s doesn't request a preemptible instance when it should added

#44 Updated by Peter Amstutz over 2 years ago

Lucas Di Pentima wrote:

Branch 7478-auto-preemptible-cr-fix - 36da5d97f623f0c2c944829ca8410a3bea388b19
Test run: https://ci.curoverse.com/job/developer-run-tests/770/

API server wasn't automatically adding the preemptible scheduling parameter on child container requests when 'Rails.configuration.preemptible_instances = true' because of a callback ordering issue.

Specifically, :set_default_preemptible_scheduling_parameter would run before :set_requesting_container_uuid, when it needs to run after it.

  • I don't understand what the test changes have to do with the callback ordering change
  • Seems like an opportunity to write the test that would have detected the mistake in the first place

#45 Updated by Lucas Di Pentima over 2 years ago

Rebased and tried again: 29e80f471f1d70d1d1eda43b05e0f2e059564509
Test run: https://ci.curoverse.com/job/developer-run-tests/772/

As discussed on chat, moved both the set_requesting_container_uuid and set_default_preemptible_scheduling_parameter callbacks to run on before_save, and added an extra check to set_requesting_container_uuid to avoid reassigning the field, so that both cases are taken into account:
  • Create CR, and later change state to Committed
  • Create CR with state=Committed

Added test for the newly fixed case.
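
To make the ordering dependency concrete, here is a toy Python model of the two callbacks. It is not the API server's Rails code. The callback names come from this ticket; the `ContainerRequest` class, its constructor arguments, the condition inside each callback and the placeholder UUIDs are assumptions made for illustration.

```python
class ContainerRequest(object):
    """Toy model of a container request; not the API server's real class."""

    def __init__(self, current_container_uuid=None, preemptible_enabled=True):
        self.requesting_container_uuid = None
        self.scheduling_parameters = {}
        self.current_container_uuid = current_container_uuid
        self.preemptible_enabled = preemptible_enabled

    def set_requesting_container_uuid(self):
        # Extra check: never reassign a field that is already set.
        if self.requesting_container_uuid is None:
            self.requesting_container_uuid = self.current_container_uuid

    def set_default_preemptible_scheduling_parameter(self):
        # Assumption: only child requests (with a requesting container)
        # get the preemptible default.
        if self.preemptible_enabled and self.requesting_container_uuid is not None:
            self.scheduling_parameters.setdefault('preemptible', True)

    # Order matters: the uuid must be set before the default is evaluated.
    before_save = (set_requesting_container_uuid,
                   set_default_preemptible_scheduling_parameter)

    def save(self, callbacks=None):
        for callback in (callbacks or self.before_save):
            callback(self)


child = ContainerRequest(current_container_uuid='zzzzz-dz642-000000000000000')
child.save()

misordered = ContainerRequest(current_container_uuid='zzzzz-dz642-000000000000000')
misordered.save(callbacks=tuple(reversed(ContainerRequest.before_save)))
```

In this model `child` ends up with the preemptible parameter set, while `misordered` does not, because its default was evaluated before the requesting container was known.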

#46 Updated by Tom Morris over 2 years ago

  • Release set to 13
