Idea #7478


[Node Manager] Creates compute nodes using AWS spot instances

Added by Brett Smith over 8 years ago. Updated over 5 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Node Manager
Target version:
Story points: 3.0
Release:
Release relationship: Auto

Description

Functional requirements:

  • Requests spot instances, waits for those requests to be fulfilled (minutes?), and launches the instances as compute nodes (see the request sketch after this list).
  • For the initial implementation, just bid the standard price rather than trying to design a fancy bidding strategy. We'll still get the cost benefit as long as the spot price is lower.
  • When the bid price is exceeded (hopefully rarely/never), we're likely to lose our entire fleet of compute instances and, perhaps, not be able to start any until demand subsides enough to cause the spot prices to go down. In this scenario, we'll need some configuration knobs to control whether to fall back to on-demand instances, wait for spot instances to become available again, etc.
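
As a concrete illustration of the request flow above, here is a minimal sketch of booting a spot node through the patched libcloud EC2 driver. The ex_spot_market flag is the parameter named in the implementation notes below (added in the Arvados libcloud fork); the credentials, image id, and instance type are placeholders, not a definitive implementation.

# Hedged sketch: assumes the libcloud fork described in this ticket exposes
# an ex_spot_market flag on create_node; values below are placeholders.
from libcloud.compute.types import Provider
from libcloud.compute.providers import get_driver

Driver = get_driver(Provider.EC2)
driver = Driver('ACCESS_KEY', 'SECRET_KEY', region='us-east-1')

size = [s for s in driver.list_sizes() if s.id == 'm4.large'][0]
image = driver.get_image('ami-00000000')  # placeholder compute node image

# With no explicit bid, the spot request is capped at the on-demand price,
# which matches the "bid the standard price" strategy above.
node = driver.create_node(name='compute-spot-demo', size=size, image=image,
                          ex_spot_market=True)
print(node.id, node.state)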

Implementation details:

  • Enhance libcloud to support AWS spot instances. (Done)
  • API server will have a config option which specifies whether spot instances are enabled or not. If they are enabled, child containers will get created with the spot instances scheduling parameter set.
  • Spot instances will be their own instance type. Node manager needs to manage instance types separately from the libcloud-specified instance types it currently uses. Node manager will use the new libcloud support to request spot instances when needed. No arvados-cwl-runner required.
  • Nodemanager spot instance handling:
    • [Size <name>] sections in the config currently use instance types as <name>: decouple that by adding an instance_type attribute inside the section, leaving <name> for descriptive purposes only (see the config sketch after this list)
    • Each size section will have a boolean “preemptable” attribute, defaulting to False.
    • Update ServerCalculator & related code so that the instance type is not the unique id of a "nodesize"
    • Update the ec2 driver to pass the ex_spot_market=True parameter on the libcloud create_node call
  • Update documentation explaining nodemanager config file format changes
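
To make the proposed config-format change concrete, here is a hedged sketch (the instance_type and preemptable attribute names come from this ticket; everything else is illustrative, not the actual NodeManagerConfig code) of two [Size ...] sections sharing one EC2 instance type, and of how a node_sizes()-style parser could keep the section name as the unique id while reading the cloud type from instance_type:

# Hedged sketch of the proposed config parsing; not the real nodemanager code.
import configparser

CONFIG_TEXT = """
[Size m4.large]
instance_type = m4.large
preemptable = False
cores = 2
scratch = 32000

[Size m4.large.spot]
instance_type = m4.large
preemptable = True
cores = 2
scratch = 32000
"""

def node_sizes(text):
    parser = configparser.ConfigParser()
    parser.read_string(text)
    sizes = {}
    for section in parser.sections():
        sec_words = section.split(None, 1)
        if sec_words[0] != 'Size':
            continue
        spec = dict(parser[section])
        spec['id'] = sec_words[1]   # the section name is the unique id...
        spec['preemptable'] = parser[section].getboolean('preemptable', fallback=False)
        sizes[spec['id']] = spec    # ...while instance_type stays the cloud type
    return sizes

print(node_sizes(CONFIG_TEXT)['m4.large.spot']['instance_type'])  # -> m4.large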

Subtasks (0 open, 1 closed)

Task #13461: Review 7478-anm-spot-instances (Resolved, Peter Amstutz, 05/25/2018)

Related issues

Related to Arvados - Bug #13649: c-d-s doesn't request a preemptible instance when it should (Resolved, Lucas Di Pentima, 06/21/2018)
Blocked by Arvados - Idea #13051: Spike - Investigate/prototype AWS spot instance support in libcloud (Resolved, Lucas Di Pentima, 04/18/2018)
Actions #1

Updated by Tom Morris over 6 years ago

  • Subject changed from [Node Manager] Creates compute nodes from spot instances to [Node Manager] Creates compute nodes using AWS spot instances
  • Description updated (diff)
  • Target version set to To Be Groomed
Actions #3

Updated by Tom Morris over 6 years ago

  • Tracker changed from Bug to Idea
Actions #4

Updated by Tom Morris about 6 years ago

Although there's no support in libcloud, it is available in boto, which might be another option: http://boto.cloudhackers.com/en/latest/ref/ec2.html

Actions #5

Updated by Lucas Di Pentima about 6 years ago

Actions #6

Updated by Tom Morris about 6 years ago

We'll pursue the libcloud implementation option and implement spot instances using the default bid price (i.e. the on-demand price).

API server will have a config option which specifies whether spot instances are enabled or not. If they are enabled, child containers will get created with the spot instances scheduling parameter set.

Spot instances will be their own instance type. Node manager needs to manage instance types separately from the libcloud-specified instance types it currently uses. Node manager will use the new libcloud support to request spot instances when needed. No arvados-cwl-runner required.
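
For reference, a hedged sketch of a container request that carries the scheduling parameter explicitly. It assumes the standard Arvados Python SDK client; the image, command, and resource values are illustrative, and the parameter was later renamed to preemptible (see note-39).

# Hedged sketch: submitting a container request with the preemptable
# scheduling parameter set. Assumes the standard Arvados Python SDK;
# the image, command, and resource values are illustrative only.
import arvados

api = arvados.api('v1')
cr = api.container_requests().create(body={
    "container_request": {
        "name": "spot instance demo",
        "state": "Committed",
        "container_image": "arvados/jobs",
        "command": ["echo", "hello"],
        "output_path": "/out",
        "runtime_constraints": {"vcpus": 1, "ram": 1 << 30},
        # When spot instances are enabled cluster-wide, the API server sets
        # this automatically on child containers; a client can also ask for
        # it explicitly, as here.
        "scheduling_parameters": {"preemptable": True},
    }}).execute()
print(cr["uuid"])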

Actions #7

Updated by Tom Morris about 6 years ago

  • Blocked by Idea #13051: Spike - Investigate/prototype AWS spot instance support in libcloud added
Actions #8

Updated by Tom Morris about 6 years ago

  • Story points set to 5.0
Actions #9

Updated by Lucas Di Pentima about 6 years ago

  • Description updated (diff)
Actions #10

Updated by Tom Morris about 6 years ago

  • Target version changed from To Be Groomed to Arvados Future Sprints
Actions #11

Updated by Tom Morris almost 6 years ago

  • Description updated (diff)
Actions #12

Updated by Lucas Di Pentima almost 6 years ago

Nodemanager refactoring/updates:

  • Nodemanager spot instance handling:
    • [Size <name>] sections in the config currently use instance types as <name>: decouple that by adding an instance_type attribute inside the section, leaving <name> for descriptive purposes only
    • Each size section will have a boolean “preemptable” attribute, defaulting to False.
    • Update ServerCalculator & related code so that the instance type is not the unique id of a "nodesize"
    • Update the ec2 driver to pass the ex_spot_market=True parameter on the libcloud create_node call
  • Update documentation explaining nodemanager config file format changes
  • Tests
Actions #13

Updated by Lucas Di Pentima almost 6 years ago

  • Description updated (diff)
  • Story points changed from 5.0 to 3.0
Actions #14

Updated by Tom Morris almost 6 years ago

  • Target version changed from Arvados Future Sprints to 2018-05-23 Sprint
Actions #15

Updated by Lucas Di Pentima almost 6 years ago

  • Assigned To set to Lucas Di Pentima
Actions #16

Updated by Lucas Di Pentima almost 6 years ago

  • Status changed from New to In Progress
Actions #17

Updated by Lucas Di Pentima almost 6 years ago

  • Target version changed from 2018-05-23 Sprint to 2018-06-06 Sprint
Actions #18

Updated by Lucas Di Pentima almost 6 years ago

Updates at 3950ffc94 - Branch 7478-anm-spot-instances

  • Updated libcloud version dependency to use our fork with AWS Spot Instances support
  • Added support for a preemptable scheduling parameter on the API server
  • Added support on Go SDK & dispatchcloud
  • Modified nodemanager to detach node size from instance types, adding the preemptable parameter.
  • Updated the EC2 driver to check for the preemptable parameter and ask for Spot instances when needed.

I'm hopeful that propagating node size metadata by passing the CloudSizeWrapper object is a good approach. Unit tests are failing because of this (I don't want to start correcting them before confirming it's a good approach), but integration tests are passing.

Actions #19

Updated by Peter Amstutz almost 6 years ago

  • Not your fault, but a method named validate_scheduling_parameters that is before_validation and not part of validate is confusing. Validations shouldn't change parameter values (but it isn't technically a validation step...). Specifically, I'm not sure if errors.add() does what you expect when it appears in a before_validation rather than a validate. Would you mind cleaning that up so the record adjustments are in before_validation and the value checks are in validate?
  • A brief comment about the intention of setting/checking the preemptable flag would be helpful because the logic is slightly convoluted.
  • Do we really want to totally disallow making top level containers preemptable, or just not assign them as preemptable by default? Seems like if it is explicitly set in the request, we should honor it.
  • It looks like CloudSizeWrapper will still use the value of "id" from the underlying NodeSize object rather than the name used in the "[Size foo]" section title. I think if you add something like size_spec['id'] = sec_words[1] in NodeManagerConfig.node_sizes() then it will use the user-supplied id.
Actions #20

Updated by Peter Amstutz almost 6 years ago

  • Is it necessary to set instance_type on CloudSizeWrapper? After using it to look up the corresponding libcloud NodeSize in NodeManagerConfig.node_sizes(), the instance_type field seems to be redundant with the real size object.
  • Additionally, the use of "instance_type" seems to be inconsistent, because when we get it from runtime constraints, it is the Arvados configuration-assigned name of the size, not the cloud provider size id.
  • In list_nodes() for ec2, azure and gce we map back from the reported instance size to our node size object (each does it in a slightly different way, of course). However, we need to start mapping back to our arvados-assigned instance type, not the cloud type. This means (a) ComputeNodeDriver.sizes should correspond to ServerCalculator.cloud_sizes (b) we need to store the arvados-assigned instance type on the node as a tag, and use that rather than the cloud's own response.
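
A hedged sketch of the tag-based mapping described above: record the Arvados-assigned size id as a tag when the node is created, then resolve that tag (rather than the cloud's reported instance type) in list_nodes(). The arvados_node_size name comes from this ticket; the function bodies are illustrative, not the actual ComputeNodeDriver code.

# Hedged sketch of mapping cloud nodes back to Arvados-assigned sizes via a tag.
# 'arvados_node_size' is the tag name used in this ticket; the rest is illustrative.

def create_node(driver, arvados_size, **kwargs):
    # Record which Arvados size (e.g. "m4.large.spot") this node was booted as,
    # since the cloud will only report the raw instance type ("m4.large").
    kwargs.setdefault('ex_metadata', {})['arvados_node_size'] = arvados_size.id
    return driver.create_node(size=arvados_size.real, **kwargs)

def list_nodes(driver, server_calculator):
    nodes = driver.list_nodes()
    for node in nodes:
        # Each cloud reports tags slightly differently; EC2 exposes them
        # under node.extra['tags'] in libcloud.
        tag = node.extra.get('tags', {}).get('arvados_node_size')
        # Resolve the Arvados size from the tag, not from the cloud's own
        # instance type, so "m4.large" and "m4.large.spot" stay distinct.
        node.size = server_calculator.find_size(tag)
    return nodes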
Actions #21

Updated by Lucas Di Pentima almost 6 years ago

Updates at 73872ccc5bb6b80a6049b44b0113085a9c2b6934
Test run: https://ci.curoverse.com/job/developer-run-tests/734/

Addressed comments above:
  • Cleaned up validation code on API server
  • Avoid redundant attribute instance_type on CloudSizeWrapper
  • Override CloudSizeWrapper id with config Size name
  • Set arvados_node_size tag on node creation to have a reference to the Arvados assigned node size
  • Use the newly added tag to get the Arvados assigned node size when receiving the node list

Tests are pending

Actions #22

Updated by Peter Amstutz almost 6 years ago

  • I think this is backwards, should be "child containers" or (to align more closely with the logic) "containers with parent containers".
      # If preemptable instances (eg: AWS Spot Instances) are allowed,
      # automatically ask them on non-child containers by default.
  • I don't think this is correct:
self.scheduling_parameters['preemptable'] ||= true

Because if 'preemptable' is 'false' it will be assigned 'true'. I think we want:

if Rails.configuration.preemptable_instances and !self.requesting_container_uuid.nil? and self.scheduling_parameters['preemptable'].nil?
  self.scheduling_parameters['preemptable'] = true
end

This previous comment isn't addressed:

In list_nodes() for ec2, azure and gce we map back from the reported instance size to our node size object (each does it in a slightly different way, of course). However, we need to start mapping back to our arvados-assigned instance type, not the cloud type. This means (a) ComputeNodeDriver.sizes should correspond to ServerCalculator.cloud_sizes (b) we need to store the arvados-assigned instance type on the node as a tag, and use that rather than the cloud's own response.

I see you are setting arvados_node_size in tags, but not reading it back in list_nodes(). This is a problem because list_nodes() is used to determine whether to start or stop nodes. If we define two node types "m4.large.preemptable" and "m4.large.reserved" but list_nodes() only returns m4.large then it won't match either size.

Actions #23

Updated by Peter Amstutz almost 6 years ago

Follow-up to the last comment: looking up the "arvados node size" happens in CloudNodeListMonitorActor, so that should work.

What happens if someone reconfigures the system and restarts node manager and you get back an arvados_node_size you don't recognize any more? The correct behavior in that case should be to shut the node down.

Actions #24

Updated by Lucas Di Pentima almost 6 years ago

  • Target version changed from 2018-06-06 Sprint to 2018-06-20 Sprint
Actions #25

Updated by Peter Amstutz almost 6 years ago

(04:10:32 PM) lucas: tetron: re:shutting down nodes that don't include a recognized arvados_node_size (last comment at https://dev.arvados.org/issues/7478#note-23), is it a correct approach to just call the destroy_node from CloudNodeListMonitorActor?
(04:11:35 PM) tetron: no
(04:12:24 PM) tetron: welll
(04:12:39 PM) lucas: tetron: Should I assign a proper status so that the pairing mechanism kills it or something like that?
(04:13:54 PM) tetron: if we can do that through the "I am eligible for shutdown" interaction between ComputeNodeMonitorActor and DaemonActor that would be best
(04:14:53 PM) tetron: given how much effort we've spent handling various cloud failure modes I am very hesitant to add another place where we make a cloud API call
(04:15:23 PM) tetron: because then we're back to "oops we got a weird error and now nodemanager is in a death spiral"
(04:16:08 PM) tetron: remember it does create a ComputeNodeMonitorActor for every node, paired or not
(04:16:54 PM) tetron: so it can go through the normal mechanism of discovering the node in the node list, creating a ComputeNodeMonitorActor, then have the MonitorActor decide the node shouldn't exist, and tell daemon "please shut me down"
(04:18:39 PM) lucas: ok, I was trying to kill it as soon as the size is confirmed that is not recognizable because find_size returns None and will create problems when other parts of the code try to access it, I'll look for that approach
(04:19:02 PM) tetron: that's understandable
(04:19:20 PM) tetron: maybe have an "invalid size" stand-in
(04:19:54 PM) lucas: Yes, that could work. Thanks

Actions #26

Updated by Lucas Di Pentima almost 6 years ago

Updates at 17f521d7f
Test run: https://ci.curoverse.com/job/developer-run-tests/747/

Since note-22, the updates are:

  • Updated API server CR's default preemptable setting logic as suggested
  • When a cloud node has an unrecognizable arvados_node_size tag, instead of assigning None as its .size, set an InvalidCloudSize instance, so that get_state() returns 'down' and the node gets properly shut down
  • Added tests
Actions #27

Updated by Lucas Di Pentima almost 6 years ago

Updates at b70f9ce54
Test run: https://ci.curoverse.com/job/developer-run-tests/748/

  • Fixed a GCE driver issue discovered when running integration tests.
Actions #28

Updated by Peter Amstutz almost 6 years ago

Reviewing 7478-anm-spot-instances @ b70f9ce54f1f672b423999e6c07b2f0127b76666

  • The check for "self.cloud_node.size.id == 'invalid'" should be in shutdown_eligible() instead of get_state().

Rest LGTM

Actions #29

Updated by Lucas Di Pentima almost 6 years ago

Updates at 71db70126
Test run: https://ci.curoverse.com/job/developer-run-tests/749/

Addressed the above suggestions, making shutdown_eligible() responsible for checking for an invalid cloud size. Updated test.
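
For clarity, a hedged sketch of the invalid-size stand-in as it ends up after this review round: find_size() returns an InvalidCloudSize marker instead of None, and shutdown_eligible() treats that marker as grounds for shutdown. The class and method names follow the discussion above; the bodies are illustrative, not the actual branch code.

# Hedged sketch of the "invalid size" stand-in discussed above; class and
# method names follow the ticket, the bodies are illustrative only.

class InvalidCloudSize(object):
    """Stand-in returned when a node's arvados_node_size tag is unknown."""
    id = 'invalid'
    name = 'invalid'
    price = 0

class ServerCalculator(object):
    def __init__(self, cloud_sizes):
        self.cloud_sizes = {size.id: size for size in cloud_sizes}

    def find_size(self, size_id):
        # Never return None: an unrecognized id (e.g. after a config change)
        # gets the stand-in, so callers can still read .id and .name safely.
        return self.cloud_sizes.get(size_id) or InvalidCloudSize()

class ComputeNodeMonitorActor(object):
    def __init__(self, cloud_node):
        self.cloud_node = cloud_node

    def shutdown_eligible(self):
        # Per note-28, the invalid-size check lives here rather than in
        # get_state(): the node is reported eligible for shutdown and the
        # daemon shuts it down through the normal mechanism.
        if self.cloud_node.size.id == 'invalid':
            return (True, "node's size tag not recognizable")
        return (False, 'node is within its idle window')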

Actions #30

Updated by Lucas Di Pentima almost 6 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100
Actions #31

Updated by Nico César almost 6 years ago

Deployed 1.1.4.20180612182441-2 and I see this error:

manage.4xphq:/etc/sv# systemctl restart arvados-node-manager  ; journalctl -u arvados-node-manager -f
-- Logs begin at Tue 2018-06-05 10:34:26 UTC. --
Jun 12 21:13:00 manage.4xphq.arvadosapi.com env[8606]: 2018-06-12 21:13:00 NodeManagerDaemonActor.a06ef1818b7a[8607] INFO: Compute Optimized Double Extra Large Instance: wishlist 0, up 0 (booting 0, unpaired 0, idle 0, busy 0), down 0, shutdown 0
Jun 12 21:13:00 manage.4xphq.arvadosapi.com env[8606]: 2018-06-12 21:13:00 NodeManagerDaemonActor.a06ef1818b7a[8607] INFO: Double Extra Large Instance: wishlist 0, up 0 (booting 0, unpaired 0, idle 0, busy 0), down 0, shutdown 0
Jun 12 21:13:00 manage.4xphq.arvadosapi.com env[8606]: 2018-06-12 21:13:00 NodeManagerDaemonActor.a06ef1818b7a[8607] INFO: Compute Optimized Extra Large Instance: wishlist 0, up 0 (booting 0, unpaired 0, idle 0, busy 0), down 0, shutdown 0
Jun 12 21:13:00 manage.4xphq.arvadosapi.com env[8606]: 2018-06-12 21:13:00 NodeManagerDaemonActor.a06ef1818b7a[8607] INFO: Extra Large Instance: wishlist 0, up 0 (booting 0, unpaired 0, idle 0, busy 0), down 0, shutdown 0
Jun 12 21:13:00 manage.4xphq.arvadosapi.com env[8606]: 2018-06-12 21:13:00 NodeManagerDaemonActor.a06ef1818b7a[8607] INFO: Extra Large Instance: wishlist 0, up 0 (booting 0, unpaired 0, idle 0, busy 0), down 0, shutdown 0
Jun 12 21:13:00 manage.4xphq.arvadosapi.com env[8606]: 2018-06-12 21:13:00 NodeManagerDaemonActor.a06ef1818b7a[8607] INFO: Compute Optimized Large Instance: wishlist 0, up 0 (booting 0, unpaired 0, idle 0, busy 0), down 0, shutdown 0
Jun 12 21:13:00 manage.4xphq.arvadosapi.com env[8606]: 2018-06-12 21:13:00 NodeManagerDaemonActor.a06ef1818b7a[8607] INFO: Large Instance: wishlist 0, up 0 (booting 0, unpaired 0, idle 0, busy 0), down 0, shutdown 0
Jun 12 21:13:00 manage.4xphq.arvadosapi.com env[8606]: 2018-06-12 21:13:00 NodeManagerDaemonActor.a06ef1818b7a[8607] INFO: Large Instance: wishlist 1, up 0 (booting 0, unpaired 0, idle 0, busy 0), down 0, shutdown 0
Jun 12 21:13:00 manage.4xphq.arvadosapi.com env[8606]: 2018-06-12 21:13:00 JobQueueMonitorActor.140274303566672[8607] INFO: got response with 1 items in 0.254546880722 seconds, next poll at 2018-06-12 21:13:10
Jun 12 21:13:00 manage.4xphq.arvadosapi.com systemd[1]: Stopping Arvados Node Manager Daemon...
Jun 12 21:13:12 manage.4xphq.arvadosapi.com systemd[1]: Stopped Arvados Node Manager Daemon.
Jun 12 21:13:12 manage.4xphq.arvadosapi.com systemd[1]: Started Arvados Node Manager Daemon.
Jun 12 21:13:12 manage.4xphq.arvadosapi.com env[11286]: No handlers could be found for logger "status.Handler" 
Jun 12 21:13:12 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:12 root[11289] INFO: /usr/bin/arvados-node-manager 1.1.4.20180612182441 started, libcloud 2.3.0
Jun 12 21:13:12 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:12 requests.packages.urllib3.connectionpool[11289] DEBUG: Starting new HTTPS connection (1): ec2.us-east-1.amazonaws.com
Jun 12 21:13:12 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:12 requests.packages.urllib3.connectionpool[11289] DEBUG: https://ec2.us-east-1.amazonaws.com:443 "GET /?SignatureVersion=2&AWSAccessKeyId=AKIAJCNUIVXKTYNJ5OSQ&Timestamp=2018-06-12T21%3A13%3A12Z&SignatureMethod=HmacSHA256&Version=2016-11-15&Signature=akHRIUej%2BbWx2kgKam9btOFiP3rhUxQ8JlYhrX4S9ZA%3D&Action=DescribeImages&Owner.1=self HTTP/1.1" 200 None
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 requests.packages.urllib3.connectionpool[11289] DEBUG: https://ec2.us-east-1.amazonaws.com:443 "GET /?SignatureVersion=2&AWSAccessKeyId=AKIAJCNUIVXKTYNJ5OSQ&Timestamp=2018-06-12T21%3A13%3A12Z&SignatureMethod=HmacSHA256&Version=2016-11-15&Signature=QeXzl46I%2BGeKbjpmHxj5ZAerIlYKol6Z3uID%2Frr864M%3D&Action=DescribeSecurityGroups HTTP/1.1" 200 None
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 requests.packages.urllib3.connectionpool[11289] DEBUG: https://ec2.us-east-1.amazonaws.com:443 "GET /?SignatureVersion=2&AWSAccessKeyId=AKIAJCNUIVXKTYNJ5OSQ&Timestamp=2018-06-12T21%3A13%3A13Z&SignatureMethod=HmacSHA256&Version=2016-11-15&Signature=uiUkBMKy%2FZB6IPuwnt1MGzbj4Od7YL4%2BZ%2FtKG9XU%2BT4%3D&Action=DescribeSubnets HTTP/1.1" 200 None
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: Using cloud node sizes:
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=m4.large, name=Large Instance, ram=8192 disk=0 bandwidth=None price=0.1 driver=Amazon EC2 ...>, 'preemptable': False, 'name': 'Large Instance', 'extra': {'cpu': 2}, 'scratch': 32000, 'price': 0.1, 'ram': 7782, 'bandwidth': None, 'cores': 2, 'disk': 0, 'id': 'm4.large'}
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=m4.large, name=Large Instance, ram=8192 disk=0 bandwidth=None price=0.1 driver=Amazon EC2 ...>, 'preemptable': True, 'name': 'Large Instance', 'extra': {'cpu': 2}, 'scratch': 32000, 'price': 0.1, 'ram': 7782, 'bandwidth': None, 'cores': 2, 'disk': 0, 'id': 'm4.large.spot'}
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=c3.large, name=Compute Optimized Large Instance, ram=3840 disk=32 bandwidth=None price=0.105 driver=Amazon EC2 ...>, 'preemptable': True, 'name': 'Compute Optimized Large Instance', 'extra': {'cpu': 2}, 'scratch': 32000, 'price': 0.105, 'ram': 3648, 'bandwidth': None, 'cores': 2, 'disk': 32, 'id': 'c3.large.spot'}
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=c3.large, name=Compute Optimized Large Instance, ram=3840 disk=32 bandwidth=None price=0.105 driver=Amazon EC2 ...>, 'preemptable': False, 'name': 'Compute Optimized Large Instance', 'extra': {'cpu': 2}, 'scratch': 32000, 'price': 0.105, 'ram': 3648, 'bandwidth': None, 'cores': 2, 'disk': 32, 'id': 'c3.large'}
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=m4.xlarge, name=Extra Large Instance, ram=16384 disk=0 bandwidth=None price=0.2 driver=Amazon EC2 ...>, 'preemptable': False, 'name': 'Extra Large Instance', 'extra': {'cpu': 4}, 'scratch': 80000, 'price': 0.2, 'ram': 15564, 'bandwidth': None, 'cores': 4, 'disk': 0, 'id': 'm4.xlarge'}
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=m4.xlarge, name=Extra Large Instance, ram=16384 disk=0 bandwidth=None price=0.2 driver=Amazon EC2 ...>, 'preemptable': True, 'name': 'Extra Large Instance', 'extra': {'cpu': 4}, 'scratch': 80000, 'price': 0.2, 'ram': 15564, 'bandwidth': None, 'cores': 4, 'disk': 0, 'id': 'm4.xlarge.spot'}
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=c3.xlarge, name=Compute Optimized Extra Large Instance, ram=7680 disk=80 bandwidth=None price=0.21 driver=Amazon EC2 ...>, 'preemptable': False, 'name': 'Compute Optimized Extra Large Instance', 'extra': {'cpu': 4}, 'scratch': 80000, 'price': 0.21, 'ram': 7296, 'bandwidth': None, 'cores': 4, 'disk': 80, 'id': 'c3.xlarge'}
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=c3.xlarge, name=Compute Optimized Extra Large Instance, ram=7680 disk=80 bandwidth=None price=0.21 driver=Amazon EC2 ...>, 'preemptable': True, 'name': 'Compute Optimized Extra Large Instance', 'extra': {'cpu': 4}, 'scratch': 80000, 'price': 0.21, 'ram': 7296, 'bandwidth': None, 'cores': 4, 'disk': 80, 'id': 'c3.xlarge.spot'}
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=m4.2xlarge, name=Double Extra Large Instance, ram=32768 disk=0 bandwidth=None price=0.4 driver=Amazon EC2 ...>, 'preemptable': False, 'name': 'Double Extra Large Instance', 'extra': {'cpu': 8}, 'scratch': 160000, 'price': 0.4, 'ram': 31129, 'bandwidth': None, 'cores': 8, 'disk': 0, 'id': 'm4.2xlarge'}
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=m4.2xlarge, name=Double Extra Large Instance, ram=32768 disk=0 bandwidth=None price=0.4 driver=Amazon EC2 ...>, 'preemptable': True, 'name': 'Double Extra Large Instance', 'extra': {'cpu': 8}, 'scratch': 160000, 'price': 0.4, 'ram': 31129, 'bandwidth': None, 'cores': 8, 'disk': 0, 'id': 'm4.2xlarge.spot'}
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=c3.2xlarge, name=Compute Optimized Double Extra Large Instance, ram=15360 disk=160 bandwidth=None price=0.42 driver=Amazon EC2 ...>, 'preemptable': True, 'name': 'Compute Optimized Double Extra Large Instance', 'extra': {'cpu': 8}, 'scratch': 160000, 'price': 0.42, 'ram': 14592, 'bandwidth': None, 'cores': 8, 'disk': 160, 'id': 'c3.2xlarge.spot'}
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=c3.2xlarge, name=Compute Optimized Double Extra Large Instance, ram=15360 disk=160 bandwidth=None price=0.42 driver=Amazon EC2 ...>, 'preemptable': False, 'name': 'Compute Optimized Double Extra Large Instance', 'extra': {'cpu': 8}, 'scratch': 160000, 'price': 0.42, 'ram': 14592, 'bandwidth': None, 'cores': 8, 'disk': 160, 'id': 'c3.2xlarge'}
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=c3.4xlarge, name=Compute Optimized Quadruple Extra Large Instance, ram=30720 disk=320 bandwidth=None price=0.84 driver=Amazon EC2 ...>, 'preemptable': True, 'name': 'Compute Optimized Quadruple Extra Large Instance', 'extra': {'cpu': 16}, 'scratch': 320000, 'price': 0.84, 'ram': 29184, 'bandwidth': None, 'cores': 16, 'disk': 320, 'id': 'c3.4xlarge.spot'}
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=c3.4xlarge, name=Compute Optimized Quadruple Extra Large Instance, ram=30720 disk=320 bandwidth=None price=0.84 driver=Amazon EC2 ...>, 'preemptable': False, 'name': 'Compute Optimized Quadruple Extra Large Instance', 'extra': {'cpu': 16}, 'scratch': 320000, 'price': 0.84, 'ram': 29184, 'bandwidth': None, 'cores': 16, 'disk': 320, 'id': 'c3.4xlarge'}
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=c3.8xlarge, name=Compute Optimized Eight Extra Large Instance, ram=61440 disk=640 bandwidth=None price=1.68 driver=Amazon EC2 ...>, 'preemptable': False, 'name': 'Compute Optimized Eight Extra Large Instance', 'extra': {'cpu': 32}, 'scratch': 640000, 'price': 1.68, 'ram': 58368, 'bandwidth': None, 'cores': 32, 'disk': 640, 'id': 'c3.8xlarge'}
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=c3.8xlarge, name=Compute Optimized Eight Extra Large Instance, ram=61440 disk=640 bandwidth=None price=1.68 driver=Amazon EC2 ...>, 'preemptable': True, 'name': 'Compute Optimized Eight Extra Large Instance', 'extra': {'cpu': 32}, 'scratch': 640000, 'price': 1.68, 'ram': 58368, 'bandwidth': None, 'cores': 32, 'disk': 640, 'id': 'c3.8xlarge.spot'}
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 pykka[11289] DEBUG: Registered TimedCallBackActor (urn:uuid:e79cfca2-e7db-4441-aaab-49fcbcee068e)
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 pykka[11289] DEBUG: Starting TimedCallBackActor (urn:uuid:e79cfca2-e7db-4441-aaab-49fcbcee068e)
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 pykka[11289] DEBUG: Registered CloudNodeListMonitorActor (urn:uuid:8a03c978-fa6e-442e-85f1-25a89ac98acb)
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 pykka[11289] DEBUG: Starting CloudNodeListMonitorActor (urn:uuid:8a03c978-fa6e-442e-85f1-25a89ac98acb)
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 pykka[11289] DEBUG: Registered ArvadosNodeListMonitorActor (urn:uuid:4e4f4b1b-add6-4a06-8439-0871117c6d41)
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 pykka[11289] DEBUG: Starting ArvadosNodeListMonitorActor (urn:uuid:4e4f4b1b-add6-4a06-8439-0871117c6d41)
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 pykka[11289] DEBUG: Registered JobQueueMonitorActor (urn:uuid:2a47f596-37a8-49d9-9e97-526f2e85e829)
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 pykka[11289] DEBUG: Starting JobQueueMonitorActor (urn:uuid:2a47f596-37a8-49d9-9e97-526f2e85e829)
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 pykka[11289] DEBUG: Registered ComputeNodeUpdateActor (urn:uuid:92794057-f151-4d7b-8366-a7928bd47f1c)
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 pykka[11289] DEBUG: Starting ComputeNodeUpdateActor (urn:uuid:92794057-f151-4d7b-8366-a7928bd47f1c)
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 JobQueueMonitorActor.140593208914768[11289] DEBUG: urn:uuid:e27ac108-d616-48d5-aef5-e1a8b77a0365 subscribed to all events
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 JobQueueMonitorActor.140593208914768[11289] DEBUG: sending request
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 ArvadosNodeListMonitorActor.140593211085648[11289] DEBUG: urn:uuid:e27ac108-d616-48d5-aef5-e1a8b77a0365 subscribed to all events
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 ArvadosNodeListMonitorActor.140593211085648[11289] DEBUG: sending request
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 CloudNodeListMonitorActor.140593232598720[11289] DEBUG: urn:uuid:e27ac108-d616-48d5-aef5-e1a8b77a0365 subscribed to all events
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 CloudNodeListMonitorActor.140593232598720[11289] DEBUG: sending request
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 requests.packages.urllib3.connectionpool[11289] DEBUG: Starting new HTTPS connection (1): ec2.us-east-1.amazonaws.com
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 pykka[11289] DEBUG: Registered NodeManagerDaemonActor (urn:uuid:e27ac108-d616-48d5-aef5-e1a8b77a0365)
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 pykka[11289] DEBUG: Starting NodeManagerDaemonActor (urn:uuid:e27ac108-d616-48d5-aef5-e1a8b77a0365)
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 pykka[11289] DEBUG: Registered WatchdogActor (urn:uuid:ca05efc5-db63-412f-b0e1-4f56bb11f6c6)
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 pykka[11289] DEBUG: Starting WatchdogActor (urn:uuid:ca05efc5-db63-412f-b0e1-4f56bb11f6c6)
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 NodeManagerDaemonActor.e1a8b77a0365[11289] DEBUG: Daemon started
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 requests.packages.urllib3.connectionpool[11289] DEBUG: https://ec2.us-east-1.amazonaws.com:443 "GET /?Filter.3.Value.1=4xphq&AWSAccessKeyId=AKIAJCNUIVXKTYNJ5OSQ&Filter.1.Name=instance-state-name&Filter.2.Value.1=dynamic-compute&SignatureMethod=HmacSHA256&Filter.3.Name=tag%3Acluster&Signature=aOZkPquswRZvn7Fx6xGIAWAxZNUhNMHho%2FqweBdq5hQ%3D&Action=DescribeInstances&Filter.1.Value.1=running&SignatureVersion=2&Timestamp=2018-06-12T21%3A13%3A13Z&Version=2016-11-15&Filter.2.Name=tag%3Aarvados-class HTTP/1.1" 200 None
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 requests.packages.urllib3.connectionpool[11289] DEBUG: https://ec2.us-east-1.amazonaws.com:443 "GET /?SignatureVersion=2&AWSAccessKeyId=AKIAJCNUIVXKTYNJ5OSQ&Timestamp=2018-06-12T21%3A13%3A13Z&SignatureMethod=HmacSHA256&Version=2016-11-15&Signature=v0y0o%2Fa%2FhUU9MvgQS75zLDv%2FUsQYHEsNJDj9zxsJpPc%3D&Action=DescribeAddresses HTTP/1.1" 200 None
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 CloudNodeListMonitorActor.140593232598720[11289] ERROR: got error: global name 'InvalidCloudSize' is not defined - will try again in 20.0 seconds
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: Traceback (most recent call last):
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]:   File "/usr/lib/python2.7/dist-packages/arvnodeman/clientactor.py", line 99, in poll
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]:     response = self._send_request()
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]:   File "/usr/lib/python2.7/dist-packages/arvnodeman/nodelist.py", line 86, in _send_request
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]:     n.size = self._calculator.find_size(n.extra['arvados_node_size'])
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]:   File "/usr/lib/python2.7/dist-packages/arvnodeman/jobqueue.py", line 142, in find_size
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]:     return InvalidCloudSize()
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: NameError: global name 'InvalidCloudSize' is not defined
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 ArvadosNodeListMonitorActor.140593211085648[11289] INFO: got response with 48 items in 0.229659795761 seconds, next poll at 2018-06-12 21:13:23
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 NodeManagerDaemonActor.e1a8b77a0365[11289] INFO: Registering new Arvados node 4xphq-7ekkf-bnkig53t8l0x1ci
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 NodeManagerDaemonActor.e1a8b77a0365[11289] INFO: Registering new Arvados node 4xphq-7ekkf-9ct5e14ouidq1x3
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 NodeManagerDaemonActor.e1a8b77a0365[11289] INFO: Registering new Arvados node 4xphq-7ekkf-k0es9pjugpjv7f0
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 NodeManagerDaemonActor.e1a8b77a0365[11289] INFO: Registering new Arvados node 4xphq-7ekkf-2fa53rvm0uaoxnl
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 NodeManagerDaemonActor.e1a8b77a0365[11289] INFO: Registering new Arvados node 4xphq-7ekkf-tbrex80emflesql
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 NodeManagerDaemonActor.e1a8b77a0365[11289] INFO: Registering new Arvados node 4xphq-7ekkf-xyhjrnam94g23h1
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 NodeManagerDaemonActor.e1a8b77a0365[11289] INFO: Registering new Arvados node 4xphq-7ekkf-jbgfjgqgefs6dzl
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 NodeManagerDaemonActor.e1a8b77a0365[11289] INFO: Registering new Arvados node 4xphq-7ekkf-lg95nmgds6bdb4d
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 NodeManagerDaemonActor.e1a8b77a0365[11289] INFO: Registering new Arvados node 4xphq-7ekkf-0446wy2b6ofp838
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 NodeManagerDaemonActor.e1a8b77a0365[11289] INFO: Registering new Arvados node 4xphq-7ekkf-spv6sghrbe05g5i
Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 NodeManagerDaemonActor.e1a8b77a0365[11289] INFO: Registering new Arvados node 4xphq-7ekkf-gym16nks2abf1c2
Actions #32

Updated by Nico César almost 6 years ago

After the monkeypatch:

Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: 2018-06-12 21:20:36 ComputeNodeMonitorActor.ba3fafcf2920.compute2.4xphq.arvadosapi.com[16677] DEBUG: Suggesting shutdown because node's size tag 'None' not recognizable
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: 2018-06-12 21:20:36 NodeManagerDaemonActor.3b78803a4fc8[16677] INFO: Cloud node compute2.4xphq.arvadosapi.com is now paired with Arvados node 4xphq-7ekkf-bnkig53t8l0x1ci with hostname compute2
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: 2018-06-12 21:20:36 ArvadosNodeListMonitorActor.139825145078608[16677] DEBUG: urn:uuid:c8b286db-e3b8-4c82-9dc8-ba3fafcf2920 subscribed to events for '4xphq-7ekkf-bnkig53t8l0x1ci'
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: 2018-06-12 21:20:36 ComputeNodeMonitorActor.e6ea226582fa.compute1.4xphq.arvadosapi.com[16677] DEBUG: Suggesting shutdown because node's size tag 'None' not recognizable
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: 2018-06-12 21:20:36 NodeManagerDaemonActor.3b78803a4fc8[16677] INFO: Cloud node compute1.4xphq.arvadosapi.com is now paired with Arvados node 4xphq-7ekkf-k0es9pjugpjv7f0 with hostname compute1
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: 2018-06-12 21:20:36 ArvadosNodeListMonitorActor.139825145078608[16677] DEBUG: urn:uuid:29b88fe4-700f-41f6-807e-e6ea226582fa subscribed to events for '4xphq-7ekkf-k0es9pjugpjv7f0'
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: 2018-06-12 21:20:36 ComputeNodeMonitorActor.ab0072e44ed9.compute3.4xphq.arvadosapi.com[16677] DEBUG: Suggesting shutdown because node's size tag 'None' not recognizable
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: 2018-06-12 21:20:36 NodeManagerDaemonActor.3b78803a4fc8[16677] INFO: Cloud node compute3.4xphq.arvadosapi.com is now paired with Arvados node 4xphq-7ekkf-9ct5e14ouidq1x3 with hostname compute3
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: 2018-06-12 21:20:36 ArvadosNodeListMonitorActor.139825145078608[16677] DEBUG: urn:uuid:a201ef36-87cc-4f9f-abb9-ab0072e44ed9 subscribed to events for '4xphq-7ekkf-9ct5e14ouidq1x3'
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: scontrol: error: Weight value (9999999000) is greater than 4294967280
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: No changes specified
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: 2018-06-12 21:20:36 ComputeNodeUpdateActor.bedafdd32e4c[16677] ERROR: SLURM update ['scontrol', 'update', u'NodeName=compute2', 'Weight=9999999000', 'Features=instancetype=invalid'] failed
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: Traceback (most recent call last):
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]:   File "/usr/lib/python2.7/dist-packages/arvnodeman/computenode/dispatch/slurm.py", line 26, in _update_slurm_node
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]:     subprocess.check_output(cmd)
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]:   File "/usr/lib/python2.7/subprocess.py", line 219, in check_output
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]:     raise CalledProcessError(retcode, cmd, output=output)
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: CalledProcessError: Command '['scontrol', 'update', u'NodeName=compute2', 'Weight=9999999000', 'Features=instancetype=invalid']' returned non-zero exit status 1
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: scontrol: error: Weight value (9999999000) is greater than 4294967280
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: No changes specified
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: 2018-06-12 21:20:36 ComputeNodeUpdateActor.bedafdd32e4c[16677] ERROR: SLURM update ['scontrol', 'update', u'NodeName=compute1', 'Weight=9999999000', 'Features=instancetype=invalid'] failed
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: Traceback (most recent call last):
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]:   File "/usr/lib/python2.7/dist-packages/arvnodeman/computenode/dispatch/slurm.py", line 26, in _update_slurm_node
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]:     subprocess.check_output(cmd)
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]:   File "/usr/lib/python2.7/subprocess.py", line 219, in check_output
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]:     raise CalledProcessError(retcode, cmd, output=output)
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: CalledProcessError: Command '['scontrol', 'update', u'NodeName=compute1', 'Weight=9999999000', 'Features=instancetype=invalid']' returned non-zero exit status 1
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: scontrol: error: Weight value (9999999000) is greater than 4294967280
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: No changes specified
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: 2018-06-12 21:20:36 ComputeNodeUpdateActor.bedafdd32e4c[16677] ERROR: SLURM update ['scontrol', 'update', u'NodeName=compute3', 'Weight=9999999000', 'Features=instancetype=invalid'] failed
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: Traceback (most recent call last):
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]:   File "/usr/lib/python2.7/dist-packages/arvnodeman/computenode/dispatch/slurm.py", line 26, in _update_slurm_node
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]:     subprocess.check_output(cmd)
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]:   File "/usr/lib/python2.7/subprocess.py", line 219, in check_output
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]:     raise CalledProcessError(retcode, cmd, output=output)
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: CalledProcessError: Command '['scontrol', 'update', u'NodeName=compute3', 'Weight=9999999000', 'Features=instancetype=invalid']' returned non-zero exit status 1
Actions #33

Updated by Nico César almost 6 years ago

I manually applied puppet branch 7478-spot-instances-4xphq into 4xphq and disabled puppet.

We're testing this with Lucas.

Actions #34

Updated by Lucas Di Pentima almost 6 years ago

Updates at 115a5e886 - branch 7478-invalid-size-not-defined
Test run: https://ci.curoverse.com/job/developer-run-tests/750/

  • Fixes InvalidCloudSize instantiation
  • Fixes arvados_node_size tag retrieval
  • Adds node size related information on logs when referring to a size by name.

I believe the scontrol error message is related to stopping unrecognized nodes.

I did some more testing running normal (not preemptable) CRs on 4xphq and it seems that it's working OK. Just in case, I left nodemanager stopped.

I also added spot sizes on 4xphq c-d-s config to match those already added to nodemanager.

Pending: test spot instance creation. Before enabling spot instances on child containers on the API server, we can add preemptable = true to any "non-spot" cloud size on nodemanager, for example m4.large, and run something while keeping an eye on the AWS console. If that is successful, we could enable the API server's preemptable_instances = true configuration and check that child containers get their scheduling parameter as expected.

Actions #35

Updated by Nico César almost 6 years ago

Review at 115a5e8861ef0a46224b2cd64568b30c884908fb: this looks like a good bugfix to me.

Ready to merge.

Actions #36

Updated by Lucas Di Pentima almost 6 years ago

Following tests with Nico, we discovered an error in how nodemanager's libcloud dependencies are set up. I'll make a new branch for that.

Actions #37

Updated by Lucas Di Pentima almost 6 years ago

Updates at 089b68192 - branch 7478-anm-libcloud-deps-fix
Test run: https://ci.curoverse.com/job/developer-run-tests/751/

Updated nodemanager's install dependency to use the libcloud fork with spot instance support.

Actions #38

Updated by Nico César almost 6 years ago

Review at 089b68192 - branch 7478-anm-libcloud-deps-fix

LGTM

Actions #39

Updated by Lucas Di Pentima almost 6 years ago

Branch 7478-s-preemptable-preemptible - a8bfbac31
Test run: https://ci.curoverse.com/job/developer-run-tests/766/

As suggested by Tom, replaced the term 'preemptable' with 'preemptible'.
Also added configuration & documentation for spot instances to nodemanager's EC2 example config file.

Actions #40

Updated by Tom Clegg almost 6 years ago

LGTM

Actions #41

Updated by Lucas Di Pentima almost 6 years ago

Branch 7478-auto-preemptible-cr-fix - 36da5d97f623f0c2c944829ca8410a3bea388b19
Test run: https://ci.curoverse.com/job/developer-run-tests/770/

API server wasn't automatically adding the preemptible scheduling parameter on child container requests when 'Rails.configuration.preemptible_instances = true' because of a callback ordering issue.

Actions #42

Updated by Lucas Di Pentima almost 6 years ago

Further testing on 4xphq shows that when the CR has the preemptible=true scheduling parameter, c-d-s isn't requesting the correct instance type, seemingly ignoring this parameter.

Actions #43

Updated by Lucas Di Pentima almost 6 years ago

  • Related to Bug #13649: c-d-s doesn't request a preemptible instance when it should added
Actions #44

Updated by Peter Amstutz almost 6 years ago

Lucas Di Pentima wrote:

Branch 7478-auto-preemptible-cr-fix - 36da5d97f623f0c2c944829ca8410a3bea388b19
Test run: https://ci.curoverse.com/job/developer-run-tests/770/

API server wasn't automatically adding the preemptible scheduling parameter on child container requests when 'Rails.configuration.preemptible_instances = true' because of a callback ordering issue.

Specifically, :set_default_preemptible_scheduling_parameter would run before :set_requesting_container_uuid when it needs to run after it.

  • I don't understand what the test changes have to do with the callback ordering change
  • Seems like an opportunity to write the test that would have detected the mistake in the first place
Actions #45

Updated by Lucas Di Pentima almost 6 years ago

Rebased and tried again: 29e80f471f1d70d1d1eda43b05e0f2e059564509
Test run: https://ci.curoverse.com/job/developer-run-tests/772/

As discussed on chat, moved both the set_requesting_container_uuid and set_default_preemptible_scheduling_parameter callbacks to run on before_save, adding an extra check on set_requesting_container_uuid to avoid reassigning the field, so that both cases are taken into account:
  • Create CR, and later change state to Committed
  • Create CR with state=Committed

Added test for the newly fixed case.

Actions #46

Updated by Tom Morris over 5 years ago

  • Release set to 13