Idea #7478
[Node Manager] Creates compute nodes using AWS spot instances (closed)
Description
Functional requirements:
- Requests spot instances, waits for those requests to be fulfilled (minutes?) and launches the instances as compute nodes.
- For the initial implementation, just bid the standard price rather than trying to design a fancy bidding strategy. We'll still get the cost benefit as long as the spot price is lower.
- When the bid price is exceeded (hopefully rarely/never), we're likely to lose our entire fleet of compute instances and, perhaps, not be able to start any until demand subsides enough to cause the spot prices to go down. In this scenario, we'll need some configuration knobs to control whether to fall back to on-demand instances, wait for spot instances to become available again, etc.
Implementation details:
- Enhance libcloud to support AWS spot instances. (Done)
- API server will have a config option which specifies whether spot instances are enabled or not. If they are enabled, child containers will get created with the spot instances scheduling parameter set.
- Spot instances will be their own instance type. Node manager needs to manage instance types separately from the libcloud-specified instance types that it currently uses. Node manager will use the new libcloud support to request spot instances when needed. No arvados-cwl-runner changes required.
- Nodemanager spot instance handling:
- [Size <name>] sections in the config currently use instance types as <name>: decouple that by adding an instance_type attribute inside the section, leaving <name> for description purposes only (see the example config section at the end of this list).
- Each size section will have a boolean “preemptable” attribute, defaulting to False.
- Update ServerCalculator & related code so that the instance type is not the unique id of a "nodesize"
- Update the ec2 driver to pass the ex_spot_market=True parameter on the libcloud create_node call
- Update documentation explaining nodemanager config file format changes
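For illustration, a hypothetical decoupled size section could look like this (the section name and the cores/scratch values are made up for the example; instance_type and preemptable are the new attributes described above):

[Size m4.large.spot]
instance_type = m4.large
preemptable = true
cores = 2
scratch = 32000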
Related issues
Updated by Tom Morris almost 7 years ago
- Subject changed from [Node Manager] Creates compute nodes from spot instances to [Node Manager] Creates compute nodes using AWS spot instances
- Description updated (diff)
- Target version set to To Be Groomed
Updated by Tom Morris over 6 years ago
Although there's no support in libcloud, it is available in boto, which might be another option: http://boto.cloudhackers.com/en/latest/ref/ec2.html
Updated by Lucas Di Pentima over 6 years ago
- Using Boto3: http://boto3.readthedocs.io/en/latest/index.html
  - Pros:
    - Full-fledged AWS library with spot support (http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.Client.request_spot_instances); see the sketch at the end of this comment
    - Seems to be the “official” python library (AWS docs point to it from their documentation site)
  - Cons:
    - Its integration into nodemanager may complicate the code further
    - Additional dependency
- Expanding libcloud (maybe reusing https://github.com/muccg/libcloud-drivers (Apache licensed) - didn’t get to test it yet, but they’re just a few lines of code):
  - Pros:
    - It’s supposedly easy, as mentioned on the mailing list (although the message is a bit old): https://mail-archives.apache.org/mod_mbox/libcloud-dev/201106.mbox/%3CBANLkTinzMApt5EggweEuooX2siFERbuSvQ@mail.gmail.com%3E
    - Would fit with the rest of nodemanager’s mechanics
    - The Spot API is designed to be similar to the On Demand API (https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_RequestSpotInstances.html & https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-requests.html)
  - Cons:
    - No one seems to have tried integrating muccg's prototype into libcloud before; is that a sign of trouble ahead?
    - Didn’t get to read the Spot docs too deeply, but maybe their internals changed over time and have diverged from what libcloud does with the EC2 driver.
- My opinion: we should time-box a test to see if libcloud can be made to work with these APIs. If that's possible, I think it will take less effort than adding Boto3, and we would also be contributing to a project that we’re already invested in.
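For reference, a minimal sketch of what the Boto3 route above would look like (illustrative only, not code from any branch; the AMI, key name and subnet values are placeholders):

import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')
response = ec2.request_spot_instances(
    SpotPrice='0.10',  # bid roughly the on-demand price, per the plan above
    InstanceCount=1,
    Type='one-time',
    LaunchSpecification={
        'ImageId': 'ami-00000000',      # placeholder compute image
        'InstanceType': 'm4.large',
        'KeyName': 'example-key',       # placeholder
        'SubnetId': 'subnet-00000000',  # placeholder
    },
)
# The request must then be polled (describe_spot_instance_requests) until it
# is fulfilled and an InstanceId is assigned.
print(response['SpotInstanceRequests'][0]['SpotInstanceRequestId'])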
Updated by Tom Morris over 6 years ago
We'll pursue the libcloud implementation option and implement spot instances using the default bid price (i.e. the on-demand price).
API server will have a config option which specifies whether spot instances are enabled or not. If they are enabled, child containers will get created with the spot instances scheduling parameter set.
Spot instances will be their own instance type. Node manager needs to manage instance types separately from the libcloud-specified instance types that it currently uses. Node manager will use the new libcloud support to request spot instances when needed. No arvados-cwl-runner changes required.
Updated by Tom Morris over 6 years ago
- Blocked by Idea #13051: Spike - Investigate/prototype AWS spot instance support in libcloud added
Updated by Tom Morris over 6 years ago
- Target version changed from To Be Groomed to Arvados Future Sprints
Updated by Lucas Di Pentima over 6 years ago
Nodemanager refactoring/updates:
- Nodemanager spot instance handling:
- [Size <name>] sections in the config currently use instance types as <name>: decouple that by adding an instance_type attribute inside the section, leaving <name> for description purposes only.
- Each size section will have a boolean “preemptable” attribute, defaulting to False.
- Update ServerCalculator & related code so that the instance type is not the unique id of a "nodesize"
- Update the ec2 driver to pass the ex_spot_market=True parameter on the libcloud create_node call (see the sketch after this list)
- Update documentation explaining nodemanager config file format changes
- Tests
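A rough sketch of the ec2 driver change (helper names are approximations, not the actual driver code; it assumes the size wrapper exposes a preemptable flag plus the underlying libcloud NodeSize as .real, and that the patched libcloud accepts ex_spot_market on create_node):

def create_node(self, size, arvados_node):
    # Build the usual create_node arguments (hypothetical helper name).
    kwargs = self._create_node_kwargs(size, arvados_node)
    if getattr(size, 'preemptable', False):
        # Ask EC2 for a spot instance instead of an on-demand one.
        kwargs['ex_spot_market'] = True
    # self.real is the underlying libcloud driver, size.real the libcloud NodeSize.
    return self.real.create_node(size=size.real, **kwargs)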
Updated by Lucas Di Pentima over 6 years ago
- Description updated (diff)
- Story points changed from 5.0 to 3.0
Updated by Tom Morris over 6 years ago
- Target version changed from Arvados Future Sprints to 2018-05-23 Sprint
Updated by Lucas Di Pentima over 6 years ago
- Assigned To set to Lucas Di Pentima
Updated by Lucas Di Pentima over 6 years ago
- Status changed from New to In Progress
Updated by Lucas Di Pentima over 6 years ago
- Target version changed from 2018-05-23 Sprint to 2018-06-06 Sprint
Updated by Lucas Di Pentima over 6 years ago
Updates at 3950ffc94 - Branch 7478-anm-spot-instances
- Updated libcloud version dependency to use our fork with AWS Spot Instances support
- Added support for a preemptable scheduling parameter on the API server
- Added support on Go SDK & dispatchcloud
- Modified nodemanager to detach node size from instance types, adding the preemptable parameter
- Updated the EC2 driver to check for the preemptable parameter and ask for Spot instances when needed
I'm hopeful that propagating node size metadata by passing the CloudSizeWrapper object is a good approach. Unit tests are failing because of this (I don't want to start correcting them before confirming that's a good approach), but integration tests are passing.
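To make the idea concrete, a minimal sketch of such a wrapper (attribute names mirror the size dictionaries logged later in this ticket, e.g. real, preemptable, id; this is not the actual CloudSizeWrapper implementation):

class CloudSizeWrapper(object):
    def __init__(self, real_size, size_spec):
        self.real = real_size                        # underlying libcloud NodeSize
        self.id = size_spec.get('id', real_size.id)  # Arvados-assigned size name
        self.name = real_size.name
        self.price = real_size.price
        self.preemptable = size_spec.get('preemptable', False)
        self.cores = size_spec.get('cores')
        self.scratch = size_spec.get('scratch', 0)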
Updated by Peter Amstutz over 6 years ago
- Not your fault, but a method named validate_scheduling_parameters that is a before_validation and not part of validate is confusing. Validations shouldn't change parameter values (although this isn't technically a validation step...). Specifically, I'm not sure errors.add() does what you expect when it appears in a before_validation rather than a validate. Would you mind cleaning that up so the record adjustments are in before_validation and the value checks are in validate?
- A brief comment about the intention of setting/checking the preemptable flag would be helpful because the logic is slightly convoluted.
- Do we really want to totally disallow making top level containers preemptable, or just not assign them as preemptable by default? Seems like if it is explicitly set in the request, we should honor it.
- It looks like CloudSizeWrapper will still use the value of "id" from the underlying NodeSize object rather than the name used in the "[Size foo]" section title. I think if you add something like size_spec['id'] = sec_words[1] in NodeManagerConfig.node_sizes() then it will use the user-supplied id (see the sketch below).
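A sketch of that suggestion (simplified, not the actual NodeManagerConfig code):

def node_sizes(self):
    # self behaves like a ConfigParser over the nodemanager config file.
    sizes = []
    for sec_name in self.sections():
        sec_words = sec_name.split(None, 2)
        if sec_words[0] != 'Size':
            continue
        size_spec = dict(self.items(sec_name))
        # Use the user-supplied "[Size foo]" name as the id, e.g. "m4.large.spot".
        size_spec['id'] = sec_words[1]
        sizes.append(size_spec)
    return sizes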
Updated by Peter Amstutz over 6 years ago
- Is it necessary to set instance_type on CloudSizeWrapper? After using it to look up the corresponding libcloud NodeSize in NodeManagerConfig.node_sizes(), the instance_type field seems to be redundant with the real size object.
- Additionally, the use of "instance_type" seems to be inconsistent, because when we get it from runtime constraints, it is the Arvados configuration-assigned name of the size, not the cloud provider size id.
- In list_nodes() for ec2, azure and gce we map back from the reported instance size to our node size object (each does it in a slightly different way, of course). However, we need to start mapping back to our arvados-assigned instance type, not the cloud type. This means (a) ComputeNodeDriver.sizes should correspond to ServerCalculator.cloud_sizes (b) we need to store the arvados-assigned instance type on the node as a tag, and use that rather than the cloud's own response.
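Roughly, the tag-based mapping in (b) could look like this (a sketch only; it assumes tags are set through libcloud's ex_metadata argument, and find_size() is the ServerCalculator lookup referenced later in this ticket):

ARVADOS_SIZE_TAG = 'arvados_node_size'

def create_node_with_size_tag(driver, size, **kwargs):
    # Record the Arvados-assigned size name (e.g. "m4.large.spot") as a tag.
    kwargs.setdefault('ex_metadata', {})[ARVADOS_SIZE_TAG] = size.id
    return driver.create_node(size=size.real, **kwargs)

def annotate_listed_nodes(nodes, server_calculator):
    # Map each listed cloud node back to the Arvados-assigned size rather than
    # the cloud-reported instance type, so "m4.large" and "m4.large.spot" differ.
    for node in nodes:
        node.size = server_calculator.find_size(node.extra.get(ARVADOS_SIZE_TAG))
    return nodes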
Updated by Lucas Di Pentima over 6 years ago
Updates at 73872ccc5bb6b80a6049b44b0113085a9c2b6934
Test run: https://ci.curoverse.com/job/developer-run-tests/734/
- Cleaned up validation code on API server
- Avoid redundant attribute instance_type on CloudSizeWrapper
- Override CloudSizeWrapper id with config Size name
- Set arvados_node_size tag on node creation to have a reference to the Arvados-assigned node size
- Use the newly added tag to get the Arvados-assigned node size when receiving the node list
Tests are pending.
Updated by Peter Amstutz over 6 years ago
- I think this is backwards, should be "child containers" or (to align more closely with the logic) "containers with parent containers".
# If preemptable instances (eg: AWS Spot Instances) are allowed,
# automatically ask them on non-child containers by default.
- I don't think this is correct:
self.scheduling_parameters['preemptable'] ||= true
Because if 'preemptable' is 'false' it will be assigned 'true'. I think we want:
if Rails.configuration.preemptable_instances and
   !self.requesting_container_uuid.nil? and
   self.scheduling_parameters['preemptable'].nil?
  self.scheduling_parameters['preemptable'] = true
end
This previous comment isn't addressed:
In list_nodes() for ec2, azure and gce we map back from the reported instance size to our node size object (each does it in a slightly different way, of course). However, we need to start mapping back to our arvados-assigned instance type, not the cloud type. This means (a) ComputeNodeDriver.sizes should correspond to ServerCalculator.cloud_sizes (b) we need to store the arvados-assigned instance type on the node as a tag, and use that rather than the cloud's own response.
I see you are setting arvados_node_size in tags, but not reading it back in list_nodes(). This is a problem because list_nodes() is used to determine whether to start or stop nodes. If we define two node types "m4.large.preemptable" and "m4.large.reserved" but list_nodes() only returns m4.large, then it won't match either size.
Updated by Peter Amstutz over 6 years ago
Followup to last comment: looking up the "arvados node size" happens in CloudNodeListMonitorActor, so that should work.
What happens if someone reconfigures the system and restarts node manager and you get back an arvados_node_size you don't recognize any more? The correct behavior in that case should be to shut the node down.
Updated by Lucas Di Pentima over 6 years ago
- Target version changed from 2018-06-06 Sprint to 2018-06-20 Sprint
Updated by Peter Amstutz over 6 years ago
(04:10:32 PM) lucas: tetron: re:shutting down nodes that don't include a recognized arvados_node_size (last comment at https://dev.arvados.org/issues/7478#note-23), is it a correct approach to just call the destroy_node from CloudNodeListMonitorActor?
(04:11:35 PM) tetron: no
(04:12:24 PM) tetron: welll
(04:12:39 PM) lucas: tetron: Should I assign a proper status so that the pairing mechanism kills it or something like that?
(04:13:54 PM) tetron: if we can do that through the "I am eligible for shutdown" interaction between ComputeNodeMonitorActor and DaemonActor that would be best
(04:14:53 PM) tetron: given how much effort we've spent handling various cloud failure modes I am very hesitant to add another place where we make a cloud API call
(04:15:23 PM) tetron: because then we're back to "oops we got a weird error and now nodemanager is in a death spiral"
(04:16:08 PM) tetron: remember it does create a ComputeNodeMonitorActor for every node, paired or not
(04:16:54 PM) tetron: so it can go through the normal mechanism of discovering the node in the node list, creating a ComputeNodeMonitorActor, then have the MonitorActor decide the node shouldn't exist, and tell daemon "please shut me down"
(04:18:39 PM) lucas: ok, I was trying to kill it as soon as the size is confirmed that is not recognizable because find_size returns None and will create problems when other parts of the code try to access it, I'll look for that approach
(04:19:02 PM) tetron: that's understandable
(04:19:20 PM) tetron: maybe have an "invalid size" stand-in
(04:19:54 PM) lucas: Yes, that could work. Thanks
Updated by Lucas Di Pentima over 6 years ago
Updates at 17f521d7f
Test run: https://ci.curoverse.com/job/developer-run-tests/747/
Since note-22, the updates are:
- Updated API server CR's default preemptable setting logic as suggested
- When a cloud node has an unrecognizable arvados_node_size tag, instead of assigning None as its .size, set an InvalidCloudSize instance so that get_state() returns 'down' and the node gets properly shut down (sketched below)
- Added tests
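A sketch of the stand-in idea (field values are illustrative; only the 'invalid' id and the find_size() fallback follow what is described above):

class InvalidCloudSize(object):
    """Stand-in size for nodes whose arvados_node_size tag is unrecognized."""
    def __init__(self):
        self.id = 'invalid'
        self.name = 'invalid'
        self.cores = 0
        self.ram = 0
        self.disk = 0
        self.scratch = 0
        self.price = 9999999
        self.preemptable = False

def find_size(cloud_sizes, size_id):
    # ServerCalculator-style lookup: return the matching configured size,
    # or the invalid stand-in so callers never have to handle None.
    for size in cloud_sizes:
        if size.id == size_id:
            return size
    return InvalidCloudSize()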
Updated by Lucas Di Pentima over 6 years ago
Updates at b70f9ce54
Test run: https://ci.curoverse.com/job/developer-run-tests/748/
- Fixed a GCE driver issue discovered when running integration tests.
Updated by Peter Amstutz over 6 years ago
Reviewing 7478-anm-spot-instances @ b70f9ce54f1f672b423999e6c07b2f0127b76666
- The check for "self.cloud_node.size.id == 'invalid'" should be in shutdown_eligible() instead of get_state().
Rest LGTM
Updated by Lucas Di Pentima over 6 years ago
Updates at 71db70126
Test run: https://ci.curoverse.com/job/developer-run-tests/749/
Addressed the above suggestion, making shutdown_eligible() responsible for checking for an invalid cloud size. Updated test.
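For reference, a sketch of where that check ends up (simplified from the monitor actor logic, not the exact code):

def shutdown_eligible(cloud_node):
    # A node whose size resolved to the 'invalid' stand-in no longer matches
    # any configured [Size ...] section, so suggest shutting it down.
    if cloud_node.size.id == 'invalid':
        return "node's size tag '%s' not recognizable" % \
            cloud_node.extra.get('arvados_node_size')
    return True  # placeholder for the normal eligibility checks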
Updated by Lucas Di Pentima over 6 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|42a0609a6e287a82ed565413c7392d40141388ae.
Updated by Nico César over 6 years ago
deployed 1.1.4.20180612182441-2 and I see this error:
manage.4xphq:/etc/sv# systemctl restart arvados-node-manager ; journalctl -u arvados-node-manager -f -- Logs begin at Tue 2018-06-05 10:34:26 UTC. -- Jun 12 21:13:00 manage.4xphq.arvadosapi.com env[8606]: 2018-06-12 21:13:00 NodeManagerDaemonActor.a06ef1818b7a[8607] INFO: Compute Optimized Double Extra Large Instance: wishlist 0, up 0 (booting 0, unpaired 0, idle 0, busy 0), down 0, shutdown 0 Jun 12 21:13:00 manage.4xphq.arvadosapi.com env[8606]: 2018-06-12 21:13:00 NodeManagerDaemonActor.a06ef1818b7a[8607] INFO: Double Extra Large Instance: wishlist 0, up 0 (booting 0, unpaired 0, idle 0, busy 0), down 0, shutdown 0 Jun 12 21:13:00 manage.4xphq.arvadosapi.com env[8606]: 2018-06-12 21:13:00 NodeManagerDaemonActor.a06ef1818b7a[8607] INFO: Compute Optimized Extra Large Instance: wishlist 0, up 0 (booting 0, unpaired 0, idle 0, busy 0), down 0, shutdown 0 Jun 12 21:13:00 manage.4xphq.arvadosapi.com env[8606]: 2018-06-12 21:13:00 NodeManagerDaemonActor.a06ef1818b7a[8607] INFO: Extra Large Instance: wishlist 0, up 0 (booting 0, unpaired 0, idle 0, busy 0), down 0, shutdown 0 Jun 12 21:13:00 manage.4xphq.arvadosapi.com env[8606]: 2018-06-12 21:13:00 NodeManagerDaemonActor.a06ef1818b7a[8607] INFO: Extra Large Instance: wishlist 0, up 0 (booting 0, unpaired 0, idle 0, busy 0), down 0, shutdown 0 Jun 12 21:13:00 manage.4xphq.arvadosapi.com env[8606]: 2018-06-12 21:13:00 NodeManagerDaemonActor.a06ef1818b7a[8607] INFO: Compute Optimized Large Instance: wishlist 0, up 0 (booting 0, unpaired 0, idle 0, busy 0), down 0, shutdown 0 Jun 12 21:13:00 manage.4xphq.arvadosapi.com env[8606]: 2018-06-12 21:13:00 NodeManagerDaemonActor.a06ef1818b7a[8607] INFO: Large Instance: wishlist 0, up 0 (booting 0, unpaired 0, idle 0, busy 0), down 0, shutdown 0 Jun 12 21:13:00 manage.4xphq.arvadosapi.com env[8606]: 2018-06-12 21:13:00 NodeManagerDaemonActor.a06ef1818b7a[8607] INFO: Large Instance: wishlist 1, up 0 (booting 0, unpaired 0, idle 0, busy 0), down 0, shutdown 0 Jun 12 21:13:00 manage.4xphq.arvadosapi.com env[8606]: 2018-06-12 21:13:00 JobQueueMonitorActor.140274303566672[8607] INFO: got response with 1 items in 0.254546880722 seconds, next poll at 2018-06-12 21:13:10 Jun 12 21:13:00 manage.4xphq.arvadosapi.com systemd[1]: Stopping Arvados Node Manager Daemon... Jun 12 21:13:12 manage.4xphq.arvadosapi.com systemd[1]: Stopped Arvados Node Manager Daemon. Jun 12 21:13:12 manage.4xphq.arvadosapi.com systemd[1]: Started Arvados Node Manager Daemon. 
Jun 12 21:13:12 manage.4xphq.arvadosapi.com env[11286]: No handlers could be found for logger "status.Handler" Jun 12 21:13:12 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:12 root[11289] INFO: /usr/bin/arvados-node-manager 1.1.4.20180612182441 started, libcloud 2.3.0 Jun 12 21:13:12 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:12 requests.packages.urllib3.connectionpool[11289] DEBUG: Starting new HTTPS connection (1): ec2.us-east-1.amazonaws.com Jun 12 21:13:12 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:12 requests.packages.urllib3.connectionpool[11289] DEBUG: https://ec2.us-east-1.amazonaws.com:443 "GET /?SignatureVersion=2&AWSAccessKeyId=AKIAJCNUIVXKTYNJ5OSQ&Timestamp=2018-06-12T21%3A13%3A12Z&SignatureMethod=HmacSHA256&Version=2016-11-15&Signature=akHRIUej%2BbWx2kgKam9btOFiP3rhUxQ8JlYhrX4S9ZA%3D&Action=DescribeImages&Owner.1=self HTTP/1.1" 200 None Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 requests.packages.urllib3.connectionpool[11289] DEBUG: https://ec2.us-east-1.amazonaws.com:443 "GET /?SignatureVersion=2&AWSAccessKeyId=AKIAJCNUIVXKTYNJ5OSQ&Timestamp=2018-06-12T21%3A13%3A12Z&SignatureMethod=HmacSHA256&Version=2016-11-15&Signature=QeXzl46I%2BGeKbjpmHxj5ZAerIlYKol6Z3uID%2Frr864M%3D&Action=DescribeSecurityGroups HTTP/1.1" 200 None Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 requests.packages.urllib3.connectionpool[11289] DEBUG: https://ec2.us-east-1.amazonaws.com:443 "GET /?SignatureVersion=2&AWSAccessKeyId=AKIAJCNUIVXKTYNJ5OSQ&Timestamp=2018-06-12T21%3A13%3A13Z&SignatureMethod=HmacSHA256&Version=2016-11-15&Signature=uiUkBMKy%2FZB6IPuwnt1MGzbj4Od7YL4%2BZ%2FtKG9XU%2BT4%3D&Action=DescribeSubnets HTTP/1.1" 200 None Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: Using cloud node sizes: Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=m4.large, name=Large Instance, ram=8192 disk=0 bandwidth=None price=0.1 driver=Amazon EC2 ...>, 'preemptable': False, 'name': 'Large Instance', 'extra': {'cpu': 2}, 'scratch': 32000, 'price': 0.1, 'ram': 7782, 'bandwidth': None, 'cores': 2, 'disk': 0, 'id': 'm4.large'} Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=m4.large, name=Large Instance, ram=8192 disk=0 bandwidth=None price=0.1 driver=Amazon EC2 ...>, 'preemptable': True, 'name': 'Large Instance', 'extra': {'cpu': 2}, 'scratch': 32000, 'price': 0.1, 'ram': 7782, 'bandwidth': None, 'cores': 2, 'disk': 0, 'id': 'm4.large.spot'} Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=c3.large, name=Compute Optimized Large Instance, ram=3840 disk=32 bandwidth=None price=0.105 driver=Amazon EC2 ...>, 'preemptable': True, 'name': 'Compute Optimized Large Instance', 'extra': {'cpu': 2}, 'scratch': 32000, 'price': 0.105, 'ram': 3648, 'bandwidth': None, 'cores': 2, 'disk': 32, 'id': 'c3.large.spot'} Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=c3.large, name=Compute Optimized Large Instance, ram=3840 disk=32 bandwidth=None price=0.105 driver=Amazon EC2 ...>, 'preemptable': False, 'name': 'Compute Optimized Large Instance', 'extra': {'cpu': 2}, 'scratch': 32000, 'price': 0.105, 'ram': 3648, 'bandwidth': None, 'cores': 2, 
'disk': 32, 'id': 'c3.large'} Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=m4.xlarge, name=Extra Large Instance, ram=16384 disk=0 bandwidth=None price=0.2 driver=Amazon EC2 ...>, 'preemptable': False, 'name': 'Extra Large Instance', 'extra': {'cpu': 4}, 'scratch': 80000, 'price': 0.2, 'ram': 15564, 'bandwidth': None, 'cores': 4, 'disk': 0, 'id': 'm4.xlarge'} Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=m4.xlarge, name=Extra Large Instance, ram=16384 disk=0 bandwidth=None price=0.2 driver=Amazon EC2 ...>, 'preemptable': True, 'name': 'Extra Large Instance', 'extra': {'cpu': 4}, 'scratch': 80000, 'price': 0.2, 'ram': 15564, 'bandwidth': None, 'cores': 4, 'disk': 0, 'id': 'm4.xlarge.spot'} Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=c3.xlarge, name=Compute Optimized Extra Large Instance, ram=7680 disk=80 bandwidth=None price=0.21 driver=Amazon EC2 ...>, 'preemptable': False, 'name': 'Compute Optimized Extra Large Instance', 'extra': {'cpu': 4}, 'scratch': 80000, 'price': 0.21, 'ram': 7296, 'bandwidth': None, 'cores': 4, 'disk': 80, 'id': 'c3.xlarge'} Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=c3.xlarge, name=Compute Optimized Extra Large Instance, ram=7680 disk=80 bandwidth=None price=0.21 driver=Amazon EC2 ...>, 'preemptable': True, 'name': 'Compute Optimized Extra Large Instance', 'extra': {'cpu': 4}, 'scratch': 80000, 'price': 0.21, 'ram': 7296, 'bandwidth': None, 'cores': 4, 'disk': 80, 'id': 'c3.xlarge.spot'} Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=m4.2xlarge, name=Double Extra Large Instance, ram=32768 disk=0 bandwidth=None price=0.4 driver=Amazon EC2 ...>, 'preemptable': False, 'name': 'Double Extra Large Instance', 'extra': {'cpu': 8}, 'scratch': 160000, 'price': 0.4, 'ram': 31129, 'bandwidth': None, 'cores': 8, 'disk': 0, 'id': 'm4.2xlarge'} Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=m4.2xlarge, name=Double Extra Large Instance, ram=32768 disk=0 bandwidth=None price=0.4 driver=Amazon EC2 ...>, 'preemptable': True, 'name': 'Double Extra Large Instance', 'extra': {'cpu': 8}, 'scratch': 160000, 'price': 0.4, 'ram': 31129, 'bandwidth': None, 'cores': 8, 'disk': 0, 'id': 'm4.2xlarge.spot'} Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=c3.2xlarge, name=Compute Optimized Double Extra Large Instance, ram=15360 disk=160 bandwidth=None price=0.42 driver=Amazon EC2 ...>, 'preemptable': True, 'name': 'Compute Optimized Double Extra Large Instance', 'extra': {'cpu': 8}, 'scratch': 160000, 'price': 0.42, 'ram': 14592, 'bandwidth': None, 'cores': 8, 'disk': 160, 'id': 'c3.2xlarge.spot'} Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=c3.2xlarge, name=Compute Optimized Double Extra Large Instance, ram=15360 disk=160 bandwidth=None price=0.42 driver=Amazon EC2 ...>, 'preemptable': False, 'name': 'Compute Optimized Double Extra Large Instance', 'extra': {'cpu': 8}, 'scratch': 160000, 'price': 0.42, 'ram': 14592, 
'bandwidth': None, 'cores': 8, 'disk': 160, 'id': 'c3.2xlarge'} Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=c3.4xlarge, name=Compute Optimized Quadruple Extra Large Instance, ram=30720 disk=320 bandwidth=None price=0.84 driver=Amazon EC2 ...>, 'preemptable': True, 'name': 'Compute Optimized Quadruple Extra Large Instance', 'extra': {'cpu': 16}, 'scratch': 320000, 'price': 0.84, 'ram': 29184, 'bandwidth': None, 'cores': 16, 'disk': 320, 'id': 'c3.4xlarge.spot'} Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=c3.4xlarge, name=Compute Optimized Quadruple Extra Large Instance, ram=30720 disk=320 bandwidth=None price=0.84 driver=Amazon EC2 ...>, 'preemptable': False, 'name': 'Compute Optimized Quadruple Extra Large Instance', 'extra': {'cpu': 16}, 'scratch': 320000, 'price': 0.84, 'ram': 29184, 'bandwidth': None, 'cores': 16, 'disk': 320, 'id': 'c3.4xlarge'} Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=c3.8xlarge, name=Compute Optimized Eight Extra Large Instance, ram=61440 disk=640 bandwidth=None price=1.68 driver=Amazon EC2 ...>, 'preemptable': False, 'name': 'Compute Optimized Eight Extra Large Instance', 'extra': {'cpu': 32}, 'scratch': 640000, 'price': 1.68, 'ram': 58368, 'bandwidth': None, 'cores': 32, 'disk': 640, 'id': 'c3.8xlarge'} Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 arvnodeman.jobqueue[11289] INFO: {'real': <NodeSize: id=c3.8xlarge, name=Compute Optimized Eight Extra Large Instance, ram=61440 disk=640 bandwidth=None price=1.68 driver=Amazon EC2 ...>, 'preemptable': True, 'name': 'Compute Optimized Eight Extra Large Instance', 'extra': {'cpu': 32}, 'scratch': 640000, 'price': 1.68, 'ram': 58368, 'bandwidth': None, 'cores': 32, 'disk': 640, 'id': 'c3.8xlarge.spot'} Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 pykka[11289] DEBUG: Registered TimedCallBackActor (urn:uuid:e79cfca2-e7db-4441-aaab-49fcbcee068e) Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 pykka[11289] DEBUG: Starting TimedCallBackActor (urn:uuid:e79cfca2-e7db-4441-aaab-49fcbcee068e) Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 pykka[11289] DEBUG: Registered CloudNodeListMonitorActor (urn:uuid:8a03c978-fa6e-442e-85f1-25a89ac98acb) Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 pykka[11289] DEBUG: Starting CloudNodeListMonitorActor (urn:uuid:8a03c978-fa6e-442e-85f1-25a89ac98acb) Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 pykka[11289] DEBUG: Registered ArvadosNodeListMonitorActor (urn:uuid:4e4f4b1b-add6-4a06-8439-0871117c6d41) Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 pykka[11289] DEBUG: Starting ArvadosNodeListMonitorActor (urn:uuid:4e4f4b1b-add6-4a06-8439-0871117c6d41) Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 pykka[11289] DEBUG: Registered JobQueueMonitorActor (urn:uuid:2a47f596-37a8-49d9-9e97-526f2e85e829) Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 pykka[11289] DEBUG: Starting JobQueueMonitorActor (urn:uuid:2a47f596-37a8-49d9-9e97-526f2e85e829) Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 pykka[11289] DEBUG: Registered ComputeNodeUpdateActor 
(urn:uuid:92794057-f151-4d7b-8366-a7928bd47f1c) Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 pykka[11289] DEBUG: Starting ComputeNodeUpdateActor (urn:uuid:92794057-f151-4d7b-8366-a7928bd47f1c) Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 JobQueueMonitorActor.140593208914768[11289] DEBUG: urn:uuid:e27ac108-d616-48d5-aef5-e1a8b77a0365 subscribed to all events Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 JobQueueMonitorActor.140593208914768[11289] DEBUG: sending request Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 ArvadosNodeListMonitorActor.140593211085648[11289] DEBUG: urn:uuid:e27ac108-d616-48d5-aef5-e1a8b77a0365 subscribed to all events Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 ArvadosNodeListMonitorActor.140593211085648[11289] DEBUG: sending request Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 CloudNodeListMonitorActor.140593232598720[11289] DEBUG: urn:uuid:e27ac108-d616-48d5-aef5-e1a8b77a0365 subscribed to all events Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 CloudNodeListMonitorActor.140593232598720[11289] DEBUG: sending request Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 requests.packages.urllib3.connectionpool[11289] DEBUG: Starting new HTTPS connection (1): ec2.us-east-1.amazonaws.com Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 pykka[11289] DEBUG: Registered NodeManagerDaemonActor (urn:uuid:e27ac108-d616-48d5-aef5-e1a8b77a0365) Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 pykka[11289] DEBUG: Starting NodeManagerDaemonActor (urn:uuid:e27ac108-d616-48d5-aef5-e1a8b77a0365) Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 pykka[11289] DEBUG: Registered WatchdogActor (urn:uuid:ca05efc5-db63-412f-b0e1-4f56bb11f6c6) Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 pykka[11289] DEBUG: Starting WatchdogActor (urn:uuid:ca05efc5-db63-412f-b0e1-4f56bb11f6c6) Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 NodeManagerDaemonActor.e1a8b77a0365[11289] DEBUG: Daemon started Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 requests.packages.urllib3.connectionpool[11289] DEBUG: https://ec2.us-east-1.amazonaws.com:443 "GET /?Filter.3.Value.1=4xphq&AWSAccessKeyId=AKIAJCNUIVXKTYNJ5OSQ&Filter.1.Name=instance-state-name&Filter.2.Value.1=dynamic-compute&SignatureMethod=HmacSHA256&Filter.3.Name=tag%3Acluster&Signature=aOZkPquswRZvn7Fx6xGIAWAxZNUhNMHho%2FqweBdq5hQ%3D&Action=DescribeInstances&Filter.1.Value.1=running&SignatureVersion=2&Timestamp=2018-06-12T21%3A13%3A13Z&Version=2016-11-15&Filter.2.Name=tag%3Aarvados-class HTTP/1.1" 200 None Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 requests.packages.urllib3.connectionpool[11289] DEBUG: https://ec2.us-east-1.amazonaws.com:443 "GET /?SignatureVersion=2&AWSAccessKeyId=AKIAJCNUIVXKTYNJ5OSQ&Timestamp=2018-06-12T21%3A13%3A13Z&SignatureMethod=HmacSHA256&Version=2016-11-15&Signature=v0y0o%2Fa%2FhUU9MvgQS75zLDv%2FUsQYHEsNJDj9zxsJpPc%3D&Action=DescribeAddresses HTTP/1.1" 200 None Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 CloudNodeListMonitorActor.140593232598720[11289] ERROR: got error: global name 'InvalidCloudSize' is not defined - will try again in 20.0 seconds Jun 12 21:13:13 
manage.4xphq.arvadosapi.com env[11286]: Traceback (most recent call last): Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: File "/usr/lib/python2.7/dist-packages/arvnodeman/clientactor.py", line 99, in poll Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: response = self._send_request() Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: File "/usr/lib/python2.7/dist-packages/arvnodeman/nodelist.py", line 86, in _send_request Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: n.size = self._calculator.find_size(n.extra['arvados_node_size']) Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: File "/usr/lib/python2.7/dist-packages/arvnodeman/jobqueue.py", line 142, in find_size Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: return InvalidCloudSize() Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: NameError: global name 'InvalidCloudSize' is not defined Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 ArvadosNodeListMonitorActor.140593211085648[11289] INFO: got response with 48 items in 0.229659795761 seconds, next poll at 2018-06-12 21:13:23 Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 NodeManagerDaemonActor.e1a8b77a0365[11289] INFO: Registering new Arvados node 4xphq-7ekkf-bnkig53t8l0x1ci Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 NodeManagerDaemonActor.e1a8b77a0365[11289] INFO: Registering new Arvados node 4xphq-7ekkf-9ct5e14ouidq1x3 Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 NodeManagerDaemonActor.e1a8b77a0365[11289] INFO: Registering new Arvados node 4xphq-7ekkf-k0es9pjugpjv7f0 Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 NodeManagerDaemonActor.e1a8b77a0365[11289] INFO: Registering new Arvados node 4xphq-7ekkf-2fa53rvm0uaoxnl Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 NodeManagerDaemonActor.e1a8b77a0365[11289] INFO: Registering new Arvados node 4xphq-7ekkf-tbrex80emflesql Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 NodeManagerDaemonActor.e1a8b77a0365[11289] INFO: Registering new Arvados node 4xphq-7ekkf-xyhjrnam94g23h1 Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 NodeManagerDaemonActor.e1a8b77a0365[11289] INFO: Registering new Arvados node 4xphq-7ekkf-jbgfjgqgefs6dzl Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 NodeManagerDaemonActor.e1a8b77a0365[11289] INFO: Registering new Arvados node 4xphq-7ekkf-lg95nmgds6bdb4d Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 NodeManagerDaemonActor.e1a8b77a0365[11289] INFO: Registering new Arvados node 4xphq-7ekkf-0446wy2b6ofp838 Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 NodeManagerDaemonActor.e1a8b77a0365[11289] INFO: Registering new Arvados node 4xphq-7ekkf-spv6sghrbe05g5i Jun 12 21:13:13 manage.4xphq.arvadosapi.com env[11286]: 2018-06-12 21:13:13 NodeManagerDaemonActor.e1a8b77a0365[11289] INFO: Registering new Arvados node 4xphq-7ekkf-gym16nks2abf1c2
Updated by Nico César over 6 years ago
after monkeypatch
Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: 2018-06-12 21:20:36 ComputeNodeMonitorActor.ba3fafcf2920.compute2.4xphq.arvadosapi.com[16677] DEBUG: Suggesting shutdown because node's size tag 'None' not recognizable Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: 2018-06-12 21:20:36 NodeManagerDaemonActor.3b78803a4fc8[16677] INFO: Cloud node compute2.4xphq.arvadosapi.com is now paired with Arvados node 4xphq-7ekkf-bnkig53t8l0x1ci with hostname compute2 Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: 2018-06-12 21:20:36 ArvadosNodeListMonitorActor.139825145078608[16677] DEBUG: urn:uuid:c8b286db-e3b8-4c82-9dc8-ba3fafcf2920 subscribed to events for '4xphq-7ekkf-bnkig53t8l0x1ci' Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: 2018-06-12 21:20:36 ComputeNodeMonitorActor.e6ea226582fa.compute1.4xphq.arvadosapi.com[16677] DEBUG: Suggesting shutdown because node's size tag 'None' not recognizable Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: 2018-06-12 21:20:36 NodeManagerDaemonActor.3b78803a4fc8[16677] INFO: Cloud node compute1.4xphq.arvadosapi.com is now paired with Arvados node 4xphq-7ekkf-k0es9pjugpjv7f0 with hostname compute1 Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: 2018-06-12 21:20:36 ArvadosNodeListMonitorActor.139825145078608[16677] DEBUG: urn:uuid:29b88fe4-700f-41f6-807e-e6ea226582fa subscribed to events for '4xphq-7ekkf-k0es9pjugpjv7f0' Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: 2018-06-12 21:20:36 ComputeNodeMonitorActor.ab0072e44ed9.compute3.4xphq.arvadosapi.com[16677] DEBUG: Suggesting shutdown because node's size tag 'None' not recognizable Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: 2018-06-12 21:20:36 NodeManagerDaemonActor.3b78803a4fc8[16677] INFO: Cloud node compute3.4xphq.arvadosapi.com is now paired with Arvados node 4xphq-7ekkf-9ct5e14ouidq1x3 with hostname compute3 Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: 2018-06-12 21:20:36 ArvadosNodeListMonitorActor.139825145078608[16677] DEBUG: urn:uuid:a201ef36-87cc-4f9f-abb9-ab0072e44ed9 subscribed to events for '4xphq-7ekkf-9ct5e14ouidq1x3' Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: scontrol: error: Weight value (9999999000) is greater than 4294967280 Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: No changes specified Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: 2018-06-12 21:20:36 ComputeNodeUpdateActor.bedafdd32e4c[16677] ERROR: SLURM update ['scontrol', 'update', u'NodeName=compute2', 'Weight=9999999000', 'Features=instancetype=invalid'] failed Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: Traceback (most recent call last): Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: File "/usr/lib/python2.7/dist-packages/arvnodeman/computenode/dispatch/slurm.py", line 26, in _update_slurm_node Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: subprocess.check_output(cmd) Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: File "/usr/lib/python2.7/subprocess.py", line 219, in check_output Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: raise CalledProcessError(retcode, cmd, output=output) Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: CalledProcessError: Command '['scontrol', 'update', u'NodeName=compute2', 'Weight=9999999000', 'Features=instancetype=invalid']' returned non-zero exit status 1 Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: scontrol: error: Weight value (9999999000) is greater than 4294967280 Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: No 
changes specified Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: 2018-06-12 21:20:36 ComputeNodeUpdateActor.bedafdd32e4c[16677] ERROR: SLURM update ['scontrol', 'update', u'NodeName=compute1', 'Weight=9999999000', 'Features=instancetype=invalid'] failed Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: Traceback (most recent call last): Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: File "/usr/lib/python2.7/dist-packages/arvnodeman/computenode/dispatch/slurm.py", line 26, in _update_slurm_node Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: subprocess.check_output(cmd) Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: File "/usr/lib/python2.7/subprocess.py", line 219, in check_output Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: raise CalledProcessError(retcode, cmd, output=output) Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: CalledProcessError: Command '['scontrol', 'update', u'NodeName=compute1', 'Weight=9999999000', 'Features=instancetype=invalid']' returned non-zero exit status 1 Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: scontrol: error: Weight value (9999999000) is greater than 4294967280 Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: No changes specified Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: 2018-06-12 21:20:36 ComputeNodeUpdateActor.bedafdd32e4c[16677] ERROR: SLURM update ['scontrol', 'update', u'NodeName=compute3', 'Weight=9999999000', 'Features=instancetype=invalid'] failed Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: Traceback (most recent call last): Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: File "/usr/lib/python2.7/dist-packages/arvnodeman/computenode/dispatch/slurm.py", line 26, in _update_slurm_node Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: subprocess.check_output(cmd) Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: File "/usr/lib/python2.7/subprocess.py", line 219, in check_output Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: raise CalledProcessError(retcode, cmd, output=output) Jun 12 21:20:36 manage.4xphq.arvadosapi.com env[16674]: CalledProcessError: Command '['scontrol', 'update', u'NodeName=compute3', 'Weight=9999999000', 'Features=instancetype=invalid']' returned non-zero exit status 1
Updated by Nico César over 6 years ago
I manually applied puppet branch 7478-spot-instances-4xphq into 4xphq and disabled puppet
We're testing this with Lucas.
Updated by Lucas Di Pentima over 6 years ago
Updates at 115a5e886 - branch 7478-invalid-size-not-defined
Test run: https://ci.curoverse.com/job/developer-run-tests/750/
- Fixes InvalidCloudSize instantiation
- Fixes arvados_node_size tag retrieval
- Adds node size related information to logs when referring to a size by name
I believe the scontrol error message is related to stopping unrecognized nodes.
I did some more testing running normal (not preemptable) CRs on 4xphq and it seems that it's working OK. Just in case, I left nodemanager stopped.
I also added spot sizes on 4xphq c-d-s config to match those already added to nodemanager.
Pending: test spot instance creation. Before enabling spot instances on child containers on the API server, we can add preemptable = true to any "non-spot" cloud size on nodemanager (for example m4.large) and run something while keeping an eye on the AWS console. If that is successful, we could enable the API server's preemptable_instances = true configuration and check that child containers get their scheduling parameter set as expected.
Updated by Nico César over 6 years ago
Review at 115a5e8861ef0a46224b2cd64568b30c884908fb: this looks like a good bugfix to me.
Ready to merge.
Updated by Lucas Di Pentima over 6 years ago
Following tests with Nico, we've discovered an error when setting nodemanager's libcloud dependencies. I'll make a new branch for that.
Updated by Lucas Di Pentima over 6 years ago
Updates at 089b68192 - branch 7478-anm-libcloud-deps-fix
Test run: https://ci.curoverse.com/job/developer-run-tests/751/
Updated nodemanager's install dependency to use the libcloud fork with spot instance support.
Updated by Nico César over 6 years ago
Review at 089b68192 - branch 7478-anm-libcloud-deps-fix
LGTM
Updated by Lucas Di Pentima over 6 years ago
Branch 7478-s-preemptable-preemptible - a8bfbac31
Test run: https://ci.curoverse.com/job/developer-run-tests/766/
As suggested by Tom, replaced the term 'preemptable' with 'preemptible'.
Also added config & documentation on nodemanager's EC2 example config file for spot instances.
Updated by Lucas Di Pentima over 6 years ago
Branch 7478-auto-preemptible-cr-fix - 36da5d97f623f0c2c944829ca8410a3bea388b19
Test run: https://ci.curoverse.com/job/developer-run-tests/770/
API server wasn't automatically adding the preemptible scheduling parameter on child container requests when 'Rails.configuration.preemptible_instances = true' because of a callback ordering issue.
Updated by Lucas Di Pentima over 6 years ago
Further testing on 4xphq shows that when the CR has the preemptible=true scheduling parameter, c-d-s isn't requesting the correct instance type, seemingly ignoring this parameter.
Updated by Lucas Di Pentima over 6 years ago
- Related to Bug #13649: c-d-s doesn't request a preemptible instance when it should added
Updated by Peter Amstutz over 6 years ago
Lucas Di Pentima wrote:
Branch 7478-auto-preemptible-cr-fix - 36da5d97f623f0c2c944829ca8410a3bea388b19
Test run: https://ci.curoverse.com/job/developer-run-tests/770/
API server wasn't automatically adding the preemptible scheduling parameter on child container requests when 'Rails.configuration.preemptible_instances = true' because of a callback ordering issue.
Specifically, :set_default_preemptible_scheduling_parameter would run before :set_requesting_container_uuid when it needs to run after it.
- I don't understand what the test changes have to do with the callback ordering change
- Seems like an opportunity to write the test that would have detected the mistake in the first place
Updated by Lucas Di Pentima over 6 years ago
Rebased and tried again: 29e80f471f1d70d1d1eda43b05e0f2e059564509
Test run: https://ci.curoverse.com/job/developer-run-tests/772/
Moved the set_requesting_container_uuid and set_default_preemptible_scheduling_parameter callbacks to run on before_save, adding an extra check on set_requesting_container_uuid to avoid reassigning the field, so that both cases are taken into account:
- Create CR, and later change state to Committed
- Create CR with state=Committed
Added a test for the newly fixed case.