Feature #8186
[Node Manager] Support (ephemeral) EBS storage for AWS node types that do not have instance storage, like the M4/C4 classes.
Description
Currently node manager only distinguishes between cloud instance types. Enable the admin to specify the amount of additional storage for specific instance types on AWS.
[Size m4.large]
cores = 2
scratch = 500
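As an illustration only, a per-size section like the one above could be read with Python's standard configparser. This is a minimal sketch, not node manager's actual config loader: the load_sizes helper and the SAMPLE text are hypothetical, and the unit of scratch is assumed to be GB, based on the 500 GB file system mentioned in this description.

```python
import configparser

# Hypothetical sample mirroring the section proposed in this ticket.
SAMPLE = """
[Size m4.large]
cores = 2
scratch = 500
"""


def load_sizes(text):
    """Return {size_name: {"cores": int, "scratch": int}} for each [Size ...] section.

    Assumption: scratch is expressed in GB. Missing options come back as None.
    """
    parser = configparser.ConfigParser()
    parser.read_string(text)
    sizes = {}
    for section in parser.sections():
        kind, _, name = section.partition(" ")
        if kind != "Size" or not name:
            continue  # ignore sections that do not describe a node size
        sizes[name] = {
            "cores": parser.getint(section, "cores", fallback=None),
            "scratch": parser.getint(section, "scratch", fallback=None),
        }
    return sizes


if __name__ == "__main__":
    print(load_sizes(SAMPLE))
```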
Implementation:
Determine how much instance storage is available by default for the node type. If additional space is needed, attach an EBS device.
This is configured via the ex_blockdevicemappings argument to libcloud's create_node(). The mapping format is documented at https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_BlockDeviceMapping.html
Disks should use VolumeType: 'gp2' (General Purpose SSD), set DeleteOnTermination: true, and specify a VolumeSize that makes up the difference between instance storage (if any) and the required space.
The compute node boot scripts are expected to discover both instance and EBS storage devices and combine them into a single logical partition / file system. In the above example, after boot time configuration the resulting node should have a single 500 GB file system for scratch space.
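The mapping described above could be built roughly as follows. This is a sketch under stated assumptions, not the committed implementation: build_scratch_mappings is a hypothetical helper, both sizes are assumed to be in GB, and the default device name /dev/xvdt is borrowed from a later comment on this ticket (EC2 rejects a mapping that has no device name).

```python
def build_scratch_mappings(required_gb, instance_gb, device_name="/dev/xvdt"):
    """Return a list suitable for ex_blockdevicemappings, or [] if no EBS is needed.

    required_gb: scratch space requested for the node size (assumed GB).
    instance_gb: instance storage the node type already provides (assumed GB).
    device_name: hypothetical default, taken from a later comment in this thread.
    """
    shortfall = required_gb - instance_gb
    if shortfall <= 0:
        # Instance storage already covers the request; attach nothing.
        return []
    return [{
        "DeviceName": device_name,
        "Ebs": {
            "DeleteOnTermination": True,  # volume goes away with the node
            "VolumeType": "gp2",          # General Purpose SSD
            "VolumeSize": shortfall,      # only the missing space
        },
    }]


if __name__ == "__main__":
    # m4.large has no instance storage, so the whole 500 GB comes from EBS.
    print(build_scratch_mappings(500, 0))
```

The returned list would then be passed as the ex_blockdevicemappings keyword argument to create_node(), alongside the other node creation arguments.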
Updated by Brett Smith about 9 years ago
- Target version set to Arvados Future Sprints
Updated by Tom Morris over 7 years ago
- Target version changed from Arvados Future Sprints to 2017-06-21 sprint
Updated by Tom Morris over 7 years ago
- Target version changed from 2017-06-21 sprint to 2017-07-05 sprint
Updated by Peter Amstutz over 7 years ago
- Target version changed from 2017-07-05 sprint to 2017-06-21 sprint
- Story points set to 1.0
Updated by Peter Amstutz over 7 years ago
2017-06-12 19:27:01 ComputeNodeMonitorActor.3be867299275.dynamic.compute.4xphq.arvadosapi.com[12399] DEBUG: Not eligible for shut down because node state is ('unpaired', 'closed', 'boot wait', 'idle exceeded')
2017-06-12 19:27:01 ComputeNodeSetupActor.d5fe26afe32c[12399] INFO: Sending create_node request for node size Medium Instance.
scratch, size 20 4
kw {'ex_userdata': 'https://4xphq.arvadosapi.com/arvados/v1/nodes/4xphq-7ekkf-hp20ntfzt46cuo7/ping?ping_secret=3x5mf7ig73ydwngh2nw6j8xvynffo2us5zuesoqe7kzfeaq4dm', 'ex_blockdevicemappings': [{'Ebs': {'DeleteOnTermination': True, 'VolumeType': 'gp2', 'VolumeSize': 16}}], 'name': 'testing2.4xphq.arvadosapi.com'}
2017-06-12 19:27:02 ComputeNodeSetupActor.d5fe26afe32c[12399] WARNING: Re-raising error (no retry): InvalidBlockDeviceMapping: Missing device name
Traceback (most recent call last):
  File "/home/tetron/arvados/services/nodemanager/arvnodeman/computenode/__init__.py", line 78, in retry_wrapper
    ret = orig_func(self, *args, **kwargs)
  File "/home/tetron/arvados/services/nodemanager/arvnodeman/computenode/dispatch/__init__.py", line 133, in create_cloud_node
    self.arvados_node)
  File "/home/tetron/arvados/services/nodemanager/arvnodeman/computenode/driver/__init__.py", line 181, in create_node
    raise create_error
BaseHTTPError: InvalidBlockDeviceMapping: Missing device name
2017-06-12 19:27:02 ComputeNodeSetupActor.d5fe26afe32c[12399] ERROR: Actor error InvalidBlockDeviceMapping: Missing device name
2017-06-12 19:27:02 ComputeNodeSetupActor.d5fe26afe32c[12399] INFO: finished
Updated by Peter Amstutz over 7 years ago
Fixed: the scratch space block device is now set to /dev/xvdt.
Updated by Lucas Di Pentima over 7 years ago
- Several tests are failing with this message: AttributeError: 'MockSize' object has no attribute 'scratch'
- File services/nodemanager/arvnodeman/computenode/driver/ec2.py
  - Line 73: Is the Arvados/SLURM scratch value always an int? Or would it be convenient to force that division to be an int?
  - Line 79: gp2 EBS sizes go from 1 to 16384 (as per the documentation), should we cap the requested size between these values?
- It seems that FakeAwsDriver isn’t used on an integration test, missing commit?
Updated by Peter Amstutz over 7 years ago
Lucas Di Pentima wrote:
- Several tests are failing with this message: AttributeError: 'MockSize' object has no attribute 'scratch'
Fixed.
- File services/nodemanager/arvnodeman/computenode/driver/ec2.py
  - Line 73: Is the Arvados/SLURM scratch value always an int? Or would it be convenient to force that division to be an int?
I coerced it to int() and also added +1 to round up.
  - Line 79: gp2 EBS sizes go from 1 to 16384 (as per the documentation), should we cap the requested size between these values?
Done.
- It seems that FakeAwsDriver isn’t used on an integration test, missing commit?
I was using it for manual testing. It really just reports node sizes that look like ec2 nodes instead of the default (which look like Azure node sizes).
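For readers following the review, the size handling described in these replies (coerce to int, add 1 to round up, cap to the gp2 range of 1 to 16384) might look like the sketch below. It is written from the description in this comment rather than from the committed code, and ebs_volume_size is a hypothetical name.

```python
# gp2 volume size limits quoted in the review above.
GP2_MIN_SIZE = 1
GP2_MAX_SIZE = 16384


def ebs_volume_size(shortfall):
    """Turn a possibly fractional storage shortfall into a valid gp2 VolumeSize.

    Follows the approach described in the reply: int() truncates, and +1 rounds up.
    Note that this over-allocates by one unit when the input is already a whole number.
    """
    size = int(shortfall) + 1
    # Clamp into the range that gp2 volumes accept.
    return max(GP2_MIN_SIZE, min(size, GP2_MAX_SIZE))


if __name__ == "__main__":
    for value in (15.2, 16, 20000):
        print(value, "->", ebs_volume_size(value))
```

Using math.ceil() instead of int() + 1 would avoid the extra unit for whole-number inputs. The sketch keeps the int() + 1 form because that is what the reply describes.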
Updated by Lucas Di Pentima over 7 years ago
Just a couple of details:
- Could you add a comment regarding EBS hardcoded limits? Maybe in the future that changes.
- If we're accepting a request for more storage than we can provide, should we log a warning message?
Running the services/nodemanager tests locally, one test fails:
======================================================================
ERROR: test_arvados_node_not_cleaned_after_shutdown_cancelled (tests.test_computenode_dispatch_slurm.SLURMComputeNodeShutdownActorTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/lucas/arvados_local/tmp/VENVDIR/local/lib/python2.7/site-packages/mock/mock.py", line 1305, in patched
    return func(*args, **keywargs)
  File "/home/lucas/arvados_local/services/nodemanager/tests/test_computenode_dispatch.py", line 241, in test_arvados_node_not_cleaned_after_shutdown_cancelled
    self.check_success_flag(False, 2)
  File "/home/lucas/arvados_local/services/nodemanager/tests/test_computenode_dispatch.py", line 197, in check_success_flag
    last_flag = self.shutdown_actor.success.get(self.TIMEOUT)
  File "/home/lucas/arvados_local/tmp/VENVDIR/local/lib/python2.7/site-packages/pykka/threading.py", line 52, in get
    compat.reraise(*self._data['exc_info'])
  File "/home/lucas/arvados_local/tmp/VENVDIR/local/lib/python2.7/site-packages/pykka/compat.py", line 12, in reraise
    exec('raise tp, value, tb')
  File "/home/lucas/arvados_local/tmp/VENVDIR/local/lib/python2.7/site-packages/pykka/actor.py", line 431, in ask
    self.tell(message)
  File "/home/lucas/arvados_local/tmp/VENVDIR/local/lib/python2.7/site-packages/pykka/actor.py", line 398, in tell
    raise ActorDeadError('%s not found' % self)
ActorDeadError: ComputeNodeShutdownActor (urn:uuid:d7382f42-a9d0-47ec-b5b1-8ee97ccb8255) not found
The rest LGTM. Thanks.
Updated by Peter Amstutz over 7 years ago
- Status changed from New to Resolved
- % Done changed from 50 to 100
Applied in changeset arvados|commit:f054bc3d7d3d26962e62c2ea7c27214b08e85bb6.