Feature #8186

[Node Manager] Support (ephemeral) EBS storage for AWS node types that do not have instance storage, like the M4/C4 classes.

Added by Ward Vandewege over 6 years ago. Updated almost 5 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Start date: 06/14/2017
Due date:
% Done: 100%
Estimated time: (Total: 0.50 h)
Story points: 1.0

Description

Currently Node Manager distinguishes compute nodes only by cloud instance type. Enable the admin to specify the amount of additional storage for specific instance types on AWS.

[Size m4.large]
cores = 2
scratch = 500
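
A minimal sketch of how a section like the one above could be read, assuming the configuration is INI-style and parsed with Python's standard configparser. The helper name read_sizes, and the assumption that every key in a Size section holds an integer, are illustrative and not taken from Node Manager itself.

```python
import configparser

# The example section from this ticket, inlined so the sketch is self-contained.
EXAMPLE = """
[Size m4.large]
cores = 2
scratch = 500
"""

def read_sizes(text):
    """Return {size name: {option: int}} for every '[Size <name>]' section.

    Hypothetical helper: assumes all options in a Size section are integers.
    """
    parser = configparser.ConfigParser()
    parser.read_string(text)
    sizes = {}
    for section in parser.sections():
        if section.startswith("Size "):
            name = section[len("Size "):]
            sizes[name] = {
                key: parser.getint(section, key)
                for key in parser.options(section)
            }
    return sizes

sizes = read_sizes(EXAMPLE)
```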

Implementation:

Determine how much instance storage is available by default for the node type. If additional space is needed, attach an EBS device.

This is configured via the ex_blockdevicemappings argument to libcloud's create_node() and is documented at https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_BlockDeviceMapping.html

Disks should be VolumeType: 'gp2' (General Purpose SSD), have DeleteOnTermination: true, and specify a VolumeSize: that makes up the difference between instance storage (if any) and the required space.
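
As a rough illustration of the mapping described above, the sketch below builds the list that would be passed as ex_blockdevicemappings. The function name ebs_mapping and its parameters are hypothetical, sizes are assumed to be in GB, and the default device name /dev/xvdt is the one reported later in this thread; EC2 rejects a mapping without a device name.

```python
def ebs_mapping(required_gb, instance_gb, device_name="/dev/xvdt"):
    """Build a block device mapping that covers the storage shortfall.

    Hypothetical helper. Returns an empty list when instance storage
    already satisfies the requirement, so no EBS volume is requested.
    """
    extra_gb = required_gb - instance_gb
    if extra_gb <= 0:
        return []
    return [{
        "DeviceName": device_name,
        "Ebs": {
            "VolumeType": "gp2",          # General Purpose SSD
            "DeleteOnTermination": True,  # volume is ephemeral, like the node
            "VolumeSize": extra_gb,       # difference between required and instance storage
        },
    }]

# Keyword arguments that would accompany a create_node() call (assumption:
# the other arguments, such as name and image, are supplied elsewhere).
kw = {"ex_blockdevicemappings": ebs_mapping(required_gb=20, instance_gb=4)}
```

With 20 GB required and 4 GB of instance storage, the requested volume is 16 GB, which matches the request shown in the log in note #8.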

The compute node boot scripts are expected to discover both instance and EBS storage devices and combine them into a single logical partition or file system. In the above example, after boot-time configuration the resulting node should have a single 500 GB file system for scratch space.


Subtasks

Task #11853: Ensure that libcloud 0.20.2dev3 is installed on all clusters (Resolved, Nico César)

Task #11828: Review 8186-nodemanager-ebs (Resolved, Peter Amstutz)


Related issues

Copied to Arvados - Feature #10183: [Node Manager] Support ephemeral additional storage for Azure node types that do not have sufficient instance storage (Closed, 10/04/2016)

Associated revisions

Revision f054bc3d
Added by Peter Amstutz almost 5 years ago

Merge branch '8186-nodemanager-ebs' closes #8186

Arvados-DCO-1.1-Signed-off-by: Peter Amstutz <>

History

#1 Updated by Brett Smith over 6 years ago

  • Target version set to Arvados Future Sprints

#2 Updated by Peter Amstutz about 5 years ago

  • Description updated (diff)

#3 Updated by Tom Morris almost 5 years ago

  • Target version changed from Arvados Future Sprints to 2017-06-21 sprint

#4 Updated by Tom Morris almost 5 years ago

  • Target version changed from 2017-06-21 sprint to 2017-07-05 sprint

#5 Updated by Peter Amstutz almost 5 years ago

  • Description updated (diff)

#6 Updated by Peter Amstutz almost 5 years ago

  • Target version changed from 2017-07-05 sprint to 2017-06-21 sprint
  • Story points set to 1.0

#7 Updated by Peter Amstutz almost 5 years ago

  • Assigned To set to Peter Amstutz

#8 Updated by Peter Amstutz almost 5 years ago

2017-06-12 19:27:01 ComputeNodeMonitorActor.3be867299275.dynamic.compute.4xphq.arvadosapi.com[12399] DEBUG: Not eligible for shut down because node state is ('unpaired', 'closed', 'boot wait', 'idle exceeded')
2017-06-12 19:27:01 ComputeNodeSetupActor.d5fe26afe32c[12399] INFO: Sending create_node request for node size Medium Instance.
scratch, size 20 4
kw {'ex_userdata': 'https://4xphq.arvadosapi.com/arvados/v1/nodes/4xphq-7ekkf-hp20ntfzt46cuo7/ping?ping_secret=3x5mf7ig73ydwngh2nw6j8xvynffo2us5zuesoqe7kzfeaq4dm', 'ex_blockdevicemappings': [{'Ebs': {'DeleteOnTermination': True, 'VolumeType': 'gp2', 'VolumeSize': 16}}], 'name': 'testing2.4xphq.arvadosapi.com'}
2017-06-12 19:27:02 ComputeNodeSetupActor.d5fe26afe32c[12399] WARNING: Re-raising error (no retry): InvalidBlockDeviceMapping: Missing device name
Traceback (most recent call last):
  File "/home/tetron/arvados/services/nodemanager/arvnodeman/computenode/__init__.py", line 78, in retry_wrapper
    ret = orig_func(self, *args, **kwargs)
  File "/home/tetron/arvados/services/nodemanager/arvnodeman/computenode/dispatch/__init__.py", line 133, in create_cloud_node
    self.arvados_node)
  File "/home/tetron/arvados/services/nodemanager/arvnodeman/computenode/driver/__init__.py", line 181, in create_node
    raise create_error
BaseHTTPError: InvalidBlockDeviceMapping: Missing device name
2017-06-12 19:27:02 ComputeNodeSetupActor.d5fe26afe32c[12399] ERROR: Actor error InvalidBlockDeviceMapping: Missing device name
2017-06-12 19:27:02 ComputeNodeSetupActor.d5fe26afe32c[12399] INFO: finished

#9 Updated by Peter Amstutz almost 5 years ago

Fixed: set the scratch space block device name to /dev/xvdt.

#10 Updated by Lucas Di Pentima almost 5 years ago

  • Several tests are failing with this message: AttributeError: 'MockSize' object has no attribute 'scratch'
  • File services/nodemanager/arvnodeman/computenode/driver/ec2.py
    • Line 73: Is Arvados/SLURM scratch value always an int? Or would it be convenient to force that division to be an int?
    • Line 79: gp2 Ebs sizes go from 1 to 16384 (as per the documentation), should we cap the requested size between these values?
  • It seems that FakeAwsDriver isn’t used on an integration test, missing commit?

#11 Updated by Peter Amstutz almost 5 years ago

Lucas Di Pentima wrote:

  • Several tests are failing with this message: AttributeError: 'MockSize' object has no attribute 'scratch'

Fixed.

  • File services/nodemanager/arvnodeman/computenode/driver/ec2.py
    • Line 73: Is Arvados/SLURM scratch value always an int? Or would it be convenient to force that division to be an int?

I coerced it to int() and also added +1 to round up.

  • Line 79: gp2 Ebs sizes go from 1 to 16384 (as per the documentation), should we cap the requested size between these values?

Done.

  • It seems that FakeAwsDriver isn’t used on an integration test, missing commit?

I was using it for manual testing. It really just reports node sizes that look like ec2 nodes instead of the default (which look like Azure node sizes).

Now at 58a7e4f0854e392de979c531ad397bb508a77779
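
The sizing steps discussed in this exchange (integer coercion, +1 to round up, and capping to the gp2 range) can be sketched as follows. This is not the code from ec2.py: the function name, the units_per_gb parameter, and the choice of units are assumptions made for illustration.

```python
# gp2 volume size limits in GB, as quoted from the AWS documentation in this thread.
GP2_MIN_GB = 1
GP2_MAX_GB = 16384

def ebs_volume_size_gb(scratch, instance_disk, units_per_gb=1):
    """Return the EBS volume size in GB needed to cover the scratch shortfall.

    Hypothetical helper. scratch and instance_disk share one unit;
    units_per_gb converts that unit to GB (for example 1000 if the
    inputs are in MB). Returns 0 when no extra volume is needed.
    """
    shortfall = scratch - instance_disk
    if shortfall <= 0:
        return 0
    # Truncate to an int, then add 1 so a fractional GB is never lost.
    size_gb = int(shortfall / units_per_gb) + 1
    # Keep the request inside the range gp2 accepts.
    return max(GP2_MIN_GB, min(GP2_MAX_GB, size_gb))
```

Note that truncating and adding 1 over-allocates by 1 GB when the shortfall is an exact multiple of a GB; it trades a small amount of extra space for simplicity.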

#12 Updated by Lucas Di Pentima almost 5 years ago

Just a couple of details:

  • Could you add a comment regarding EBS hardcoded limits? Maybe in the future that changes.
  • If we're accepting a request with more storage than we can provide, should we log a warning message?
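
Both suggestions could look roughly like the sketch below, which documents the limit in a comment and logs a warning when a request is capped. The function name cap_volume_size and the message wording are hypothetical; only the 16384 GB figure comes from this thread.

```python
import logging

# Hardcoded upper bound for gp2 volumes, in GB, per the AWS documentation
# cited in this thread. AWS may change this limit in the future.
GP2_MAX_GB = 16384

def cap_volume_size(requested_gb, logger):
    """Cap an EBS size request at the gp2 maximum, warning if it was reduced."""
    if requested_gb > GP2_MAX_GB:
        logger.warning(
            "Requested EBS volume size %d GB is too large, capping to %d GB",
            requested_gb, GP2_MAX_GB)
        return GP2_MAX_GB
    return requested_gb
```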

Running the services/nodemanager tests locally, one test fails:

======================================================================
ERROR: test_arvados_node_not_cleaned_after_shutdown_cancelled (tests.test_computenode_dispatch_slurm.SLURMComputeNodeShutdownActorTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/lucas/arvados_local/tmp/VENVDIR/local/lib/python2.7/site-packages/mock/mock.py", line 1305, in patched
    return func(*args, **keywargs)
  File "/home/lucas/arvados_local/services/nodemanager/tests/test_computenode_dispatch.py", line 241, in test_arvados_node_not_cleaned_after_shutdown_cancelled
    self.check_success_flag(False, 2)
  File "/home/lucas/arvados_local/services/nodemanager/tests/test_computenode_dispatch.py", line 197, in check_success_flag
    last_flag = self.shutdown_actor.success.get(self.TIMEOUT)
  File "/home/lucas/arvados_local/tmp/VENVDIR/local/lib/python2.7/site-packages/pykka/threading.py", line 52, in get
    compat.reraise(*self._data['exc_info'])
  File "/home/lucas/arvados_local/tmp/VENVDIR/local/lib/python2.7/site-packages/pykka/compat.py", line 12, in reraise
    exec('raise tp, value, tb')
  File "/home/lucas/arvados_local/tmp/VENVDIR/local/lib/python2.7/site-packages/pykka/actor.py", line 431, in ask
    self.tell(message)
  File "/home/lucas/arvados_local/tmp/VENVDIR/local/lib/python2.7/site-packages/pykka/actor.py", line 398, in tell
    raise ActorDeadError('%s not found' % self)
ActorDeadError: ComputeNodeShutdownActor (urn:uuid:d7382f42-a9d0-47ec-b5b1-8ee97ccb8255) not found

The rest LGTM. Thanks.

#13 Updated by Peter Amstutz almost 5 years ago

  • Status changed from New to Resolved
  • % Done changed from 50 to 100

Applied in changeset arvados|commit:f054bc3d7d3d26962e62c2ea7c27214b08e85bb6.
