Feature #8186

closed

[Node Manager] Support (ephemeral) EBS storage for AWS node types that do not have instance storage, like the M4/C4 classes.

Added by Ward Vandewege about 9 years ago. Updated over 7 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Story points: 1.0

Description

Currently node manager only distinguishes between cloud instance types. Enable the admin to specify the amount of additional storage for specific instance types on AWS.

[Size m4.large]
cores = 2
scratch = 500

Implementation:

Determine how much instance storage is available by default for the node type. If additional space is needed, attach an EBS device.

This is configured via the ex_blockdevicemappings argument to libcloud's create_node() and is documented at https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_BlockDeviceMapping.html

Disks should be VolumeType: 'gp2' (General Purpose SSD), have DeleteOnTermination: true, and specify a VolumeSize: that makes up the difference between instance storage (if any) and the required space.

The compute node boot scripts are expected to discover both instance and EBS storage devices and combine them into a single logical partition / file system. In the above example, after boot time configuration the resulting node should have a single 500 GB file system for scratch space.
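
The sizing rule above (the EBS volume makes up the difference between instance storage and the requested scratch space) can be sketched as follows. This is only an illustration: the function name `ebs_size_needed` is hypothetical, and it assumes both values are expressed in GB, as in the m4.large example.

```python
def ebs_size_needed(scratch_gb, instance_storage_gb):
    """Return the size in GB of the EBS volume to attach, or 0 if none is needed.

    Hypothetical helper: both arguments are assumed to be in GB.
    """
    # Only the shortfall is provisioned; never return a negative size.
    return max(0, scratch_gb - instance_storage_gb)


# m4.large has no instance storage, so the full 500 GB comes from EBS.
print(ebs_size_needed(500, 0))
# A node with 4 GB of local disk and a 20 GB scratch request needs 16 GB.
print(ebs_size_needed(20, 4))
# Enough instance storage already: no EBS volume is attached.
print(ebs_size_needed(80, 160))
```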


Subtasks 2 (0 open, 2 closed)

Task #11853: Ensure that libcloud 0.20.2dev3 is installed on all clusters (Resolved, Nico César, 06/16/2017)
Task #11828: Review 8186-nodemanager-ebs (Resolved, Peter Amstutz, 06/14/2017)

Related issues 1 (0 open, 1 closed)

Copied to Arvados - Feature #10183: [Node Manager] Support ephemeral additional storage for Azure node types that do not have sufficient instance storage (Closed, 10/04/2016)
Actions #1

Updated by Brett Smith about 9 years ago

  • Target version set to Arvados Future Sprints
Actions #2

Updated by Peter Amstutz over 7 years ago

  • Description updated (diff)
Actions #3

Updated by Tom Morris over 7 years ago

  • Target version changed from Arvados Future Sprints to 2017-06-21 sprint
Actions #4

Updated by Tom Morris over 7 years ago

  • Target version changed from 2017-06-21 sprint to 2017-07-05 sprint
Actions #5

Updated by Peter Amstutz over 7 years ago

  • Description updated (diff)
Actions #6

Updated by Peter Amstutz over 7 years ago

  • Target version changed from 2017-07-05 sprint to 2017-06-21 sprint
  • Story points set to 1.0
Actions #7

Updated by Peter Amstutz over 7 years ago

  • Assigned To set to Peter Amstutz
Actions #8

Updated by Peter Amstutz over 7 years ago

2017-06-12 19:27:01 ComputeNodeMonitorActor.3be867299275.dynamic.compute.4xphq.arvadosapi.com[12399] DEBUG: Not eligible for shut down because node state is ('unpaired', 'closed', 'boot wait', 'idle exceeded')
2017-06-12 19:27:01 ComputeNodeSetupActor.d5fe26afe32c[12399] INFO: Sending create_node request for node size Medium Instance.
scratch, size 20 4
kw {'ex_userdata': 'https://4xphq.arvadosapi.com/arvados/v1/nodes/4xphq-7ekkf-hp20ntfzt46cuo7/ping?ping_secret=3x5mf7ig73ydwngh2nw6j8xvynffo2us5zuesoqe7kzfeaq4dm', 'ex_blockdevicemappings': [{'Ebs': {'DeleteOnTermination': True, 'VolumeType': 'gp2', 'VolumeSize': 16}}], 'name': 'testing2.4xphq.arvadosapi.com'}
2017-06-12 19:27:02 ComputeNodeSetupActor.d5fe26afe32c[12399] WARNING: Re-raising error (no retry): InvalidBlockDeviceMapping: Missing device name
Traceback (most recent call last):
  File "/home/tetron/arvados/services/nodemanager/arvnodeman/computenode/__init__.py", line 78, in retry_wrapper
    ret = orig_func(self, *args, **kwargs)
  File "/home/tetron/arvados/services/nodemanager/arvnodeman/computenode/dispatch/__init__.py", line 133, in create_cloud_node
    self.arvados_node)
  File "/home/tetron/arvados/services/nodemanager/arvnodeman/computenode/driver/__init__.py", line 181, in create_node
    raise create_error
BaseHTTPError: InvalidBlockDeviceMapping: Missing device name
2017-06-12 19:27:02 ComputeNodeSetupActor.d5fe26afe32c[12399] ERROR: Actor error InvalidBlockDeviceMapping: Missing device name
2017-06-12 19:27:02 ComputeNodeSetupActor.d5fe26afe32c[12399] INFO: finished
Actions #9

Updated by Peter Amstutz over 7 years ago

Fixed, set scratch space block device to /dev/xvdt
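
For reference, the request in the log above failed because the mapping had an `Ebs` entry but no `DeviceName`. A minimal sketch of a mapping that includes the device name might look like this. The helper name `scratch_block_device_mapping` is hypothetical and is not taken from the node manager source; the keys and values come from the log and from the fix described in this comment.

```python
SCRATCH_DEVICE = "/dev/xvdt"  # device name chosen in the fix above


def scratch_block_device_mapping(volume_size_gb, device_name=SCRATCH_DEVICE):
    """Build an ex_blockdevicemappings value for one gp2 scratch volume.

    Hypothetical helper for illustration only.
    """
    return [{
        # EC2 rejects the mapping with "Missing device name" without this key.
        "DeviceName": device_name,
        "Ebs": {
            "DeleteOnTermination": True,
            "VolumeType": "gp2",
            "VolumeSize": volume_size_gb,
        },
    }]


# Keyword arguments in the shape shown in the log (ex_userdata omitted here).
kw = {
    "name": "testing2.4xphq.arvadosapi.com",
    "ex_blockdevicemappings": scratch_block_device_mapping(16),
}
```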

Actions #10

Updated by Lucas Di Pentima over 7 years ago

  • Several tests are failing with this message: AttributeError: 'MockSize' object has no attribute 'scratch'
  • File services/nodemanager/arvnodeman/computenode/driver/ec2.py
    • Line 73: Is Arvados/SLURM scratch value always an int? Or would it be convenient to force that division to be an int?
    • Line 79: gp2 Ebs sizes go from 1 to 16384 (as per the documentation), should we cap the requested size between these values?
  • It seems that FakeAwsDriver isn’t used on an integration test, missing commit?
Actions #11

Updated by Peter Amstutz over 7 years ago

Lucas Di Pentima wrote:

  • Several tests are failing with this message: AttributeError: 'MockSize' object has no attribute 'scratch'

Fixed.

  • File services/nodemanager/arvnodeman/computenode/driver/ec2.py
    • Line 73: Is Arvados/SLURM scratch value always an int? Or would it be convenient to force that division to be an int?

I coerced it to int() and also added +1 to round up.

  • Line 79: gp2 Ebs sizes go from 1 to 16384 (as per the documentation), should we cap the requested size between these values?

Done.

  • It seems that FakeAwsDriver isn’t used on an integration test, missing commit?

I was using it for manual testing. It really just reports node sizes that look like ec2 nodes instead of the default (which look like Azure node sizes).

Now at 58a7e4f0854e392de979c531ad397bb508a77779
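
The two changes discussed in this reply (integer coercion with a +1 round-up, and capping the size to the gp2 range of 1 to 16384) could be combined roughly as below. This is a sketch, not the committed code: the function name is hypothetical, and it assumes the scratch value arrives in MB while the instance disk and the EBS volume size are in GB.

```python
GP2_MIN_GB = 1       # gp2 size limits quoted in the review above
GP2_MAX_GB = 16384


def capped_ebs_volume_gb(scratch_mb, instance_disk_gb):
    """Return the gp2 volume size in GB to request, or 0 if none is needed.

    Hypothetical helper; the MB-to-GB unit conversion is an assumption.
    """
    # Coerce to int and add 1 so a fractional GB is never rounded away.
    scratch_gb = int(scratch_mb / 1000) + 1
    if scratch_gb <= instance_disk_gb:
        return 0
    shortfall = scratch_gb - instance_disk_gb
    # Keep the request inside the range that gp2 volumes accept.
    return max(GP2_MIN_GB, min(shortfall, GP2_MAX_GB))
```

With these assumptions, a request larger than the cap is silently reduced to 16384 GB, so the node ends up with less scratch space than requested.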

Actions #12

Updated by Lucas Di Pentima over 7 years ago

Just a couple of details:

  • Could you add a comment regarding EBS hardcoded limits? Maybe in the future that changes.
  • If we're accepting a request with more storage than we can provide, should we log a warning message?

Running service/nodemanager tests locally, one test fails:

======================================================================
ERROR: test_arvados_node_not_cleaned_after_shutdown_cancelled (tests.test_computenode_dispatch_slurm.SLURMComputeNodeShutdownActorTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/lucas/arvados_local/tmp/VENVDIR/local/lib/python2.7/site-packages/mock/mock.py", line 1305, in patched
    return func(*args, **keywargs)
  File "/home/lucas/arvados_local/services/nodemanager/tests/test_computenode_dispatch.py", line 241, in test_arvados_node_not_cleaned_after_shutdown_cancelled
    self.check_success_flag(False, 2)
  File "/home/lucas/arvados_local/services/nodemanager/tests/test_computenode_dispatch.py", line 197, in check_success_flag
    last_flag = self.shutdown_actor.success.get(self.TIMEOUT)
  File "/home/lucas/arvados_local/tmp/VENVDIR/local/lib/python2.7/site-packages/pykka/threading.py", line 52, in get
    compat.reraise(*self._data['exc_info'])
  File "/home/lucas/arvados_local/tmp/VENVDIR/local/lib/python2.7/site-packages/pykka/compat.py", line 12, in reraise
    exec('raise tp, value, tb')
  File "/home/lucas/arvados_local/tmp/VENVDIR/local/lib/python2.7/site-packages/pykka/actor.py", line 431, in ask
    self.tell(message)
  File "/home/lucas/arvados_local/tmp/VENVDIR/local/lib/python2.7/site-packages/pykka/actor.py", line 398, in tell
    raise ActorDeadError('%s not found' % self)
ActorDeadError: ComputeNodeShutdownActor (urn:uuid:d7382f42-a9d0-47ec-b5b1-8ee97ccb8255) not found

The rest LGTM. Thanks.

Actions #13

Updated by Peter Amstutz over 7 years ago

  • Status changed from New to Resolved
  • % Done changed from 50 to 100

Applied in changeset arvados|commit:f054bc3d7d3d26962e62c2ea7c27214b08e85bb6.
