Feature #14324

[crunch-dispatch-cloud] Azure driver

Added by Tom Clegg 7 months ago. Updated 3 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
01/09/2019
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
3.0
Release relationship:
Auto


Subtasks

Task #14661: Review 14324-cdc-azureResolvedPeter Amstutz


Related issues

Related to Arvados - Feature #14325: [crunch-dispatch-cloud] Dispatch containers to cloud VMs directly, without slurm or nodemanagerResolved01/28/2019

Related to Arvados - Story #13908: Replace SLURM for cloud job scheduling/dispatchingNew

Associated revisions

Revision e62dc0ff
Added by Peter Amstutz 4 months ago

Merge branch '14324-cdc-azure' refs #14324

Arvados-DCO-1.1-Signed-off-by: Peter Amstutz <>

History

#1 Updated by Tom Clegg 7 months ago

  • Related to Feature #14325: [crunch-dispatch-cloud] Dispatch containers to cloud VMs directly, without slurm or nodemanager added

#2 Updated by Tom Clegg 7 months ago

  • Tracker changed from Bug to Feature

#3 Updated by Peter Amstutz 7 months ago

previous work:

13964-cdc-azure @ 2e609dce27fc6b9f39857cf7b0a9904719d2d526

#4 Updated by Tom Clegg 7 months ago

  • Description updated (diff)

#5 Updated by Tom Morris 7 months ago

  • Target version set to Arvados Future Sprints
  • Story points set to 3.0

Time boxed with 2 pts of tests.

#6 Updated by Tom Clegg 5 months ago

  • Related to Story #13908: Replace SLURM for cloud job scheduling/dispatching added

#7 Updated by Tom Morris 5 months ago

  • Target version changed from Arvados Future Sprints to 2019-01-16 Sprint

#8 Updated by Tom Morris 5 months ago

  • Assigned To set to Peter Amstutz

#10 Updated by Peter Amstutz 4 months ago

A few things I realized I still need to do:

  • Testing lib/dispatchcloud doesn't run tests in sub-directories, so I overlooked a test that's failing to compile
  • Document how to run individual test cases against real cloud
  • ManageBlobs should run periodically in the background
  • Logging probably needs to be cleaned up
  • Determine if the underlying SDK handles refreshing expired tokens automatically or if our code needs to do it (needed to do this in libcloud)

#11 Updated by Peter Amstutz 4 months ago

Peter Amstutz wrote:

A few things I realized I still need to do:

  • Testing lib/dispatchcloud doesn't run tests in sub-directories, so I overlooked a test that's failing to compile

Fixed

  • Document how to run individual test cases against real cloud

Done

  • ManageBlobs should run periodically in the background

Done

  • Determine if the underlying SDK handles refreshing expired tokens automatically or if our code needs to do it (needed to do this in libcloud)

It appears this is handled by the SDK.

  • Logging probably needs to be cleaned up

Getting rid of spurious development logging, but need to decide what to do about background tasks that can experience errors

#12 Updated by Peter Amstutz 4 months ago

#14 Updated by Lucas Di Pentima 4 months ago

Reviewing f398e9d75d32fbfa6804041353997bf74195e489

  • Didn’t run the driver tests against the real cloud, as I don’t have an account on Azure. Should I ask ops for some credentials? OTOH, running them without credentials returns [no tests to run], I think it’s expected to run anyways, right?
  • On file cloud/azure.go:
    • Line 162: Are there other error statuses that would require throttling?
    • Line 169: Should that conditional need to be reversed?
    • Lines 173-177: Should that be an else block on line 171?
    • Line 258: Shouldn’t ManageBlobs() be called one more time before returning?
    • Line 268: Are these 4 iterations a way to add 4 attempts to NIC & blob cleanup? Maybe this value should be configurable?

#15 Updated by Peter Amstutz 4 months ago

Lucas Di Pentima wrote:

Reviewing f398e9d75d32fbfa6804041353997bf74195e489

  • Didn’t run the driver tests against the real cloud, as I don’t have an account on Azure. Should I ask ops for some credentials? OTOH, running them without credentials returns [no tests to run], I think it’s expected to run anyways, right?

There was a missing file that called check.TestingT(t). Fixed.

  • On file cloud/azure.go:
    • Line 162: Are there other error statuses that would require throttling?

I don't know. Node manager goes into a backoff-and-retry loop with error codes >= 500, but in crunch-dispatch-cloud the scheduler handles retries and doesn't do any backoff. We could react to 500 errors as "wait and try again in 20 seconds".

  • Line 169: Should that conditional need to be reversed?
  • Lines 173-177: Should that be an else block on line 171?

Yes, good catch.

  • Line 258: Shouldn’t ManageBlobs() be called one more time before returning?

Not sure. Depends on whether it is more desirable that when context is Done (because Stop() was called) it should terminate right away, or that it maximizes the opportunity to garbage collect blobs.

  • Line 268: Are these 4 iterations a way to add 4 attempts to NIC & blob cleanup? Maybe this value should be configurable?

It is parallelizing to 4 threads for the case where it is deleting a batch of NICs or blobs and individual operations have latency. If an individual operation fails, it will get retried on next call to ManageNICs or ManageBlobs.

14324-cdc-azure @ 6756ee790a6c685b96bc1ec80b67e7d6c94e9846

#16 Updated by Peter Amstutz 4 months ago

  • Status changed from New to In Progress

#17 Updated by Lucas Di Pentima 4 months ago

I think there's a file missing:

----------------------------------------------------------------------
FAIL: azure_test.go:137: AzureInstanceSetSuite.TestCreate

azure_test.go:144:
    c.Assert(err, check.IsNil)
... value *os.PathError = &os.PathError{Op:"open", Path:"azconfig_sshkey.pub", Err:0x2} ("open azconfig_sshkey.pub: no such file or directory")

time="2019-01-15T09:36:29-05:00" level=warning msg="Couldn't get account keys" error="storage.AccountsClient#ListKeys: Invalid input: autorest/validation: validation failed: parameter=resourceGroupName constraint=MinLength value=\"\" details: value length must be greater than or equal to 1" 
OOPS: 8 passed, 1 FAILED
--- FAIL: Test (0.00s)

#18 Updated by Peter Amstutz 4 months ago

Lucas Di Pentima wrote:

I think there's a file missing:

[...]

Ah yea, fixed. 14324-cdc-azure @ a27348943c6e35eee983d6a333eb6977394c3f13

#19 Updated by Lucas Di Pentima 4 months ago

This LGTM, thanks!

#20 Updated by Peter Amstutz 4 months ago

  • Status changed from In Progress to Resolved

#21 Updated by Tom Morris 3 months ago

  • Release set to 15

Also available in: Atom PDF