Project

General

Profile

Actions

Feature #14324

closed

[crunch-dispatch-cloud] Azure driver

Added by Tom Clegg over 6 years ago. Updated almost 6 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Story points:
3.0
Release relationship:
Auto


Subtasks 1 (0 open1 closed)

Task #14661: Review 14324-cdc-azureResolvedPeter Amstutz01/09/2019Actions

Related issues 2 (0 open2 closed)

Related to Arvados - Feature #14325: [crunch-dispatch-cloud] Dispatch containers to cloud VMs directly, without slurm or nodemanagerResolvedTom Clegg01/28/2019Actions
Related to Arvados - Idea #13908: [Epic] Replace SLURM for cloud job scheduling/dispatchingResolvedActions
Actions #1

Updated by Tom Clegg over 6 years ago

  • Related to Feature #14325: [crunch-dispatch-cloud] Dispatch containers to cloud VMs directly, without slurm or nodemanager added
Actions #2

Updated by Tom Clegg over 6 years ago

  • Tracker changed from Bug to Feature
Actions #3

Updated by Peter Amstutz over 6 years ago

previous work:

13964-cdc-azure @ 2e609dce27fc6b9f39857cf7b0a9904719d2d526

Actions #4

Updated by Tom Clegg over 6 years ago

  • Description updated (diff)
Actions #5

Updated by Tom Morris over 6 years ago

  • Target version set to Arvados Future Sprints
  • Story points set to 3.0

Time boxed with 2 pts of tests.

Actions #6

Updated by Tom Clegg about 6 years ago

  • Related to Idea #13908: [Epic] Replace SLURM for cloud job scheduling/dispatching added
Actions #7

Updated by Tom Morris about 6 years ago

  • Target version changed from Arvados Future Sprints to 2019-01-16 Sprint
Actions #8

Updated by Tom Morris about 6 years ago

  • Assigned To set to Peter Amstutz
Actions #10

Updated by Peter Amstutz about 6 years ago

A few things I realized I still need to do:

  • Testing lib/dispatchcloud doesn't run tests in sub-directories, so I overlooked a test that's failing to compile
  • Document how to run individual test cases against real cloud
  • ManageBlobs should run periodically in the background
  • Logging probably needs to be cleaned up
  • Determine if the underlying SDK handles refreshing expired tokens automatically or if our code needs to do it (needed to do this in libcloud)
Actions #11

Updated by Peter Amstutz about 6 years ago

Peter Amstutz wrote:

A few things I realized I still need to do:

  • Testing lib/dispatchcloud doesn't run tests in sub-directories, so I overlooked a test that's failing to compile

Fixed

  • Document how to run individual test cases against real cloud

Done

  • ManageBlobs should run periodically in the background

Done

  • Determine if the underlying SDK handles refreshing expired tokens automatically or if our code needs to do it (needed to do this in libcloud)

It appears this is handled by the SDK.

  • Logging probably needs to be cleaned up

Getting rid of spurious development logging, but need to decide what to do about background tasks that can experience errors

Actions #12

Updated by Peter Amstutz about 6 years ago

Actions #14

Updated by Lucas Di Pentima about 6 years ago

Reviewing f398e9d75d32fbfa6804041353997bf74195e489

  • Didn’t run the driver tests against the real cloud, as I don’t have an account on Azure. Should I ask ops for some credentials? OTOH, running them without credentials returns [no tests to run], I think it’s expected to run anyways, right?
  • On file cloud/azure.go:
    • Line 162: Are there other error statuses that would require throttling?
    • Line 169: Should that conditional need to be reversed?
    • Lines 173-177: Should that be an else block on line 171?
    • Line 258: Shouldn’t ManageBlobs() be called one more time before returning?
    • Line 268: Are these 4 iterations a way to add 4 attempts to NIC & blob cleanup? Maybe this value should be configurable?
Actions #15

Updated by Peter Amstutz about 6 years ago

Lucas Di Pentima wrote:

Reviewing f398e9d75d32fbfa6804041353997bf74195e489

  • Didn’t run the driver tests against the real cloud, as I don’t have an account on Azure. Should I ask ops for some credentials? OTOH, running them without credentials returns [no tests to run], I think it’s expected to run anyways, right?

There was a missing file that called check.TestingT(t). Fixed.

  • On file cloud/azure.go:
    • Line 162: Are there other error statuses that would require throttling?

I don't know. Node manager goes into a backoff-and-retry loop with error codes >= 500, but in crunch-dispatch-cloud the scheduler handles retries and doesn't do any backoff. We could react to 500 errors as "wait and try again in 20 seconds".

  • Line 169: Should that conditional need to be reversed?
  • Lines 173-177: Should that be an else block on line 171?

Yes, good catch.

  • Line 258: Shouldn’t ManageBlobs() be called one more time before returning?

Not sure. Depends on whether it is more desirable that when context is Done (because Stop() was called) it should terminate right away, or that it maximizes the opportunity to garbage collect blobs.

  • Line 268: Are these 4 iterations a way to add 4 attempts to NIC & blob cleanup? Maybe this value should be configurable?

It is parallelizing to 4 threads for the case where it is deleting a batch of NICs or blobs and individual operations have latency. If an individual operation fails, it will get retried on next call to ManageNICs or ManageBlobs.

14324-cdc-azure @ 6756ee790a6c685b96bc1ec80b67e7d6c94e9846

Actions #16

Updated by Peter Amstutz about 6 years ago

  • Status changed from New to In Progress
Actions #17

Updated by Lucas Di Pentima almost 6 years ago

I think there's a file missing:

----------------------------------------------------------------------
FAIL: azure_test.go:137: AzureInstanceSetSuite.TestCreate

azure_test.go:144:
    c.Assert(err, check.IsNil)
... value *os.PathError = &os.PathError{Op:"open", Path:"azconfig_sshkey.pub", Err:0x2} ("open azconfig_sshkey.pub: no such file or directory")

time="2019-01-15T09:36:29-05:00" level=warning msg="Couldn't get account keys" error="storage.AccountsClient#ListKeys: Invalid input: autorest/validation: validation failed: parameter=resourceGroupName constraint=MinLength value=\"\" details: value length must be greater than or equal to 1" 
OOPS: 8 passed, 1 FAILED
--- FAIL: Test (0.00s)
Actions #18

Updated by Peter Amstutz almost 6 years ago

Lucas Di Pentima wrote:

I think there's a file missing:

[...]

Ah yea, fixed. 14324-cdc-azure @ a27348943c6e35eee983d6a333eb6977394c3f13

Actions #19

Updated by Lucas Di Pentima almost 6 years ago

This LGTM, thanks!

Actions #20

Updated by Peter Amstutz almost 6 years ago

  • Status changed from In Progress to Resolved
Actions #21

Updated by Tom Morris almost 6 years ago

  • Release set to 15
Actions

Also available in: Atom PDF