Feature #14324
closed[crunch-dispatch-cloud] Azure driver
Description
Implement crunch-dispatch-cloud's cloud.Driver interface using Azure client library.
SeeUpdated by Tom Clegg over 6 years ago
- Related to Feature #14325: [crunch-dispatch-cloud] Dispatch containers to cloud VMs directly, without slurm or nodemanager added
Updated by Peter Amstutz over 6 years ago
previous work:
13964-cdc-azure @ 2e609dce27fc6b9f39857cf7b0a9904719d2d526
Updated by Tom Morris over 6 years ago
- Target version set to Arvados Future Sprints
- Story points set to 3.0
Time boxed with 2 pts of tests.
Updated by Tom Clegg about 6 years ago
- Related to Idea #13908: [Epic] Replace SLURM for cloud job scheduling/dispatching added
Updated by Tom Morris about 6 years ago
- Target version changed from Arvados Future Sprints to 2019-01-16 Sprint
Updated by Peter Amstutz about 6 years ago
Updated by Peter Amstutz about 6 years ago
A few things I realized I still need to do:
- Testing lib/dispatchcloud doesn't run tests in sub-directories, so I overlooked a test that's failing to compile
- Document how to run individual test cases against real cloud
- ManageBlobs should run periodically in the background
- Logging probably needs to be cleaned up
- Determine if the underlying SDK handles refreshing expired tokens automatically or if our code needs to do it (needed to do this in libcloud)
Updated by Peter Amstutz about 6 years ago
Peter Amstutz wrote:
A few things I realized I still need to do:
- Testing lib/dispatchcloud doesn't run tests in sub-directories, so I overlooked a test that's failing to compile
Fixed
- Document how to run individual test cases against real cloud
Done
- ManageBlobs should run periodically in the background
Done
- Determine if the underlying SDK handles refreshing expired tokens automatically or if our code needs to do it (needed to do this in libcloud)
It appears this is handled by the SDK.
- Logging probably needs to be cleaned up
Getting rid of spurious development logging, but need to decide what to do about background tasks that can experience errors
Updated by Peter Amstutz about 6 years ago
14324-cdc-azure @ 443a0b96316ed46600dc5035193adae6ac4d1f74
https://ci.curoverse.com/view/Developer/job/developer-run-tests/1027/
Sorted out above issues, ready for review comments.
Updated by Peter Amstutz about 6 years ago
How about 14324-cdc-azure @ f398e9d75d32fbfa6804041353997bf74195e489
https://ci.curoverse.com/view/Developer/job/developer-run-tests/1031/
Updated by Lucas Di Pentima about 6 years ago
Reviewing f398e9d75d32fbfa6804041353997bf74195e489
- Didn’t run the driver tests against the real cloud, as I don’t have an account on Azure. Should I ask ops for some credentials? OTOH, running them without credentials returns
[no tests to run]
, I think it’s expected to run anyways, right? - On file
cloud/azure.go
:- Line 162: Are there other error statuses that would require throttling?
- Line 169: Should that conditional need to be reversed?
- Lines 173-177: Should that be an
else
block on line 171? - Line 258: Shouldn’t
ManageBlobs()
be called one more time before returning? - Line 268: Are these 4 iterations a way to add 4 attempts to NIC & blob cleanup? Maybe this value should be configurable?
Updated by Peter Amstutz about 6 years ago
Lucas Di Pentima wrote:
Reviewing f398e9d75d32fbfa6804041353997bf74195e489
- Didn’t run the driver tests against the real cloud, as I don’t have an account on Azure. Should I ask ops for some credentials? OTOH, running them without credentials returns
[no tests to run]
, I think it’s expected to run anyways, right?
There was a missing file that called check.TestingT(t)
. Fixed.
- On file
cloud/azure.go
:
- Line 162: Are there other error statuses that would require throttling?
I don't know. Node manager goes into a backoff-and-retry loop with error codes >= 500, but in crunch-dispatch-cloud the scheduler handles retries and doesn't do any backoff. We could react to 500 errors as "wait and try again in 20 seconds".
- Line 169: Should that conditional need to be reversed?
- Lines 173-177: Should that be an
else
block on line 171?
Yes, good catch.
- Line 258: Shouldn’t
ManageBlobs()
be called one more time before returning?
Not sure. Depends on whether it is more desirable that when context is Done (because Stop() was called) it should terminate right away, or that it maximizes the opportunity to garbage collect blobs.
- Line 268: Are these 4 iterations a way to add 4 attempts to NIC & blob cleanup? Maybe this value should be configurable?
It is parallelizing to 4 threads for the case where it is deleting a batch of NICs or blobs and individual operations have latency. If an individual operation fails, it will get retried on next call to ManageNICs or ManageBlobs.
14324-cdc-azure @ 6756ee790a6c685b96bc1ec80b67e7d6c94e9846
Updated by Peter Amstutz about 6 years ago
- Status changed from New to In Progress
Updated by Lucas Di Pentima about 6 years ago
I think there's a file missing:
---------------------------------------------------------------------- FAIL: azure_test.go:137: AzureInstanceSetSuite.TestCreate azure_test.go:144: c.Assert(err, check.IsNil) ... value *os.PathError = &os.PathError{Op:"open", Path:"azconfig_sshkey.pub", Err:0x2} ("open azconfig_sshkey.pub: no such file or directory") time="2019-01-15T09:36:29-05:00" level=warning msg="Couldn't get account keys" error="storage.AccountsClient#ListKeys: Invalid input: autorest/validation: validation failed: parameter=resourceGroupName constraint=MinLength value=\"\" details: value length must be greater than or equal to 1" OOPS: 8 passed, 1 FAILED --- FAIL: Test (0.00s)
Updated by Peter Amstutz almost 6 years ago
Lucas Di Pentima wrote:
I think there's a file missing:
[...]
Ah yea, fixed. 14324-cdc-azure @ a27348943c6e35eee983d6a333eb6977394c3f13
Updated by Peter Amstutz almost 6 years ago
- Status changed from In Progress to Resolved