Bug #14920
closed
[crunch-dispatch-cloud] New Azure instances always have state=unknown instead of state=booting
Description
Currently, when arvados-dispatch-cloud (a-d-c) uses the Azure driver, new instances have state=unknown (instead of the expected state=booting) until the boot/run probes pass.
The "unknown" state is intended to cover the case where the "list instances" call returns a previously unseen instance ID. In the Azure case, the "create VM" call does not even return the ID of the newly created instance until the instance has finished booting, so until then, the dispatcher's worker pool doesn't recognize that it corresponds to an outstanding "create" call.
Some different ways to address this:
- In the Azure driver, return as soon as the instance ID is known, instead of waiting for it to boot. This is how the driver is expected to work, but the Azure client library might not make it easy.
- In the worker pool, when an unexpected instance ID appears, check whether its "secret token" tag matches an outstanding Create call. This would also cover the "list returns before create" race, which applies to all drivers.
The second option seems better.
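For illustration, the second option could look roughly like the sketch below. This is not the dispatcher's actual code: the `Pool` type, its `StartCreate`/`FinishCreate`/`Sync` methods, and the `InstanceSecret` tag name are all hypothetical stand-ins. The idea is only that the pool remembers the secret it handed to each outstanding Create call, and when "list instances" reports an unseen ID, it compares that instance's secret tag against the outstanding set before falling back to "unknown".

```go
package main

import "fmt"

// All names below are hypothetical; they only illustrate the matching logic.
type InstanceID string

type Instance struct {
	ID   InstanceID
	Tags map[string]string
}

type State string

const (
	StateUnknown State = "unknown"
	StateBooting State = "booting"
)

// tagKeySecret is an assumed tag name for the per-instance secret token.
const tagKeySecret = "InstanceSecret"

// Pool tracks known instances and the secrets of Create calls in flight.
type Pool struct {
	creating map[string]bool // secrets of Create calls that have not returned yet
	state    map[InstanceID]State
}

func NewPool() *Pool {
	return &Pool{creating: map[string]bool{}, state: map[InstanceID]State{}}
}

// StartCreate records the secret just before the driver's Create is called.
func (p *Pool) StartCreate(secret string) {
	p.creating[secret] = true
}

// FinishCreate is called when the driver's Create returns the new ID.
func (p *Pool) FinishCreate(secret string, id InstanceID) {
	delete(p.creating, secret)
	if _, seen := p.state[id]; !seen {
		p.state[id] = StateBooting
	}
}

// Sync processes the result of a "list instances" call. An unseen instance
// whose secret tag matches an outstanding Create is treated as booting;
// any other unseen instance is treated as unknown.
func (p *Pool) Sync(instances []Instance) {
	for _, inst := range instances {
		if _, seen := p.state[inst.ID]; seen {
			continue
		}
		if secret := inst.Tags[tagKeySecret]; secret != "" && p.creating[secret] {
			p.state[inst.ID] = StateBooting
		} else {
			p.state[inst.ID] = StateUnknown
		}
	}
}

func main() {
	p := NewPool()
	p.StartCreate("s3cr3t") // Create call is in flight, ID not known yet
	p.Sync([]Instance{
		{ID: "vm-1", Tags: map[string]string{tagKeySecret: "s3cr3t"}},
		{ID: "vm-2", Tags: map[string]string{tagKeySecret: "other"}},
	})
	fmt.Println(p.state["vm-1"], p.state["vm-2"])
}
```

Because the match happens in `Sync`, the same path handles both the slow Azure "create VM" call and the "list returns before create" race on any driver.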
It would also be worth documenting the expected driver behavior in the driver interface definition: Create() should generally return as soon as the new instance's ID is known, but must not return so early that a subsequent call to Instances() might not include the new instance.
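As a rough idea of what that documentation might look like, here is a simplified, hypothetical interface with the contract written into the doc comment. The `InstanceSet` shape, the `fakeSet` driver, and the `createdIsListed` helper are assumptions made for this sketch, not the real driver interface.

```go
package main

import (
	"fmt"
	"sync"
)

type InstanceID string

// InstanceSet is a hypothetical, simplified driver interface.
type InstanceSet interface {
	// Create starts a new instance tagged with the given secret.
	//
	// Create should generally return as soon as the new instance's ID
	// is known, without waiting for the instance to boot. It must not
	// return so early that a subsequent call to Instances() might not
	// include the new instance.
	Create(secret string) (InstanceID, error)

	// Instances returns the IDs of all instances, including any whose
	// Create call has already returned.
	Instances() ([]InstanceID, error)
}

// fakeSet is an in-memory driver that satisfies the contract trivially.
type fakeSet struct {
	mtx sync.Mutex
	ids []InstanceID
}

func (s *fakeSet) Create(secret string) (InstanceID, error) {
	s.mtx.Lock()
	defer s.mtx.Unlock()
	id := InstanceID(fmt.Sprintf("fake-%d", len(s.ids)+1))
	s.ids = append(s.ids, id)
	return id, nil
}

func (s *fakeSet) Instances() ([]InstanceID, error) {
	s.mtx.Lock()
	defer s.mtx.Unlock()
	return append([]InstanceID(nil), s.ids...), nil
}

// createdIsListed creates one instance and reports whether an immediate
// Instances() call includes it, which is the second half of the contract.
func createdIsListed(is InstanceSet, secret string) (bool, error) {
	id, err := is.Create(secret)
	if err != nil {
		return false, err
	}
	ids, err := is.Instances()
	if err != nil {
		return false, err
	}
	for _, got := range ids {
		if got == id {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	ok, err := createdIsListed(&fakeSet{}, "s3cr3t")
	fmt.Println(ok, err)
}
```

A check like `createdIsListed` could be run against each driver's test stub to catch a Create() that returns too early.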
Updated by Tom Clegg over 5 years ago
- Category set to Crunch
- Target version set to To Be Groomed
Updated by Tom Clegg over 5 years ago
- Status changed from New to In Progress
- Assigned To set to Tom Clegg
- Target version changed from To Be Groomed to Arvados Future Sprints
14920-unknown-booting-race @ e49978c5d9bece2a1db646f36cdf346414dd8813
Updated by Ward Vandewege over 5 years ago
Tom Clegg wrote:
14920-unknown-booting-race @ e49978c5d9bece2a1db646f36cdf346414dd8813
LGTM. I like that the code is a lot more elegant now!
Updated by Tom Clegg over 5 years ago
- File 14920-fixed.png added
I noticed while testing that metrics didn't reflect the idle→running change right away when starting a container. With that fixed:
Updated by Tom Clegg over 5 years ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|64e72e2842da35ef1616a6a499731a6ed0a832b5.
Updated by Tom Morris over 5 years ago
- Target version changed from Arvados Future Sprints to 2019-03-13 Sprint