Bug #14920

[crunch-dispatch-cloud] New Azure instances always have state=unknown instead of state=booting

Added by Tom Clegg 8 months ago. Updated 5 months ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Start date: 03/07/2019
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -
Release relationship: Auto

Description

Currently, when arvados-dispatch-cloud (a-d-c) uses the Azure driver, new instances have state=unknown (instead of the expected state=booting) until the boot/run probes pass.

The "unknown" state is intended to cover the case where the "list instances" call returns a previously unseen instance ID. In the Azure case, the "create VM" call does not even return the ID of the newly created instance until the instance has finished booting. Until then, the new instance shows up in "list instances" results, but the dispatcher's worker pool cannot tell that it corresponds to an outstanding "create" call.

Some different ways to address this:
  • In the Azure driver, return as soon as the instance ID is known, instead of waiting for it to boot. This is how the driver is expected to work, but the Azure client library might not make it easy.
  • In the worker pool, when an unexpected instance ID appears, check whether its "secret token" tag matches an outstanding Create call. This would also cover the "list returns before create" race, which applies to all drivers.

The second option seems better.
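To make the second option concrete, here is a minimal sketch in Go of how a worker pool could match a newly listed instance against outstanding Create calls by its secret tag. It is not the code from the merged branch: the names Pool, StartCreate, Observe, and CreateReturned are hypothetical, and the real worker pool tracks much more state than this.

```go
package main

import "sync"

// InstanceState is a simplified stand-in for the dispatcher's worker states.
type InstanceState string

const (
	StateUnknown InstanceState = "unknown"
	StateBooting InstanceState = "booting"
)

// Pool is a hypothetical, stripped-down worker pool. It only tracks what
// is needed to show the secret-matching idea.
type Pool struct {
	mtx      sync.Mutex
	creating map[string]bool          // secret -> Create call still outstanding
	workers  map[string]InstanceState // instance ID -> current state
}

func NewPool() *Pool {
	return &Pool{
		creating: map[string]bool{},
		workers:  map[string]InstanceState{},
	}
}

// StartCreate records the secret that will be attached (as a tag) to the
// instance being created, before the driver's Create call is issued.
func (p *Pool) StartCreate(secret string) {
	p.mtx.Lock()
	defer p.mtx.Unlock()
	p.creating[secret] = true
}

// Observe is called for each instance returned by a "list instances"
// call. A previously unseen instance whose secret tag matches an
// outstanding Create is treated as booting rather than unknown.
func (p *Pool) Observe(id, secret string) InstanceState {
	p.mtx.Lock()
	defer p.mtx.Unlock()
	if st, ok := p.workers[id]; ok {
		return st // already tracked; keep its state
	}
	st := StateUnknown
	if secret != "" && p.creating[secret] {
		delete(p.creating, secret)
		st = StateBooting
	}
	p.workers[id] = st
	return st
}

// CreateReturned is called when the driver's Create call finally returns
// the instance ID. If the list call got there first, the instance is
// already tracked and nothing changes.
func (p *Pool) CreateReturned(secret, id string) {
	p.mtx.Lock()
	defer p.mtx.Unlock()
	delete(p.creating, secret)
	if _, ok := p.workers[id]; !ok {
		p.workers[id] = StateBooting
	}
}
```

With this shape, the order in which "list" and "create" report the instance no longer matters: whichever arrives first registers it as booting, and the other call becomes a no-op. An instance whose secret matches no outstanding Create still falls back to unknown.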

It would also be worth documenting the expected driver behavior in the driver interface definition: Create() should generally return as soon as the new instance's ID is known, but must not return so early that a subsequent call to Instances() might not include the new instance.
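As a rough illustration of what that documentation could look like, the sketch below states the contract in Go doc comments on a simplified interface. The InstanceSet and Instance types here are reduced, hypothetical versions written for this example (the real driver interface takes more parameters), and memDriver is a toy in-memory driver that trivially satisfies the contract.

```go
package main

import (
	"strconv"
	"sync"
)

// Instance is a simplified, hypothetical view of a cloud VM.
type Instance struct {
	ID   string
	Tags map[string]string
}

// InstanceSet is a simplified stand-in for the driver interface.
type InstanceSet interface {
	// Create starts a new instance carrying the given tags.
	//
	// Create should generally return as soon as the new instance's ID
	// is known, without waiting for the instance to boot. However, it
	// must not return so early that a subsequent call to Instances()
	// might not include the new instance.
	Create(tags map[string]string) (Instance, error)

	// Instances returns all instances currently known to the provider.
	Instances() ([]Instance, error)
}

// memDriver is a toy in-memory driver used only to show the contract.
type memDriver struct {
	mtx       sync.Mutex
	next      int
	instances []Instance
}

// Compile-time check that memDriver implements InstanceSet.
var _ InstanceSet = (*memDriver)(nil)

func (d *memDriver) Create(tags map[string]string) (Instance, error) {
	d.mtx.Lock()
	defer d.mtx.Unlock()
	d.next++
	inst := Instance{ID: "vm-" + strconv.Itoa(d.next), Tags: tags}
	// The instance is recorded before Create returns, so a later
	// Instances() call is guaranteed to include it.
	d.instances = append(d.instances, inst)
	return inst, nil
}

func (d *memDriver) Instances() ([]Instance, error) {
	d.mtx.Lock()
	defer d.mtx.Unlock()
	// Return a copy so callers cannot modify the driver's slice.
	return append([]Instance(nil), d.instances...), nil
}
```

A real cloud driver cannot always give this guarantee cheaply, which is why matching the secret tag in the worker pool is still useful as a safety net.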

14920-fixed.png (20 KB), added by Tom Clegg, 03/07/2019 07:38 PM

Subtasks

Task #14928: review 14920-unknown-booting-race (Resolved, Ward Vandewege)

Associated revisions

Revision 64e72e28
Added by Tom Clegg 7 months ago

Merge branch '14920-unknown-booting-race'

fixes #14920

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

History

#1 Updated by Tom Clegg 8 months ago

  • Category set to Crunch
  • Target version set to To Be Groomed

#2 Updated by Tom Clegg 8 months ago

  • Description updated (diff)

#3 Updated by Tom Clegg 8 months ago

  • Status changed from New to In Progress
  • Assigned To set to Tom Clegg
  • Target version changed from To Be Groomed to Arvados Future Sprints

14920-unknown-booting-race @ e49978c5d9bece2a1db646f36cdf346414dd8813

#4 Updated by Ward Vandewege 7 months ago

Tom Clegg wrote:

14920-unknown-booting-race @ e49978c5d9bece2a1db646f36cdf346414dd8813

LGTM. I like that the code is a lot more elegant now!

#5 Updated by Tom Clegg 7 months ago

I noticed while testing that metrics didn't reflect the idle→running change right away when starting a container. With that fixed:

#6 Updated by Tom Clegg 7 months ago

  • Status changed from In Progress to Resolved

#7 Updated by Tom Morris 7 months ago

  • Target version changed from Arvados Future Sprints to 2019-03-13 Sprint

#8 Updated by Tom Morris 5 months ago

  • Release set to 15
