Feature #21079


When at cloud quota, retry creating instances periodically even when none have shut down

Added by Tom Clegg 9 months ago. Updated 5 months ago.



From #20984#note-8:

The current behavior has some very bad failure modes. A user launched a pipeline that requested a large node (m4.10xlarge) and got InsufficientInstanceCapacity after only 3 instances had been created. This caused the dispatcher to stop trying to start nodes entirely and lowered the dynamic max instances to 3. As a result the pipeline became starved: the instances already running were waiting for the worker instance to start, but the dispatcher was waiting for an instance to shut down before it would try starting a new one.

Instead of going completely silent on a quota error, I think we want to either go back to the old behavior (1 minute quiet period) or implement exponential backoff (wait 15 seconds, then 30 seconds, then 60 seconds, then 2 minutes, etc.). An instance shutdown can still be used as a signal to try starting a new instance during the quiet period, but a quiet period of indefinite length is turning out to be bad behavior -- the correct assumption is that we're sharing the cloud resource with other users and new resources could become available at any time without us having to do anything.

Even after #20984 is fixed, a similar situation can still happen with conditions like InsufficientFreeAddressesInSubnet: if the relevant resources are freed up by something other than arvados-dispatch-cloud (or the relevant quota is increased), the current implementation will not notice until an existing instance gets shut down.

To address this, the quota flag should get reset after some time interval (1 minute?) even if no instances have been shut down.

Part of the original motivation for latching the quota flag was to avoid exhausting lock/unlock cycles. When changing this, make sure the fix in #20457 (don't unlock the next locked container just because cloud is at quota) is still effective.

