Feature #21079
When at cloud quota, retry creating instances periodically even when none have shut down
Description
From #20984#note-8:
The current behavior has some very bad failure modes. A user launched a pipeline which asked for a large node (m4.10xlarge) and got InsufficientInstanceCapacity after only 3 instances had been created; this caused the dispatcher to completely stop trying to start nodes and lowered the dynamic max instances to 3. As a result the pipeline became starved: the instances already running were waiting on the worker instance to start, but the dispatcher was waiting for an instance to shut down before it would try starting a new one.
Instead of going completely silent on a quota error, I think we want to either go back to the old behavior (1 minute quiet period) or implement exponential backoff (wait for 15 seconds, then 30 seconds, then 60 seconds, then 2 minutes, etc). An instance shutdown can still be used as a signal to try starting a new instance during the quiet period, but a quiet period of indefinite length is turning out to be bad behavior. The correct assumption is that we're sharing the cloud resource with other users, and new resources could become available at any time without us having to do anything.
Even after #20984 is fixed, a similar situation can still happen with conditions like InsufficientFreeAddressesInSubnet: if the relevant resources are freed up by something other than arvados-dispatch-cloud (or the relevant quota is increased), the current implementation will not notice until an existing instance gets shut down.
To address this, the quota flag should get reset after some time interval (1 minute?) even if no instances have been shut down.
Part of the original motivation for latching the quota flag was to avoid exhausting lock/unlock cycles. When changing this, make sure the fix in #20457 (don't unlock the next locked container just because cloud is at quota) is still effective.
Updated by Peter Amstutz 11 months ago
I don't know if this is so much of an issue any more. At the time, the issue was "instance type not available" errors, which we handle much better now.
Updated by Peter Amstutz 9 months ago
- Target version changed from To be scheduled to Future