Bug #16795
[a-d-c] flaky test
Description
run-tests-remainder: #3934 /consoleText
dispatcher_test.go:212: c.Check(resp.Body.String(), check.Matches, `(?ms).*boot_outcomes{outcome="aborted"} 0.*`) ... ... "arvados_dispatchcloud_boot_outcomes{outcome=\"aborted\"} 15\n" +
History
#1
Updated by Tom Clegg over 1 year ago
Cause: (*worker.Pool)Create() returns false when a rate-limit hold is in effect.
- In the simulation test, the stub driver returns a rate-limit error when it gets two Create() calls within 1ms (MinTimeBetweenInstancesCalls).
- When this happens, (*Pool)Create() calls wp.instanceSet.throttleCreate.CheckRateLimitError(), which puts a rate-limit hold in effect.
- The next call to Create() (if it happens in the same 1ms window) returns false because the rate-limit hold is still in effect.
- When Create() returns false, the scheduler invokes its quota-avoidance strategy: avoid starting any lower-priority containers, and shut down any idle/booting nodes that aren't needed for higher-priority containers.
- When rescheduling a container due to a simulated error or instance failure, there's likely to be a booting instance whose type only matches lower-priority containers, requested back when the previous attempt on the higher-priority container hadn't failed yet.
- When this happens, the booting instances get shut down, and end up with boot outcome "aborted".
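The sequence above can be illustrated with a small, self-contained sketch. It is not the dispatcher's code: the types `throttle`, `stubDriver` and `pool` and the helper `simulate` are hypothetical stand-ins, the driver call is made synchronously for simplicity, and the hold duration is assumed to equal the stub's minimum interval. It only shows how two calls inside the same 1ms window can make a third call return false.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errRateLimit stands in for the rate-limit error returned by a cloud driver.
var errRateLimit = errors.New("rate limit exceeded")

// throttle is a hypothetical stand-in for the pool's create throttle.
// It remembers until when a rate-limit hold is in effect.
type throttle struct {
	holdUntil time.Time
}

// checkRateLimitError starts a hold if err is a rate-limit error.
// Assumption: the hold lasts for retryAfter, counted from now.
func (t *throttle) checkRateLimitError(err error, now time.Time, retryAfter time.Duration) {
	if errors.Is(err, errRateLimit) {
		t.holdUntil = now.Add(retryAfter)
	}
}

// held reports whether a hold is still in effect at the given time.
func (t *throttle) held(now time.Time) bool {
	return now.Before(t.holdUntil)
}

// stubDriver rejects a create call that arrives less than minInterval
// after the previous call, like the stub driver described above.
type stubDriver struct {
	minInterval time.Duration
	lastCall    time.Time
	called      bool
}

func (d *stubDriver) create(now time.Time) error {
	tooSoon := d.called && now.Sub(d.lastCall) < d.minInterval
	d.called = true
	d.lastCall = now
	if tooSoon {
		return errRateLimit
	}
	return nil
}

// pool is a hypothetical, simplified worker pool.
type pool struct {
	driver   *stubDriver
	throttle throttle
}

// create returns false while a hold is in effect. Otherwise it calls the
// driver, lets the throttle inspect the error, and returns true.
func (p *pool) create(now time.Time) bool {
	if p.throttle.held(now) {
		return false
	}
	err := p.driver.create(now)
	p.throttle.checkRateLimitError(err, now, p.driver.minInterval)
	return true
}

// simulate calls create at the given offsets (in microseconds) from a
// fixed start time and collects the results.
func simulate(offsetsMicros []int) []bool {
	start := time.Unix(0, 0)
	p := &pool{driver: &stubDriver{minInterval: time.Millisecond}}
	results := make([]bool, 0, len(offsetsMicros))
	for _, us := range offsetsMicros {
		at := start.Add(time.Duration(us) * time.Microsecond)
		results = append(results, p.create(at))
	}
	return results
}

func main() {
	// Second call triggers the rate-limit error; third call hits the hold.
	fmt.Println(simulate([]int{0, 100, 200, 5000}))
}
```

In this sketch the second call is the one the stub rejects, and the third call is the one that returns false, which is the point at which the scheduler's quota-avoidance strategy would have been triggered.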
Solution: Stop invoking the quota-avoidance strategy when pool.Create() returns false but pool.AtQuota() is false. (When cloud providers/drivers have been returning errors other than quota/rate-limit, Create() returns true and tries creating instances asynchronously, so this change only affects the rate-limiting case.)
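A minimal sketch of the decision described in the solution, under stated assumptions: `decide` and the `action` values are hypothetical names, not the scheduler's actual API, and the "retry-later" outcome is an assumed placeholder for whatever the scheduler does when it neither proceeds nor invokes quota avoidance.

```go
package main

import "fmt"

// action is a hypothetical label for what the scheduler does next.
type action string

const (
	proceed    action = "proceed"
	retryLater action = "retry-later"     // assumed outcome, not stated in the issue
	avoidQuota action = "quota-avoidance" // stop lower-priority work, shut down unneeded nodes
)

// decide maps the results of Create() and AtQuota() to an action.
// Quota avoidance is chosen only when Create() returned false AND the
// pool reports that it is at quota; a false return caused by a
// rate-limit hold alone no longer triggers it.
func decide(created, atQuota bool) action {
	switch {
	case created:
		return proceed
	case atQuota:
		return avoidQuota
	default:
		return retryLater
	}
}

func main() {
	fmt.Println(decide(true, false))  // instance creation was started
	fmt.Println(decide(false, true))  // at quota
	fmt.Println(decide(false, false)) // rate-limit hold only
}
```

The third case is the one that changes: before the fix, a false return from Create() was treated the same way regardless of AtQuota(), so booting instances could be shut down and counted as "aborted".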
#2
Updated by Tom Clegg over 1 year ago
16795-boot-outcome-aborted @ b30659d514ce281209fa7b99863413832fa8d44b -- developer-run-tests: #2063
(also includes a fix for a race in the stub driver that caused occasional test failures)
#3
Updated by Tom Clegg over 1 year ago
16795-boot-outcome-aborted @ ac4599592d265dc5a922ec8f468d46cfe7de52e2 -- developer-run-tests: #2064
(pinned pycurl to avoid the latest py3-only versions)
#4
Updated by Ward Vandewege over 1 year ago
Tom Clegg wrote:
16795-boot-outcome-aborted @ ac4599592d265dc5a922ec8f468d46cfe7de52e2 -- developer-run-tests: #2064
(pinned pycurl to avoid the latest py3-only versions)
LGTM, thanks for the detailed explanation in note 1. The pycurl pin should be removed as part of #15888.
#5
Updated by Ward Vandewege over 1 year ago
- Related to Feature #15888: Update run-tests.sh to use python 3 added
#6
Updated by Anonymous over 1 year ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|c1bd1ee9ed5c36a3af524178e876a9b2255ab5f0.
#7
Updated by Peter Amstutz over 1 year ago
- Release set to 25
Merge branch '16795-boot-outcome-aborted'
fixes #16795
Arvados-DCO-1.1-Signed-off-by: Tom Clegg <tom@tomclegg.ca>