Bug #16795

[a-d-c] flaky test

Added by Tom Clegg about 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Start date: 09/02/2020
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -
Release relationship: Auto

Description

https://ci.arvados.org/job/run-tests-remainder/3934/consoleText

dispatcher_test.go:212:
    c.Check(resp.Body.String(), check.Matches, `(?ms).*boot_outcomes{outcome="aborted"} 0.*`)
...
...     "arvados_dispatchcloud_boot_outcomes{outcome=\"aborted\"} 15\n" +

Subtasks

Task #16798: Review 16795-boot-outcome-aborted (Resolved, Ward Vandewege)


Related issues

Related to Arvados - Feature #15888: Update run-tests.sh to use python 3 (Resolved)

Associated revisions

Revision c1bd1ee9
Added by Tom Clegg about 1 year ago

Merge branch '16795-boot-outcome-aborted'

fixes #16795

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

History

#1 Updated by Tom Clegg about 1 year ago

I think this started in #16739, when we fixed (*worker.Pool).Create() to return false while a rate-limit hold is in effect.
  • In the simulation test, the stub driver returns a rate-limit error when it gets two Create() calls within 1ms (MinTimeBetweenInstancesCalls).
  • When this happens, (*Pool).Create() calls wp.instanceSet.throttleCreate.CheckRateLimitError(), which starts a rate-limit hold.
  • The next call to Create() (if it happens in the same 1ms window) returns false because the rate-limit hold is still in effect.
  • When Create() returns false, the scheduler invokes its quota-avoidance strategy: avoid starting any lower-priority containers, and shut down any idle/booting nodes that aren't needed for higher-priority containers.
  • When rescheduling a container due to a simulated error or instance failure, there's likely to be a booting instance whose type only matches lower-priority containers, requested back when the previous attempt on the higher-priority container hadn't failed yet.
  • When this happens, the booting instances get shut down and end up with boot outcome "aborted", so the boot_outcomes{outcome="aborted"} metric is no longer 0 and the test assertion fails.
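
The rate-limit hold described in the bullets above can be sketched roughly as follows. This is a simplified illustration, not the actual worker pool code: holdThrottle, errRateLimit, create, and demo are hypothetical names, and the 1ms hold simply mirrors the window mentioned above.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errRateLimit is a hypothetical stand-in for a driver's rate-limit error.
var errRateLimit = errors.New("rate limit exceeded")

// holdThrottle is a hypothetical, simplified stand-in for the pool's
// create throttle: it remembers until when creation is on hold.
type holdThrottle struct {
	until time.Time
}

// check starts a hold of length d if err is a rate-limit error.
func (t *holdThrottle) check(err error, now time.Time, d time.Duration) {
	if errors.Is(err, errRateLimit) {
		t.until = now.Add(d)
	}
}

// held reports whether the hold is still in effect at time now.
func (t *holdThrottle) held(now time.Time) bool {
	return now.Before(t.until)
}

// create mimics the fixed Create() behavior: it returns false,
// without calling the driver, while a hold is in effect.
func create(t *holdThrottle, now time.Time) bool {
	return !t.held(now)
}

// demo replays the sequence: a create before any error, a create
// inside the 1ms hold window, and a create after the hold expires.
func demo() [3]bool {
	start := time.Unix(0, 0)
	t := &holdThrottle{}
	var out [3]bool
	out[0] = create(t, start)
	t.check(errRateLimit, start, time.Millisecond)
	out[1] = create(t, start.Add(500*time.Microsecond))
	out[2] = create(t, start.Add(2*time.Millisecond))
	return out
}

func main() {
	fmt.Println(demo())
}
```

The second call lands inside the hold window and returns false, which is the case that used to trigger the quota-avoidance strategy.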

Solution: Stop invoking the quota-avoidance strategy when pool.Create() returns false but pool.AtQuota() is false. (When cloud providers/drivers have been returning errors other than quota/rate-limit, Create() returns true and tries creating instances asynchronously, so this change only affects the rate-limiting case.)
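
A minimal sketch of the decision described in the solution, assuming a pool exposing only Create() and AtQuota(). The pool interface, stubPool, the action names, and tryCreate are hypothetical; in particular, "retry later" is an assumed label for what the scheduler does when it skips the quota-avoidance strategy.

```go
package main

import "fmt"

// pool is a hypothetical subset of the worker pool used by the scheduler.
type pool interface {
	Create() bool
	AtQuota() bool
}

// stubPool returns fixed answers so each case can be exercised directly.
type stubPool struct{ createOK, atQuota bool }

func (p stubPool) Create() bool  { return p.createOK }
func (p stubPool) AtQuota() bool { return p.atQuota }

// action names what the scheduler does after trying to create an instance.
type action string

const (
	actionCreated        action = "created"
	actionRetryLater     action = "retry later" // assumed label
	actionQuotaAvoidance action = "quota avoidance"
)

// tryCreate invokes the quota-avoidance strategy only when Create()
// fails and the pool reports that it is actually at quota. A false
// return caused only by a rate-limit hold no longer triggers it.
func tryCreate(p pool) action {
	if p.Create() {
		return actionCreated
	}
	if p.AtQuota() {
		return actionQuotaAvoidance
	}
	return actionRetryLater
}

func main() {
	fmt.Println(tryCreate(stubPool{createOK: true}))
	fmt.Println(tryCreate(stubPool{createOK: false, atQuota: true}))
	fmt.Println(tryCreate(stubPool{createOK: false, atQuota: false}))
}
```

With this shape, booting instances are shut down only in the at-quota case, so a brief rate-limit hold should no longer produce "aborted" boot outcomes.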

#2 Updated by Tom Clegg about 1 year ago

16795-boot-outcome-aborted @ b30659d514ce281209fa7b99863413832fa8d44b -- https://ci.arvados.org/view/Developer/job/developer-run-tests/2063/

(also includes a fix for a race in the stub driver that caused occasional test failures)

#3 Updated by Tom Clegg about 1 year ago

16795-boot-outcome-aborted @ ac4599592d265dc5a922ec8f468d46cfe7de52e2 -- https://ci.arvados.org/view/Developer/job/developer-run-tests/2064/

(pinned pycurl to avoid the latest py3-only versions)

#4 Updated by Ward Vandewege about 1 year ago

Tom Clegg wrote:

16795-boot-outcome-aborted @ ac4599592d265dc5a922ec8f468d46cfe7de52e2 -- https://ci.arvados.org/view/Developer/job/developer-run-tests/2064/

(pinned pycurl to avoid the latest py3-only versions)

LGTM, thanks for the detailed explanation in note 1. The pycurl pin should be removed as part of #15888.

#5 Updated by Ward Vandewege about 1 year ago

  • Related to Feature #15888: Update run-tests.sh to use python 3 added

#6 Updated by Anonymous about 1 year ago

  • Status changed from In Progress to Resolved

#7 Updated by Peter Amstutz about 1 year ago

  • Release set to 25
