Bug #18102 (closed): max dispatch attempts error

Added by Jiayong Li over 2 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Normal
Assigned To: -
Category: -
Target version: -
Story points: -
Release relationship: Auto

Description

https://workbench.2xpu4.arvadosapi.com/container_requests/2xpu4-xvhdp-vzc0tvmwarbfnx1

Error: Failed to start container. Cancelled after exceeding 'Containers.MaxDispatchAttempts' (lock_count=5)

Subtasks 1 (0 open, 1 closed)

Task #18111: Review 18102-max-dispatch-attempts (Resolved, Tom Clegg, 09/07/2021)

Related issues

Related to Arvados - Bug #11561: [API] Limit number of lock/unlock cycles for a given container (Resolved, Peter Amstutz, 04/26/2017)
Related to Arvados - Feature #18113: [a-d-c] non-zero defaults for MaxCloudOpsPerSecond and MaxConcurrentInstanceCreateOps (Resolved, Ward Vandewege, 10/05/2021)
Related to Arvados - Bug #18114: [a-d-c] slow down retries when CreateInstance returns non-quota/non-throttle errors (New)
#1

Updated by Jiayong Li over 2 years ago

  • Status changed from New to In Progress
#2

Updated by Jiayong Li over 2 years ago

  • Description updated (diff)
#3

Updated by Tom Clegg over 2 years ago

example of affected container: 2xpu4-dz642-3duf09jaxd3b9gv

{"ContainerUUID":"2xpu4-dz642-3duf09jaxd3b9gv","InstanceType":"m58xlarge.spot","PID":23276,"Priority":562948322809997302,"State":"Queued","level":"info","msg":"adding container to queue","time":"2021-09-02T20:20:53.543265064Z"}
{"ContainerUUID":"2xpu4-dz642-3duf09jaxd3b9gv","InstanceType":"m58xlarge.spot","PID":23276,"level":"info","msg":"creating new instance","time":"2021-09-02T20:21:51.709046394Z"}
{"InstanceType":"m58xlarge.spot","PID":23276,"error":"InsufficientInstanceCapacity: We currently do not have sufficient m5.8xlarge capacity in the Availability Zone you requested (us-east-1a). Our system will be working on provisioning additional capacity. You can currently get m5.8xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1b, us-east-1c, us-east-1d, us-east-1f.\n\tstatus code: 500, request id: 1a987cc4-80fc-4105-85bb-2d25fdf8dbbd","level":"error","msg":"create failed","time":"2021-09-02T20:21:52.628528300Z"}
{"ContainerUUID":"2xpu4-dz642-3duf09jaxd3b9gv","PID":23276,"error":"request failed: https://2xpu4.arvadosapi.com/arvados/v1/containers/2xpu4-dz642-3duf09jaxd3b9gv/unlock: 422 Unprocessable Entity: #\u003cArvadosModel::InvalidStateTransitionError: cannot unlock when Queued\u003e (req-a9h7sg2h1e1u11ity3xs)","level":"warning","msg":"error unlocking","time":"2021-09-02T20:21:53.392472387Z"}
{"ContainerUUID":"2xpu4-dz642-3duf09jaxd3b9gv","PID":23276,"error":"request failed: https://2xpu4.arvadosapi.com/arvados/v1/containers/2xpu4-dz642-3duf09jaxd3b9gv/unlock: 422 Unprocessable Entity: #\u003cArvadosModel::InvalidStateTransitionError: cannot unlock when Queued\u003e (req-14br8w8ut0nf9nek9xex)","level":"warning","msg":"error unlocking","time":"2021-09-02T20:21:53.974055346Z"}
{"ContainerUUID":"2xpu4-dz642-3duf09jaxd3b9gv","PID":23276,"error":"request failed: https://2xpu4.arvadosapi.com/arvados/v1/containers/2xpu4-dz642-3duf09jaxd3b9gv/unlock: 422 Unprocessable Entity: #\u003cArvadosModel::InvalidStateTransitionError: cannot unlock when Cancelled\u003e (req-4f9hcjkzs9x61409czrt)","level":"warning","msg":"error unlocking","time":"2021-09-02T20:21:56.305846772Z"}
{"ContainerUUID":"2xpu4-dz642-3duf09jaxd3b9gv","PID":23276,"State":"Cancelled","level":"info","msg":"container finished -- dropping from queue","time":"2021-09-02T20:21:56.305932024Z"}
{"ContainerUUID":"2xpu4-dz642-3duf09jaxd3b9gv","PID":23276,"State":"Cancelled","level":"info","msg":"dropping container from queue","time":"2021-09-02T20:21:56.305946078Z"}
#4

Updated by Tom Clegg over 2 years ago

  • Target version set to 2021-09-15 sprint
#5

Updated by Tom Clegg over 2 years ago

Flapping lock/unlock calls

The scheduler assumed, incorrectly, that sorting containers by priority would give consistent results from one iteration to the next. This is not true when multiple containers have identical priority, which is how child containers are created. As a result, when the scheduler is waiting for new instances to come up and knows there are more containers than new instances, it keeps changing its mind about which containers are "first" and should be locked, and which should be unlocked because they won't fit on the new instances.

Fix by tracking a FirstSeenAt timestamp on each queue entry, and using it to break ties when sorting identical-priority containers.

"Invalid state transition" errors

In runQueue, when ctr.State is Locked and sch.pool.AtQuota() is true, we call sch.queue.Unlock(), but we also include the current container in the overquota slice, which means we call sch.queue.Unlock() on it again in the "unlock any containers ..." block below. Fix by removing the first call to sch.queue.Unlock().
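A schematic reconstruction of that control flow (hypothetical, simplified names; not the actual runQueue source) shows why the container got unlocked twice:

```go
package main

import "fmt"

// fakeQueue counts Unlock calls per container so the double-unlock
// path is visible; the real queue issues an API call to the controller.
type fakeQueue struct {
	unlockCalls map[string]int
}

func (q *fakeQueue) Unlock(uuid string) {
	q.unlockCalls[uuid]++
}

type container struct {
	UUID   string
	Locked bool
}

// runQueue mimics the structure described above: with buggy=true, a
// Locked container at quota is unlocked immediately *and* appended to
// overquota, so the "unlock any containers ..." block unlocks it again.
func runQueue(q *fakeQueue, sorted []container, atQuota, buggy bool) {
	var overquota []container
	for _, ctr := range sorted {
		if ctr.Locked && atQuota {
			if buggy {
				q.Unlock(ctr.UUID) // first call -- removed by the fix
			}
			overquota = append(overquota, ctr)
		}
	}
	// "unlock any containers ..." block
	for _, ctr := range overquota {
		q.Unlock(ctr.UUID)
	}
}

func main() {
	before := &fakeQueue{unlockCalls: map[string]int{}}
	runQueue(before, []container{{UUID: "x", Locked: true}}, true, true)
	fmt.Println("unlock calls before fix:", before.unlockCalls["x"]) // 2

	after := &fakeQueue{unlockCalls: map[string]int{}}
	runQueue(after, []container{{UUID: "x", Locked: true}}, true, false)
	fmt.Println("unlock calls after fix:", after.unlockCalls["x"]) // 1
}
```

The second Unlock arrives after the first has already moved the container back to Queued, which matches the 422 "cannot unlock when Queued" warnings in the log above.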

18102-max-dispatch-attempts @ 6ba70e34ec9bc9941e8c04d6654106e6d1bbae82 -- developer-run-tests: #2678

extra thoughts

Rate limiting:

We're getting some rate-limiting errors (RequestLimitExceeded) from AWS. Not clear whether it's caused by having additional create requests in progress when the first one hits InsufficientInstanceCapacity, or just by creating several dozen instances at once.

Either way, these configs should probably not be zero/unlimited. Perhaps we should update the defaults to something reasonable.

        MaxCloudOpsPerSecond: 0
        MaxConcurrentInstanceCreateOps: 0

MaxConcurrentInstanceCreateOps was invented to accommodate Azure but it seems useful for AWS too.

Error handling:

If we get an error from the cloud provider when trying to create an instance, and the error isn't recognized as a quota or rate-limiting error, we retry very quickly, which would be unhelpful for an error like "invalid instance type". I don't think that is happening here (InsufficientInstanceCapacity is recognized as a quota error), but we might want to consider other options, such as a per-instance-type quiet period for unrecognized errors. That would probably also be a better way to respond to InsufficientInstanceCapacity.

#6

Updated by Tom Clegg over 2 years ago

  • Related to Bug #11561: [API] Limit number of lock/unlock cycles for a given container added
#8

Updated by Lucas Di Pentima over 2 years ago

Just one small question:

  • At file lib/dispatchcloud/scheduler/sync.go
    • Wouldn't the changes at lines 87-93 almost completely avoid logging the unexpected state? Maybe we could log the message just once per container?

The rest LGTM, including your extra thoughts, although maybe we want to discuss them with the rest of the team?

#9

Updated by Tom Clegg over 2 years ago

  • Related to Feature #18113: [a-d-c] non-zero defaults for MaxCloudOpsPerSecond and MaxConcurrentInstanceCreateOps added
#10

Updated by Tom Clegg over 2 years ago

  • Related to Bug #18114: [a-d-c] slow down retries when CreateInstance returns non-quota/non-throttle errors added
#11

Updated by Tom Clegg over 2 years ago

  • % Done changed from 0 to 100
  • Status changed from In Progress to Resolved

Applied in changeset arvados-private:commit:arvados|89bdd6bf07147ad707125fc3882caebb5402ab09.

#12

Updated by Tom Clegg over 2 years ago

  • Release set to 43
#13

Updated by Lucas Di Pentima over 2 years ago

  • Release changed from 43 to 42