Bug #18102

max dispatch attempts error

Added by Jiayong Li about 2 months ago. Updated 1 day ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
09/07/2021
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
-
Release relationship:
Auto

Description

https://workbench.2xpu4.arvadosapi.com/container_requests/2xpu4-xvhdp-vzc0tvmwarbfnx1

Error: Failed to start container. Cancelled after exceeding 'Containers.MaxDispatchAttempts' (lock_count=5)

Subtasks

Task #18111: Review 18102-max-dispatch-attempts (Resolved, assigned to Tom Clegg)


Related issues

Related to Arvados - Bug #11561: [API] Limit number of lock/unlock cycles for a given container (Resolved, 04/26/2017)

Related to Arvados - Feature #18113: [a-d-c] non-zero defaults for MaxCloudOpsPerSecond and MaxConcurrentInstanceCreateOps (Resolved, 10/05/2021)

Related to Arvados - Bug #18114: [a-d-c] slow down retries when CreateInstance returns non-quota/non-throttle errors (New)

History

#1 Updated by Jiayong Li about 2 months ago

  • Status changed from New to In Progress

#2 Updated by Jiayong Li about 2 months ago

  • Description updated (diff)

#3 Updated by Tom Clegg about 2 months ago

example of affected container: 2xpu4-dz642-3duf09jaxd3b9gv

{"ContainerUUID":"2xpu4-dz642-3duf09jaxd3b9gv","InstanceType":"m58xlarge.spot","PID":23276,"Priority":562948322809997302,"State":"Queued","level":"info","msg":"adding container to queue","time":"2021-09-02T20:20:53.543265064Z"}
{"ContainerUUID":"2xpu4-dz642-3duf09jaxd3b9gv","InstanceType":"m58xlarge.spot","PID":23276,"level":"info","msg":"creating new instance","time":"2021-09-02T20:21:51.709046394Z"}
{"InstanceType":"m58xlarge.spot","PID":23276,"error":"InsufficientInstanceCapacity: We currently do not have sufficient m5.8xlarge capacity in the Availability Zone you requested (us-east-1a). Our system will be working on provisioning additional capacity. You can currently get m5.8xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1b, us-east-1c, us-east-1d, us-east-1f.\n\tstatus code: 500, request id: 1a987cc4-80fc-4105-85bb-2d25fdf8dbbd","level":"error","msg":"create failed","time":"2021-09-02T20:21:52.628528300Z"}
{"ContainerUUID":"2xpu4-dz642-3duf09jaxd3b9gv","PID":23276,"error":"request failed: https://2xpu4.arvadosapi.com/arvados/v1/containers/2xpu4-dz642-3duf09jaxd3b9gv/unlock: 422 Unprocessable Entity: #\u003cArvadosModel::InvalidStateTransitionError: cannot unlock when Queued\u003e (req-a9h7sg2h1e1u11ity3xs)","level":"warning","msg":"error unlocking","time":"2021-09-02T20:21:53.392472387Z"}
{"ContainerUUID":"2xpu4-dz642-3duf09jaxd3b9gv","PID":23276,"error":"request failed: https://2xpu4.arvadosapi.com/arvados/v1/containers/2xpu4-dz642-3duf09jaxd3b9gv/unlock: 422 Unprocessable Entity: #\u003cArvadosModel::InvalidStateTransitionError: cannot unlock when Queued\u003e (req-14br8w8ut0nf9nek9xex)","level":"warning","msg":"error unlocking","time":"2021-09-02T20:21:53.974055346Z"}
{"ContainerUUID":"2xpu4-dz642-3duf09jaxd3b9gv","PID":23276,"error":"request failed: https://2xpu4.arvadosapi.com/arvados/v1/containers/2xpu4-dz642-3duf09jaxd3b9gv/unlock: 422 Unprocessable Entity: #\u003cArvadosModel::InvalidStateTransitionError: cannot unlock when Cancelled\u003e (req-4f9hcjkzs9x61409czrt)","level":"warning","msg":"error unlocking","time":"2021-09-02T20:21:56.305846772Z"}
{"ContainerUUID":"2xpu4-dz642-3duf09jaxd3b9gv","PID":23276,"State":"Cancelled","level":"info","msg":"container finished -- dropping from queue","time":"2021-09-02T20:21:56.305932024Z"}
{"ContainerUUID":"2xpu4-dz642-3duf09jaxd3b9gv","PID":23276,"State":"Cancelled","level":"info","msg":"dropping container from queue","time":"2021-09-02T20:21:56.305946078Z"}

#4 Updated by Tom Clegg about 2 months ago

  • Target version set to 2021-09-15 sprint

#5 Updated by Tom Clegg about 2 months ago

Flapping lock/unlock calls

The scheduler assumed, incorrectly, that sorting containers by priority would give consistent results from one iteration to the next. This is not true when multiple containers have identical priority, which is typically the case for child containers. As a result, when the scheduler is waiting for new instances to come up and knows there are more containers than new instances, it keeps changing its mind about which containers are "first" and should be locked, and which should be unlocked because they won't fit on the new instances.

Fix by tracking FirstSeenAt timestamp on each queue entry, and using that to sort identical-priority containers.

"Invalid state transition" errors

In runQueue, when ctr.State is Locked and sch.pool.AtQuota() is true, we call sch.queue.Unlock(), but we also include the current container in the overquota slice, which means we call sch.queue.Unlock() on it again in the "unlock any containers ..." block below. Fix by removing the first call to sch.queue.Unlock().
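The shape of the bug and fix can be sketched like this (types and names are illustrative stand-ins for the real runQueue code, with a counter added to make the double-unlock visible):

```go
package main

import "fmt"

// container and queue are minimal stand-ins for the real scheduler
// types; unlocks counts Unlock calls to show the bug's effect.
type container struct {
	UUID    string
	unlocks int
}

type queue struct{ containers map[string]*container }

func (q *queue) Unlock(uuid string) { q.containers[uuid].unlocks++ }

// runQueue handles locked containers while the pool is at quota. The
// fix: do NOT call q.Unlock in the at-quota branch; just collect the
// container in overquota and let the single pass below unlock it once.
func runQueue(q *queue, locked []*container, atQuota bool) {
	var overquota []*container
	for _, ctr := range locked {
		if atQuota {
			// the buggy version also called q.Unlock(ctr.UUID) here,
			// so the container was unlocked a second time below
			overquota = append(overquota, ctr)
		}
	}
	// "unlock any containers ..." block
	for _, ctr := range overquota {
		q.Unlock(ctr.UUID)
	}
}

func main() {
	c := &container{UUID: "ctr-x"}
	q := &queue{containers: map[string]*container{"ctr-x": c}}
	runQueue(q, []*container{c}, true)
	fmt.Println(c.unlocks) // 1: unlocked exactly once
}
```

In the real dispatcher, the second Unlock arrived after the container had already left the Locked state, producing the "cannot unlock when Queued/Cancelled" 422 errors in the log above.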

18102-max-dispatch-attempts @ 6ba70e34ec9bc9941e8c04d6654106e6d1bbae82 -- https://ci.arvados.org/view/Developer/job/developer-run-tests/2678/

extra thoughts

Rate limiting:

We're getting some rate-limiting errors (RequestLimitExceeded) from AWS. Not clear whether it's caused by having additional create requests in progress when the first one hits InsufficientInstanceCapacity, or just by creating several dozen instances at once.

Either way, these configs should probably not be zero/unlimited. Perhaps we should update the defaults to something reasonable.

        MaxCloudOpsPerSecond: 0
        MaxConcurrentInstanceCreateOps: 0

MaxConcurrentInstanceCreateOps was invented to accommodate Azure but it seems useful for AWS too.
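For illustration only, a non-zero configuration might look like the following; the values here are hypothetical placeholders, not the defaults eventually chosen (see #18113 for that decision):

```yaml
    Containers:
      CloudVMs:
        # Illustrative values only, not actual defaults.
        MaxCloudOpsPerSecond: 10
        MaxConcurrentInstanceCreateOps: 1
```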

Error handling:

If we get an error from the cloud provider when trying to create an instance, but the error isn't recognized as a quota or rate-limiting error, we retry very quickly, which would be unhelpful for an error like "invalid instance type". I don't think this is happening here (InsufficientInstanceCapacity is recognized as a quota error) but we might want to consider other options, like a per-instance-type quiet period for unrecognized errors (which would probably be a better way to respond to InsufficientInstanceCapacity as well).

#6 Updated by Tom Clegg about 2 months ago

  • Related to Bug #11561: [API] Limit number of lock/unlock cycles for a given container added

#8 Updated by Lucas Di Pentima about 2 months ago

Just one small question:

  • At file lib/dispatchcloud/scheduler/sync.go
  • Wouldn't the changes at lines 87-93 almost completely avoid logging the unexpected state? Maybe we could log the message just once per container?

The rest LGTM, including your extra thoughts, although maybe we want to discuss them with the rest of the team?

#9 Updated by Tom Clegg about 2 months ago

  • Related to Feature #18113: [a-d-c] non-zero defaults for MaxCloudOpsPerSecond and MaxConcurrentInstanceCreateOps added

#10 Updated by Tom Clegg about 2 months ago

  • Related to Bug #18114: [a-d-c] slow down retries when CreateInstance returns non-quota/non-throttle errors added

#11 Updated by Tom Clegg about 1 month ago

  • % Done changed from 0 to 100
  • Status changed from In Progress to Resolved

Applied in changeset arvados-private:commit:arvados|89bdd6bf07147ad707125fc3882caebb5402ab09.

#12 Updated by Tom Clegg about 1 month ago

  • Release set to 43

#13 Updated by Lucas Di Pentima 1 day ago

  • Release changed from 43 to 42
