Project

General

Profile

Actions

Bug #18977

open

Mishandled Azure error?

Added by Peter Amstutz 4 months ago. Updated 4 months ago.

Status:
New
Priority:
Normal
Assigned To:
-
Category:
Crunch
Target version:
-
Start date:
Due date:
% Done:

0%

Estimated time:
Story points:
-

Description

It seems that it got a 502 error from Azure trying to list/create a VM, but instead of retrying it stopped doing anything? There is a container waiting to run, so it should still be trying to start a VM.

Apr 08 14:38:53 api.ce8i5.arvadosapi.com arvados-dispatch-cloud[19803]: {"ClusterID":"ce8i5","PID":19803,"error":"compute.VirtualMachinesClient#List: Failure responding to request: StatusCode=502 -- Original Error: autorest/azure: Service returned an error. Status=502 Code=\"BadGatewayConnection\" Message=\"The network connectivity issue encountered for 'Microsoft.Compute'; cannot fulfill the request.\"","level":"warning","msg":"sync failed","time":"2022-04-08T14:38:53.170049232Z"}
Apr 08 14:48:17 api.ce8i5.arvadosapi.com arvados-dispatch-cloud[19803]: {"ClusterID":"ce8i5","ContainerUUID":"ce8i5-dz642-tjbnh9m9m96ok6l","InstanceType":"a1v2","PID":19803,"Priority":1124250477607195,"State":"Queued","level":"info","msg":"adding container to queue","time":"2022-04-08T14:48:17.687511023Z"}
Apr 08 14:48:17 api.ce8i5.arvadosapi.com arvados-dispatch-cloud[19803]: {"ClusterID":"ce8i5","ContainerUUID":"ce8i5-dz642-tjbnh9m9m96ok6l","InstanceType":"a1v2","PID":19803,"level":"info","msg":"creating new instance","time":"2022-04-08T14:48:17.731312912Z"}
Apr 08 14:48:51 api.ce8i5.arvadosapi.com arvados-dispatch-cloud[19803]: {"ClusterID":"ce8i5","PID":19803,"level":"info","msg":"Will delete compute-99cb150b26149780de44b929577e1aed-2go3rl5763z0uud-nic because it is older than 20s","time":"2022-04-08T14:48:51.339506878Z"}
Apr 08 14:48:55 api.ce8i5.arvadosapi.com arvados-dispatch-cloud[19803]: {"ClusterID":"ce8i5","PID":19803,"level":"info","msg":"Deleted NIC compute-99cb150b26149780de44b929577e1aed-2go3rl5763z0uud-nic","time":"2022-04-08T14:48:55.629906837Z"}
Apr 08 14:58:29 api.ce8i5.arvadosapi.com arvados-dispatch-cloud[19803]: {"ClusterID":"ce8i5","InstanceType":"a1v2","PID":19803,"error":"compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=502 -- Original Error: Code=\"ResourceReadFailed\" Message=\"Policy required full resource content to evaluate the request. The request to GET resource 'https://management.azure.com/subscriptions/3fa048dc-aa38-4820-85ba-68498da5f26b/resourceGroups/ce8i5/providers/Microsoft.Compute/virtualMachines/compute-99cb150b26149780de44b929577e1aed-2go3rl5763z0uud?api-version=2019-07-01' failed with status 'BadGateway'.\" Target=\"compute-99cb150b26149780de44b929577e1aed-2go3rl5763z0uud\"","level":"error","msg":"create failed","time":"2022-04-08T14:58:29.116278044Z"}
Apr 08 14:58:29 api.ce8i5.arvadosapi.com arvados-dispatch-cloud[19803]: {"ClusterID":"ce8i5","ContainerUUID":"ce8i5-dz642-tjbnh9m9m96ok6l","InstanceType":"a1v2","PID":19803,"level":"info","msg":"creating new instance","time":"2022-04-08T14:58:29.121954282Z"}
Apr 08 15:07:11 api.ce8i5.arvadosapi.com arvados-dispatch-cloud[19803]: {"ClusterID":"ce8i5","PID":19803,"error":"compute.VirtualMachinesClient#List: Failure responding to request: StatusCode=502 -- Original Error: autorest/azure: Service returned an error. Status=502 Code=\"BadGatewayConnection\" Message=\"The network connectivity issue encountered for 'Microsoft.Compute'; cannot fulfill the request.\"","level":"warning","msg":"sync failed","time":"2022-04-08T15:07:11.834191437Z"}
Apr 08 15:07:15 api.ce8i5.arvadosapi.com arvados-dispatch-cloud[19803]: {"ClusterID":"ce8i5","InstanceType":"a1v2","PID":19803,"error":"compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=502 -- Original Error: Code=\"ResourceReadFailed\" Message=\"Policy required full resource content to evaluate the request. The request to GET resource 'https://management.azure.com/subscriptions/3fa048dc-aa38-4820-85ba-68498da5f26b/resourceGroups/ce8i5/providers/Microsoft.Compute/virtualMachines/compute-99cb150b26149780de44b929577e1aed-v8bnr4q5sraw7ky?api-version=2019-07-01' failed with status 'BadGateway'.\" Target=\"compute-99cb150b26149780de44b929577e1aed-v8bnr4q5sraw7ky\"","level":"error","msg":"create failed","time":"2022-04-08T15:07:15.864958406Z"}
Apr 08 15:07:15 api.ce8i5.arvadosapi.com arvados-dispatch-cloud[19803]: {"ClusterID":"ce8i5","ContainerUUID":"ce8i5-dz642-tjbnh9m9m96ok6l","InstanceType":"a1v2","PID":19803,"level":"info","msg":"creating new instance","time":"2022-04-08T15:07:15.869242834Z"}
Apr 08 15:08:11 api.ce8i5.arvadosapi.com arvados-dispatch-cloud[19803]: {"ClusterID":"ce8i5","PID":19803,"level":"info","msg":"Will delete compute-99cb150b26149780de44b929577e1aed-5kdp34ouw4nt6df-nic because it is older than 20s","time":"2022-04-08T15:08:11.901856344Z"}
Apr 08 15:08:16 api.ce8i5.arvadosapi.com arvados-dispatch-cloud[19803]: {"ClusterID":"ce8i5","PID":19803,"level":"info","msg":"Deleted NIC compute-99cb150b26149780de44b929577e1aed-5kdp34ouw4nt6df-nic","time":"2022-04-08T15:08:16.315492360Z"}
api.ce8i5:~# arvados-server config-dump|grep -i create
        MaxConcurrentInstanceCreateOps: 1
Actions #1

Updated by Peter Amstutz 4 months ago

  • Description updated (diff)
Actions #2

Updated by Peter Amstutz 4 months ago

  • Description updated (diff)
Actions #3

Updated by Ward Vandewege 4 months ago

  • Description updated (diff)
Actions #4

Updated by Tom Clegg 4 months ago

I haven't confirmed this but ISTR the Azure SDK's "create instance" call sends a "create" request and then sits in a polling loop waiting for the instance to be created before returning to a-d-c. The 502 error might come from the polling, not the create request per se. In that case a-d-c is behaving as intended/desired here: even without the concurrent-create throttle, we don't want to ask for a 2nd instance just because the 1st one is taking a few minutes to appear in the list.

Maybe there is a way to report this so it's more obvious though -- "pending create requests" in metrics?

Actions #5

Updated by Peter Amstutz 4 months ago

  • Target version deleted (2022-04-27 Sprint)
Actions

Also available in: Atom PDF