Bug #18977
openMishandled Azure error?
Description
It seems that it got a 502 error from Azure trying to list/create a VM, but instead of retrying it stopped doing anything? There is a container waiting to run, so it should still be trying to start a VM.
Apr 08 14:38:53 api.ce8i5.arvadosapi.com arvados-dispatch-cloud[19803]: {"ClusterID":"ce8i5","PID":19803,"error":"compute.VirtualMachinesClient#List: Failure responding to request: StatusCode=502 -- Original Error: autorest/azure: Service returned an error. Status=502 Code=\"BadGatewayConnection\" Message=\"The network connectivity issue encountered for 'Microsoft.Compute'; cannot fulfill the request.\"","level":"warning","msg":"sync failed","time":"2022-04-08T14:38:53.170049232Z"} Apr 08 14:48:17 api.ce8i5.arvadosapi.com arvados-dispatch-cloud[19803]: {"ClusterID":"ce8i5","ContainerUUID":"ce8i5-dz642-tjbnh9m9m96ok6l","InstanceType":"a1v2","PID":19803,"Priority":1124250477607195,"State":"Queued","level":"info","msg":"adding container to queue","time":"2022-04-08T14:48:17.687511023Z"} Apr 08 14:48:17 api.ce8i5.arvadosapi.com arvados-dispatch-cloud[19803]: {"ClusterID":"ce8i5","ContainerUUID":"ce8i5-dz642-tjbnh9m9m96ok6l","InstanceType":"a1v2","PID":19803,"level":"info","msg":"creating new instance","time":"2022-04-08T14:48:17.731312912Z"} Apr 08 14:48:51 api.ce8i5.arvadosapi.com arvados-dispatch-cloud[19803]: {"ClusterID":"ce8i5","PID":19803,"level":"info","msg":"Will delete compute-99cb150b26149780de44b929577e1aed-2go3rl5763z0uud-nic because it is older than 20s","time":"2022-04-08T14:48:51.339506878Z"} Apr 08 14:48:55 api.ce8i5.arvadosapi.com arvados-dispatch-cloud[19803]: {"ClusterID":"ce8i5","PID":19803,"level":"info","msg":"Deleted NIC compute-99cb150b26149780de44b929577e1aed-2go3rl5763z0uud-nic","time":"2022-04-08T14:48:55.629906837Z"} Apr 08 14:58:29 api.ce8i5.arvadosapi.com arvados-dispatch-cloud[19803]: {"ClusterID":"ce8i5","InstanceType":"a1v2","PID":19803,"error":"compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=502 -- Original Error: Code=\"ResourceReadFailed\" Message=\"Policy required full resource content to evaluate the request. The request to GET resource 'https://management.azure.com/subscriptions/3fa048dc-aa38-4820-85ba-68498da5f26b/resourceGroups/ce8i5/providers/Microsoft.Compute/virtualMachines/compute-99cb150b26149780de44b929577e1aed-2go3rl5763z0uud?api-version=2019-07-01' failed with status 'BadGateway'.\" Target=\"compute-99cb150b26149780de44b929577e1aed-2go3rl5763z0uud\"","level":"error","msg":"create failed","time":"2022-04-08T14:58:29.116278044Z"} Apr 08 14:58:29 api.ce8i5.arvadosapi.com arvados-dispatch-cloud[19803]: {"ClusterID":"ce8i5","ContainerUUID":"ce8i5-dz642-tjbnh9m9m96ok6l","InstanceType":"a1v2","PID":19803,"level":"info","msg":"creating new instance","time":"2022-04-08T14:58:29.121954282Z"} Apr 08 15:07:11 api.ce8i5.arvadosapi.com arvados-dispatch-cloud[19803]: {"ClusterID":"ce8i5","PID":19803,"error":"compute.VirtualMachinesClient#List: Failure responding to request: StatusCode=502 -- Original Error: autorest/azure: Service returned an error. Status=502 Code=\"BadGatewayConnection\" Message=\"The network connectivity issue encountered for 'Microsoft.Compute'; cannot fulfill the request.\"","level":"warning","msg":"sync failed","time":"2022-04-08T15:07:11.834191437Z"} Apr 08 15:07:15 api.ce8i5.arvadosapi.com arvados-dispatch-cloud[19803]: {"ClusterID":"ce8i5","InstanceType":"a1v2","PID":19803,"error":"compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=502 -- Original Error: Code=\"ResourceReadFailed\" Message=\"Policy required full resource content to evaluate the request. The request to GET resource 'https://management.azure.com/subscriptions/3fa048dc-aa38-4820-85ba-68498da5f26b/resourceGroups/ce8i5/providers/Microsoft.Compute/virtualMachines/compute-99cb150b26149780de44b929577e1aed-v8bnr4q5sraw7ky?api-version=2019-07-01' failed with status 'BadGateway'.\" Target=\"compute-99cb150b26149780de44b929577e1aed-v8bnr4q5sraw7ky\"","level":"error","msg":"create failed","time":"2022-04-08T15:07:15.864958406Z"} Apr 08 15:07:15 api.ce8i5.arvadosapi.com arvados-dispatch-cloud[19803]: {"ClusterID":"ce8i5","ContainerUUID":"ce8i5-dz642-tjbnh9m9m96ok6l","InstanceType":"a1v2","PID":19803,"level":"info","msg":"creating new instance","time":"2022-04-08T15:07:15.869242834Z"} Apr 08 15:08:11 api.ce8i5.arvadosapi.com arvados-dispatch-cloud[19803]: {"ClusterID":"ce8i5","PID":19803,"level":"info","msg":"Will delete compute-99cb150b26149780de44b929577e1aed-5kdp34ouw4nt6df-nic because it is older than 20s","time":"2022-04-08T15:08:11.901856344Z"} Apr 08 15:08:16 api.ce8i5.arvadosapi.com arvados-dispatch-cloud[19803]: {"ClusterID":"ce8i5","PID":19803,"level":"info","msg":"Deleted NIC compute-99cb150b26149780de44b929577e1aed-5kdp34ouw4nt6df-nic","time":"2022-04-08T15:08:16.315492360Z"}
api.ce8i5:~# arvados-server config-dump|grep -i create MaxConcurrentInstanceCreateOps: 1
Updated by Tom Clegg over 2 years ago
I haven't confirmed this but ISTR the Azure SDK's "create instance" call sends a "create" request and then sits in a polling loop waiting for the instance to be created before returning to a-d-c. The 502 error might come from the polling, not the create request per se. In that case a-d-c is behaving as intended/desired here: even without the concurrent-create throttle, we don't want to ask for a 2nd instance just because the 1st one is taking a few minutes to appear in the list.
Maybe there is a way to report this so it's more obvious though -- "pending create requests" in metrics?
Updated by Peter Amstutz over 2 years ago
- Target version deleted (
2022-04-27 Sprint)