Bug #17186

open

[dispatch] broken node logs should also be copied to a-d-c logs

Added by Ward Vandewege over 1 year ago.

Status: New
Priority: Normal
Assigned To: -
Category: -
Target version: -
Start date:
Due date:
% Done: 0%
Estimated time:
Story points: -

Description

Currently, when crunch-run detects a broken node, it reports that only in the container logs, e.g. su92l-dz642-uyoqykf2i604pma, https://workbench.su92l.arvadosapi.com/collections/04c43ec5454a350c37c0affd7d331e63+1236/crunch-run.txt?disposition=inline&size=1358:

2020-12-01T20:35:27.009845935Z Error suggests node is unable to run containers: While loading container image into Docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
2020-12-01T20:35:27.009904337Z Writing /var/lock/crunch-run-broken to mark node as broken
2020-12-01T20:35:27.009991541Z error in Run: While loading container image: While loading container image into Docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
2020-12-01T20:35:27.073795048Z crunch-run finished

Meanwhile, the arvados-dispatch-cloud (a-d-c) logs don't provide any detail about why the node was considered broken:

Dec 01 20:34:43 su92l.arvadosapi.com arvados-dispatch-cloud[120842]: {"Address":"10.28.64.31","ContainerUUID":"su92l-dz642-n9nu9htkcj4ofp6","Instance":"/subscriptions/3fa048dc-aa38-4820-85ba-68498da5f26b/resourceGroups/su92l/providers/Microsoft.Compute/virtualMachines/compute-f51710e302afe4aef4a97c634a7c2ed3-tyxs4w1m1dwwkfj","InstanceType":"Standard_D32s_v3","PID":120842,"level":"info","msg":"crunch-run process started","time":"2020-12-01T20:34:43.820244851Z"}
Dec 01 20:35:54 su92l.arvadosapi.com arvados-dispatch-cloud[120842]: {"Address":"10.28.64.31","ContainerUUID":"su92l-dz642-n9nu9htkcj4ofp6","Instance":"/subscriptions/3fa048dc-aa38-4820-85ba-68498da5f26b/resourceGroups/su92l/providers/Microsoft.Compute/virtualMachines/compute-f51710e302afe4aef4a97c634a7c2ed3-tyxs4w1m1dwwkfj","InstanceType":"Standard_D32s_v3","PID":120842,"Reason":"state=Queued","level":"info","msg":"killing crunch-run process","time":"2020-12-01T20:35:54.226894349Z"}

It would be helpful for debugging to also copy the broken node details into the a-d-c logs.
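
As a rough illustration of the request (not the dispatcher's actual code, which is written in Go), the sketch below scans crunch-run log text for the two broken-node messages quoted above and emits JSON entries shaped loosely like the a-d-c log lines. The function name `broken_node_entries`, the `detail` field, the `warning` level, and the `msg` text are all assumptions made for this example.

```python
import json
import re

# Marker strings copied from the crunch-run log excerpt in this issue.
BROKEN_NODE_MARKERS = (
    "Error suggests node is unable to run containers",
    "Writing /var/lock/crunch-run-broken to mark node as broken",
)

# A timestamp ending in "Z", whitespace, then the message text.
LINE_RE = re.compile(r"^(?P<time>\S+Z)\s+(?P<msg>.*)$")


def broken_node_entries(crunch_run_log, container_uuid, instance):
    """Return JSON strings for crunch-run lines that report a broken node.

    Hypothetical helper: the field names mimic the a-d-c log lines above,
    except "detail", which is invented here to carry the copied message.
    """
    entries = []
    for line in crunch_run_log.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not a timestamped log line
        msg = match.group("msg")
        if not any(marker in msg for marker in BROKEN_NODE_MARKERS):
            continue  # unrelated crunch-run output
        entries.append(json.dumps({
            "ContainerUUID": container_uuid,
            "Instance": instance,
            "level": "warning",  # assumed severity
            "msg": "crunch-run reported broken node",  # assumed wording
            "detail": msg,  # hypothetical field holding the original text
            "time": match.group("time"),
        }, sort_keys=True))
    return entries
```

Run against the four crunch-run lines quoted above, this would keep the "Error suggests node is unable to run containers" and "Writing /var/lock/crunch-run-broken" lines and skip the other two.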


Related issues

Related to Arvados - Feature #17185: [adc] add broken node metrics (New, Tom Clegg)


Updated by Ward Vandewege over 1 year ago
