Bug #17186
open
[dispatch] crunch-run logs should be copied to a-d-c logs if crunch-run can't save a log collection
Description
Currently, when crunch-run detects a broken node, it will report that in the container logs, e.g. su92l-dz642-uyoqykf2i604pma, https://workbench.su92l.arvadosapi.com/collections/04c43ec5454a350c37c0affd7d331e63+1236/crunch-run.txt?disposition=inline&size=1358:
2020-12-01T20:35:27.009845935Z Error suggests node is unable to run containers: While loading container image into Docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
2020-12-01T20:35:27.009904337Z Writing /var/lock/crunch-run-broken to mark node as broken
2020-12-01T20:35:27.009991541Z error in Run: While loading container image: While loading container image into Docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
2020-12-01T20:35:27.073795048Z crunch-run finished
Meanwhile, the a-d-c logs don't provide any detail:
Dec 01 20:34:43 su92l.arvadosapi.com arvados-dispatch-cloud[120842]: {"Address":"10.28.64.31","ContainerUUID":"su92l-dz642-n9nu9htkcj4ofp6","Instance":"/subscriptions/3fa048dc-aa38-4820-85ba-68498da5f26b/resourceGroups/su92l/providers/Microsoft.Compute/virtualMachines/compute-f51710e302afe4aef4a97c634a7c2ed3-tyxs4w1m1dwwkfj","InstanceType":"Standard_D32s_v3","PID":120842,"level":"info","msg":"crunch-run process started","time":"2020-12-01T20:34:43.820244851Z"}
Dec 01 20:35:54 su92l.arvadosapi.com arvados-dispatch-cloud[120842]: {"Address":"10.28.64.31","ContainerUUID":"su92l-dz642-n9nu9htkcj4ofp6","Instance":"/subscriptions/3fa048dc-aa38-4820-85ba-68498da5f26b/resourceGroups/su92l/providers/Microsoft.Compute/virtualMachines/compute-f51710e302afe4aef4a97c634a7c2ed3-tyxs4w1m1dwwkfj","InstanceType":"Standard_D32s_v3","PID":120842,"Reason":"state=Queued","level":"info","msg":"killing crunch-run process","time":"2020-12-01T20:35:54.226894349Z"}
It would be helpful for debugging to copy the broken-node details into the a-d-c logs.
More generally (even when a node is not marked as broken), this would help whenever crunch-run can't save a collection or update the container record, e.g., because of a networking, firewall, or API issue. Crunch-run copies all log entries to the journal, prefixed with the container UUID and a timestamp, so a-d-c could run
journalctl -t crunch-run -o cat --grep ^${uuid}
on the worker and copy the output to its own logs whenever a crunch-run process exits without finalizing the container.
Updated by Ward Vandewege over 4 years ago
- Related to Feature #17185: [adc] add broken node metrics added
Updated by Brett Smith 7 days ago
- Related to Idea #21581: Crunch saves compute node journals to collections readable only by administrators added
Updated by Tom Clegg 6 days ago
- Related to Feature #20220: Dispatcher uses live logs endpoint on crunch-run to fetch logs and store a backup locally added