Bug #17186

open

[dispatch] crunch-run logs should be copied to a-d-c logs if crunch-run can't save a log collection

Added by Ward Vandewege over 4 years ago. Updated 6 days ago.

Status:
New
Priority:
Normal
Assigned To:
-
Category:
-
Target version:
Story points:
-

Description

Currently, when crunch-run detects a broken node, it will report that in the container logs, e.g. su92l-dz642-uyoqykf2i604pma, https://workbench.su92l.arvadosapi.com/collections/04c43ec5454a350c37c0affd7d331e63+1236/crunch-run.txt?disposition=inline&size=1358:

2020-12-01T20:35:27.009845935Z Error suggests node is unable to run containers: While loading container image into Docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
2020-12-01T20:35:27.009904337Z Writing /var/lock/crunch-run-broken to mark node as broken
2020-12-01T20:35:27.009991541Z error in Run: While loading container image: While loading container image into Docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
2020-12-01T20:35:27.073795048Z crunch-run finished

Meanwhile, the a-d-c logs don't provide any detail:

Dec 01 20:34:43 su92l.arvadosapi.com arvados-dispatch-cloud[120842]: {"Address":"10.28.64.31","ContainerUUID":"su92l-dz642-n9nu9htkcj4ofp6","Instance":"/subscriptions/3fa048dc-aa38-4820-85ba-68498da5f26b/resourceGroups/su92l/providers/Microsoft.Compute/virtualMachines/compute-f51710e302afe4aef4a97c634a7c2ed3-tyxs4w1m1dwwkfj","InstanceType":"Standard_D32s_v3","PID":120842,"level":"info","msg":"crunch-run process started","time":"2020-12-01T20:34:43.820244851Z"}
Dec 01 20:35:54 su92l.arvadosapi.com arvados-dispatch-cloud[120842]: {"Address":"10.28.64.31","ContainerUUID":"su92l-dz642-n9nu9htkcj4ofp6","Instance":"/subscriptions/3fa048dc-aa38-4820-85ba-68498da5f26b/resourceGroups/su92l/providers/Microsoft.Compute/virtualMachines/compute-f51710e302afe4aef4a97c634a7c2ed3-tyxs4w1m1dwwkfj","InstanceType":"Standard_D32s_v3","PID":120842,"Reason":"state=Queued","level":"info","msg":"killing crunch-run process","time":"2020-12-01T20:35:54.226894349Z"}

Copying the broken node details into the a-d-c logs would help with debugging.

More generally (even when a node is not marked as broken), this would help whenever crunch-run can't save a collection or update the container record, e.g., because of a networking, firewall, or API issue. Crunch-run copies all log entries to the journal, prefixed with the container UUID and timestamp, so a-d-c could run journalctl -t crunch-run -o cat --grep ^${uuid} on the worker and copy the output to its own logs whenever a crunch-run process exits without finalizing the container.
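
The proposal above could be sketched roughly as follows. This is only an illustration in Python, not a-d-c code: the helper names (journal_command, relay_journal), the shape of the emitted log entry, and the UUID pattern (inferred from the example UUIDs in this ticket) are all assumptions. It also assumes the dispatcher has some existing way to run a command on the worker and capture its output, which is left out here.

```python
import re

# Assumed UUID shape, inferred from the examples in this ticket
# (e.g. su92l-dz642-uyoqykf2i604pma); used to reject unsafe input
# before it ends up in a command run on the worker.
CONTAINER_UUID = re.compile(r"[a-z0-9]{5}-dz642-[a-z0-9]{15}")


def journal_command(uuid):
    """Return the journalctl argv from the description for one container."""
    if not CONTAINER_UUID.fullmatch(uuid):
        raise ValueError(f"not a container UUID: {uuid!r}")
    return ["journalctl", "-t", "crunch-run", "-o", "cat", "--grep", "^" + uuid]


def relay_journal(uuid, journal_output, emit):
    """Pass each journal line for this container to emit(); return the count.

    journal_output is the text captured from running journal_command() on
    the worker; emit is whatever writes one entry to the dispatcher's log
    (hypothetical here).
    """
    copied = 0
    for line in journal_output.splitlines():
        # Re-check the prefix so unrelated lines are never copied.
        if line.startswith(uuid):
            emit({"ContainerUUID": uuid, "msg": "crunch-run journal", "text": line})
            copied += 1
    return copied
```

The relay step would only run when a crunch-run process exits without finalizing the container, so the normal path adds no extra log volume. Validating the UUID first is a design choice made for this sketch, since the value is interpolated into a command executed on another machine.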


Related issues 3 (3 open, 0 closed)

Related to Arvados - Feature #17185: [adc] add broken node metrics (New, Tom Clegg)
Related to Arvados - Idea #21581: Crunch saves compute node journals to collections readable only by administrators (New)
Related to Arvados - Feature #20220: Dispatcher uses live logs endpoint on crunch-run to fetch logs and store a backup locally (New)
Actions #1

Updated by Ward Vandewege over 4 years ago

Actions #2

Updated by Peter Amstutz about 2 years ago

  • Release set to 60
Actions #3

Updated by Peter Amstutz about 1 year ago

  • Target version set to Future
Actions #4

Updated by Brett Smith 7 days ago

  • Related to Idea #21581: Crunch saves compute node journals to collections readable only by administrators added
Actions #5

Updated by Brett Smith 7 days ago

  • Release deleted (60)
Actions #6

Updated by Tom Clegg 6 days ago

  • Related to Feature #20220: Dispatcher uses live logs endpoint on crunch-run to fetch logs and store a backup locally added
Actions #7

Updated by Tom Clegg 6 days ago

  • Description updated (diff)
  • Subject changed from [dispatch] broken node logs should also be copied to a-d-c logs to [dispatch] crunch-run logs should be copied to a-d-c logs if crunch-run can't save a log collection