Bug #20649

closed

Improve troubleshooting assistance for compute instance SSH problems

Added by Tom Clegg about 1 year ago. Updated 10 months ago.

Status: Resolved
Priority: Normal
Assigned To: Tom Clegg
Category: Crunch
Target version: Development 2023-08-16
Story points: 1.0
Release relationship: Auto

Description

Any number of setup mistakes or problems can leave arvados-dispatch-cloud unable to authenticate to a new cloud VM after successfully creating it. Currently, when this happens, Arvados gives few useful debugging clues.

Specific improvements that would help:
  • arvados-server cloudtest should obey DeployPublicKey flag, so that it does the same thing as a-d-c.
  • SSH authentication errors should be logged right away. Most boot probe failures are "can't connect to SSH port" or "probe command failed because system is still booting"; many of these are expected in normal operation, which is why they are suppressed until boot timeout. An SSH authentication problem, however, is not expected in normal operation.
  • When timing out on boot probe, a-d-c should log the last error, not just the stderr from the last attempt (which is empty in this case)
  • When timing out on boot probe, a-d-c's log message should remind the operator that "arvados-server cloudtest" is available to help troubleshoot.

Subtasks 1 (0 open, 1 closed)

Task #20815: Review 20649-ssh-help | Resolved | Tom Clegg | 08/14/2023
Actions #1

Updated by Peter Amstutz 11 months ago

  • Target version changed from To be scheduled to Development 2023-08-16
Actions #2

Updated by Peter Amstutz 11 months ago

  • Assigned To set to Tom Clegg
Actions #3

Updated by Tom Clegg 10 months ago

20649-ssh-help @ 471f5c792f4bf06ca4fab1600074c8c335b7d70f -- developer-run-tests: #3774

Branch contains the 4 features described above, plus
  • cherry-picked unmerged c9e7b20bf from #20457#note-59 to fix occasional test failure
  • e44725a37 to fix occasional test panic that could also potentially happen IRL
  • b38ee1ddd to eliminate unnecessary delay between deciding to kill a container and sending the first TERM signal
Actions #4

Updated by Tom Clegg 10 months ago

  • Status changed from New to In Progress
Actions #5

Updated by Lucas Di Pentima 10 months ago

This LGTM, thanks!

Actions #6

Updated by Tom Clegg 10 months ago

  • Status changed from In Progress to Resolved
Actions #7

Updated by Peter Amstutz 10 months ago

  • Release set to 66