Story #19364

closed

Document use of diagnostics & health check to check running versions, config file matching, & overall cluster functioning

Added by Peter Amstutz 6 months ago. Updated about 1 month ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Documentation
Target version:
Start date: 10/12/2022
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -
Release relationship: Auto

Subtasks 2 (0 open, 2 closed)

Task #19606: Review 19364-diag-docs | Resolved | Lucas Di Pentima | 11/15/2022
Task #19607: Review | Closed | 10/12/2022

Related issues

Related to Arvados - Bug #19215: improve multi-node installer experience | Resolved | Peter Amstutz | 06/28/2022
Related to Arvados - Feature #19377: [diagnostics] show health-check response | Resolved | Tom Clegg | 10/05/2022
Actions #1

Updated by Peter Amstutz 5 months ago

  • Target version changed from 2022-08-31 sprint to 2022-09-14 sprint
Actions #2

Updated by Peter Amstutz 5 months ago

  • Target version changed from 2022-09-14 sprint to 2022-09-28 sprint
Actions #3

Updated by Peter Amstutz 5 months ago

  • Target version changed from 2022-09-28 sprint to 2022-10-12 sprint
Actions #4

Updated by Peter Amstutz 4 months ago

  • Target version changed from 2022-10-12 sprint to 2022-10-26 sprint
Actions #5

Updated by Peter Amstutz 4 months ago

  • Related to Bug #19215: improve multi-node installer experience added
Actions #6

Updated by Peter Amstutz 4 months ago

  • Related to Feature #19377: [diagnostics] show health-check response added
Actions #7

Updated by Peter Amstutz 4 months ago

  • Assigned To set to Tom Clegg
Actions #8

Updated by Peter Amstutz 3 months ago

  • Target version changed from 2022-10-26 sprint to 2022-11-09 sprint
Actions #9

Updated by Tom Clegg 3 months ago

  • Target version changed from 2022-11-09 sprint to 2022-11-23 sprint
Actions #10

Updated by Tom Clegg 2 months ago

  • Status changed from New to In Progress

19364-diag-docs @ 4d0ab09acfd9aed9c4b2cf6c1a85a9538e9c969d -- developer-run-tests: #3372

  • add "run diagnostics" to the upgrade instructions
  • replace "submit a container request using 'arv'" with "run diagnostics" on the dispatch-cloud/lsf/slurm install pages
Actions #11

Updated by Lucas Di Pentima 2 months ago

19364-diag-docs LGTM if it's only intended to document the use of diagnostics.

Are the other checks listed in the title already documented, or will they be covered on other branches?

Actions #12

Updated by Tom Clegg 2 months ago

The diagnostics command incorporates a health-check, so in that sense it's included.

We could also add:
  • a page (in the Admin>Monitoring section?) about the diagnostics command, similar to the "Testing cloud configuration" page
  • a bit about the "arvados-server health" command on the Admin > Monitoring > Health checks page
Actions #13

Updated by Tom Clegg 2 months ago

19364-diag-docs @ 75d0bce4f378efc488b67b178ace50301f9ad8ff -- developer-run-tests: #3384
  • add Admin > Monitoring > Diagnostics page
  • add "arvados-server check" to the Admin > Monitoring > Health checks page
Actions #14

Updated by Tom Clegg 2 months ago

Merged main to get the lib/pam testing fix.

19364-diag-docs @ 531fd553a1b83c546066c1d2a2619f86e17b6d20 -- developer-run-tests: #3385

Actions #15

Updated by Lucas Di Pentima 2 months ago

Just one comment:

  • In doc/admin/diagnostics.html.textile.liquid, I think the word "using" could be dropped from "...you can also run diagnostics using by setting the usual..."

The rest LGTM, thanks.

On a "diagnostics"-related note: do you think it would be a good idea to make the diagnostics tool cancel the test container request when the 10-minute timeout passes and nothing has happened? In my Terraform adventures, an instance was launched but couldn't run anything, so the diag tool failed at the 10-minute mark and the instance wasn't destroyed until I manually cancelled the CR in Workbench (maybe the dispatcher would eventually kill it?).

Actions #16

Updated by Tom Clegg 2 months ago

Removed extra "using" word in docs.

"Cancel CR if it doesn't finish" is a great idea -- added.

(In the particular case you described I think MaxDispatchAttempts would have canceled it eventually -- but either way, it seems like there's no reason for the dispatcher to keep trying once diagnostics has stopped paying attention.)

19364-diag-docs @ 273d4dda75bad4b1ba18bc3616f16082b95c0467 -- developer-run-tests: #3390
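
For readers following along, the "cancel CR if it doesn't finish" behavior can be sketched roughly as below. This is only an illustration of the idea, not the code in the branch: `wait_or_cancel`, `get_state`, and `cancel` are hypothetical names, and the sketch assumes the caller supplies callables that read the container request state and cancel it (for example by setting its priority to 0).

```python
import time
from typing import Callable


def wait_or_cancel(
    get_state: Callable[[], str],
    cancel: Callable[[], None],
    timeout: float,
    poll_interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the request is "Final" or the deadline passes.

    Returns True if the request finished in time. Otherwise calls
    cancel() once and returns False, so nothing keeps trying to run
    the request after the caller has stopped waiting.
    All names here are hypothetical, for illustration only.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        if get_state() == "Final":
            return True
        sleep(poll_interval)
    # Deadline passed without reaching "Final": give up and cancel.
    cancel()
    return False
```

With a 10-minute limit this would be called as `wait_or_cancel(get_state, cancel, timeout=600, poll_interval=5)`. The `clock` and `sleep` parameters exist only so the timeout path can be exercised without actually waiting.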

Actions #17

Updated by Lucas Di Pentima 2 months ago

Great! This LGTM, thank you.

Actions #18

Updated by Tom Clegg 2 months ago

  • Status changed from In Progress to Resolved
Actions #19

Updated by Peter Amstutz about 1 month ago

  • Release set to 47