Idea #19364
closed
Document use of diagnostics & health check to check running versions, config file matching, & overall cluster functioning
Updated by Peter Amstutz over 2 years ago
- Target version changed from 2022-08-31 sprint to 2022-09-14 sprint
Updated by Peter Amstutz over 2 years ago
- Target version changed from 2022-09-14 sprint to 2022-09-28 sprint
Updated by Peter Amstutz over 2 years ago
- Target version changed from 2022-09-28 sprint to 2022-10-12 sprint
Updated by Peter Amstutz about 2 years ago
- Target version changed from 2022-10-12 sprint to 2022-10-26 sprint
Updated by Peter Amstutz about 2 years ago
- Related to Bug #19215: improve multi-node installer experience added
Updated by Peter Amstutz about 2 years ago
- Related to Feature #19377: [diagnostics] show health-check response added
Updated by Peter Amstutz about 2 years ago
- Target version changed from 2022-10-26 sprint to 2022-11-09 sprint
Updated by Tom Clegg about 2 years ago
- Target version changed from 2022-11-09 sprint to 2022-11-23 sprint
Updated by Tom Clegg about 2 years ago
- Status changed from New to In Progress
19364-diag-docs @ 4d0ab09acfd9aed9c4b2cf6c1a85a9538e9c969d -- developer-run-tests: #3372
- add "run diagnostics" to the upgrade instructions
- replace "submit a container request using 'arv'" with "run diagnostics" on the dispatch-cloud/lsf/slurm install pages
Updated by Lucas Di Pentima about 2 years ago
19364-diag-docs
LGTM if it's only intended to document the use of diagnostics.
Are the other checks listed in the title already documented, or will they be covered on other branches?
Updated by Tom Clegg about 2 years ago
The diagnostics command incorporates a health-check, so in that sense it's included.
We could also add:
- a page (in the Admin > Monitoring section?) about the diagnostics command, similar to the "Testing cloud configuration" page
- a bit about the "arvados-server health" command on the Admin > Monitoring > Health checks page
Updated by Tom Clegg about 2 years ago
- add Admin > Monitoring > Diagnostics page
- add "arvados-server check" to the Admin > Monitoring > Health checks page
Updated by Tom Clegg about 2 years ago
Merged main to get the lib/pam testing fix.
19364-diag-docs @ 531fd553a1b83c546066c1d2a2619f86e17b6d20 -- developer-run-tests: #3385
Updated by Lucas Di Pentima about 2 years ago
Just one comment:
- At doc/admin/diagnostics.html.textile.liquid I think the word "using" could be dropped from "...you can also run diagnostics using by setting the usual..."
The rest LGTM, thanks.
On a "diagnostics"-related note: do you think it would be a good idea for the diagnostics tool to cancel the test container request when the 10-minute timeout passes and nothing has happened? In my Terraform adventures, an instance was launched but couldn't run anything, so the diag tool failed at the 10-minute mark and the instance wasn't destroyed unless I manually cancelled the CR in Workbench (maybe the dispatcher would eventually kill it?)
Updated by Tom Clegg about 2 years ago
Removed extra "using" word in docs.
"Cancel CR if it doesn't finish" is a great idea -- added.
(In the particular case you described I think MaxDispatchAttempts would have canceled it eventually -- but either way, it seems like there's no reason for the dispatcher to keep trying once diagnostics has stopped paying attention.)
19364-diag-docs @ 273d4dda75bad4b1ba18bc3616f16082b95c0467 -- developer-run-tests: #3390
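For readers following along, the "cancel CR if it doesn't finish" behaviour discussed above can be sketched roughly as below. This is only an illustration of the pattern, not the code on the branch: `wait_or_cancel`, `poll`, and `cancel` are hypothetical names, and the sketch assumes `poll` returns the container request's state as a string ("Final" when it has finished) and that `cancel` does whatever is needed to cancel the request (in Arvados that would normally mean setting its priority to 0).

```python
import time
from typing import Callable


def wait_or_cancel(
    poll: Callable[[], str],
    cancel: Callable[[], None],
    timeout: float = 600.0,   # 10 minutes, matching the timeout mentioned above
    interval: float = 5.0,    # polling interval; an arbitrary choice for this sketch
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll a test request until it is final; cancel it if the timeout passes.

    `poll` and `cancel` are hypothetical callables supplied by the caller.
    Returns "Final" if the request finished, or "TimedOut" if it was cancelled.
    """
    deadline = clock() + timeout
    while True:
        state = poll()
        if state == "Final":
            return state
        if clock() >= deadline:
            # Nobody is waiting for the result any more, so stop the
            # dispatcher from retrying by cancelling the request.
            cancel()
            return "TimedOut"
        sleep(interval)
```

Passing `clock` and `sleep` as parameters is just a convenience so the timeout path can be exercised without actually waiting ten minutes.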
Updated by Tom Clegg about 2 years ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|9b976ee4d3e0c58eaa81f28f13dc4d112dbf804b.