Document use of diagnostics & health check to check running versions, config file matching, & overall cluster functioning
- Status changed from New to In Progress
The diagnostics command incorporates a health-check, so in that sense it's included. We could also add:
- a page (in the Admin>Monitoring section?) about the diagnostics command, similar to the "Testing cloud configuration" page
- a bit about the "arvados-server health" command on the Admin > Monitoring > Health checks page
Updated by Lucas Di Pentima 2 months ago
Just one comment:
doc/admin/diagnostics.html.textile.liquid: I think the extra "using" could be dropped at "...you can also run diagnostics using by setting the usual..."
The rest LGTM, thanks.
On a "diagnostics"-related note: do you think it would be a good idea for the diagnostics tool to cancel the test container request when the 10-minute timeout passes and nothing has happened? In my Terraform adventures, an instance was sometimes launched but then couldn't run anything, so the diag tool failed at the 10-minute mark and the instance wasn't destroyed unless I manually cancelled the CR in Workbench (maybe the dispatcher would eventually kill it?)
Removed extra "using" word in docs.
"Cancel CR if it doesn't finish" is a great idea -- added.
(In the particular case you described I think MaxDispatchAttempts would have canceled it eventually -- but either way, it seems like there's no reason for the dispatcher to keep trying once diagnostics has stopped paying attention.)
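For anyone following along, the "cancel CR if it doesn't finish" behavior can be sketched roughly as below. This is only an illustration of the wait-then-cancel pattern, not the actual diagnostics code: `wait_or_cancel`, `is_finished`, and `cancel` are hypothetical names, and the two callbacks stand in for whatever API calls poll and cancel the container request.

```python
import time
from typing import Callable


def wait_or_cancel(
    is_finished: Callable[[], bool],
    cancel: Callable[[], None],
    timeout: float = 600.0,       # 10 minutes, matching the timeout discussed above
    poll_interval: float = 5.0,   # assumed polling period, not taken from the source
) -> bool:
    """Poll until is_finished() returns True or the timeout expires.

    On timeout, call cancel() once so nothing keeps working on a request
    that the caller has stopped watching. Returns True if the work
    finished in time, False if it was cancelled.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_finished():
            return True
        time.sleep(poll_interval)
    # Timed out: cancel instead of leaving the request pending.
    cancel()
    return False
```

The point of the design is that the cancel call happens on the same code path that gives up waiting, so cleanup does not depend on a separate limit such as MaxDispatchAttempts eventually being reached.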