Tom Clegg wrote in #note-3:
Generally, an instance should run an image specifically designed for the purpose of running containers. If it has some services that are optional, the best solution may be to remove them.
I agree this is the ideal, but sometimes there are conflicting requirements. The situation that inspired this ticket is a user who must run additional security services as part of company policy. One of those services is failing and leaving the system in a degraded state. That's out of Arvados' control.
If there's a real need for a "best effort" systemd unit…
One thing that I think is good to note here is that there doesn't need to be a single unit. (IMO this was a problem with the previous docker ps check: it was functionally only checking one service, Docker.) As one idea, you can query multiple units with a single systemctl status command, and the exit code will tell you if they all succeeded or if any of them failed. It seems semi-doable to write a command along those lines that queries all the services Crunch cares about: filesystem and network and Docker and…
There would be challenges to making a command like that a default. Some service names may vary by distro, and some services you may want to query depending on configuration (e.g., Slurm/LSF). But I think it's at least worth the effort to hash out what it would look like first, and then consider what to do with it once we have a better idea of how simple it is or isn't. If we can write a version that's pretty expansive and cross-distro, I'd at least want to talk about it as a possible new default.
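To make the idea a bit more concrete, here is a rough sketch of what a multi-unit probe might look like. It is only an illustration, not a proposed default: the function name probe_units is made up, and it loops over systemctl is-active one unit at a time rather than relying on the exit code of a single multi-unit call, so that the result for each unit is unambiguous. The unit names are left to the caller because, as noted above, they vary by distro and configuration.

```shell
#!/bin/sh
# Sketch only: succeed when every listed unit is active, fail otherwise.
# probe_units is a hypothetical name; unit names are supplied by the
# caller because they differ between distros and configurations.
probe_units() {
    rc=0
    for unit in "$@"; do
        # `systemctl is-active --quiet UNIT` exits 0 only when UNIT is active.
        if ! systemctl is-active --quiet "$unit"; then
            echo "boot probe: $unit is not active" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

It would be called with whatever units matter on a given image, for example probe_units docker.service network-online.target (those two names are just examples). Unlike a whole-system check, an unrelated failing service would not make this probe fail.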
I am more inclined to think of this as a documentation issue ("how to change default") and a monitoring/alerting issue ("how to detect and troubleshoot when instances aren't coming up reliably").
I am open to this approach too, but if we go this way I think part of the documentation change should be to elevate this issue from "this is something you can change" to "this is something you should definitely consider changing." I think this issue is difficult for new administrators to diagnose when they don't realize this configuration option is here or understand how strict the default check is. Also, new administrators are more likely to be running a generic distro/image that isn't necessarily purpose-built for compute nodes and more likely to have incidental failing services. In other words, an administrator who only knows "my compute nodes seem to boot fine but Crunch thinks they all failed" needs to be able to find that problem described and the solution in the documentation, without knowing the names of BootProbeCommand, systemctl is-system-running, etc.