Idea #20517
open: Impedance mismatch using `systemctl is-system-running` as default BootProbeCommand
Description
`systemctl is-system-running` returns nonzero and prints `degraded` if any services failed. This is not an uncommon state. It usually doesn't mean that the system is unusable, just that one specific service failed to start for some reason.
However, crunch uses this as the default `BootProbeCommand` and considers a node unusable when the command returns nonzero. This means crunch gives up on boots that are perfectly usable for its purposes, just not perfect.
We should either:
- When the `BootProbeCommand` is `systemctl is-system-running`, parse the output line and consider `degraded` to be a successful boot (among others, see `systemctl(1)` for all possibilities).
- Change the default `BootProbeCommand` to something closer to what we want. Probably the next best thing is to check the status of one or more of the special targets we consider requisite, like `multi-user.target` and `network-online.target` (see `systemd.special(7)`).
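To make the first option concrete, here is a minimal sketch of a probe that reads the state line instead of trusting the exit code. The function names `boot_state_ok` and `lenient_boot_probe` are hypothetical, not part of Arvados, and the choice to accept only `running` and `degraded` is an assumption; `systemctl(1)` lists other states that might also deserve consideration.

```shell
# Hypothetical helper (not part of Arvados): decide whether a state line
# printed by `systemctl is-system-running` counts as a usable boot.
# Assumption: only "running" and "degraded" are accepted.
boot_state_ok() {
    case "$1" in
        running|degraded) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical probe: capture the state line and ignore systemctl's own
# exit status, which is nonzero for every state other than "running".
lenient_boot_probe() {
    state="$(systemctl is-system-running 2>/dev/null || true)"
    boot_state_ok "$state"
}
```

Calling `lenient_boot_probe` exits zero on a `running` or `degraded` system and nonzero otherwise. Whether something like this belongs inside crunch or in a site-specific `BootProbeCommand` is exactly the open question here.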
Updated by Tom Clegg over 1 year ago
I'm not convinced running containers on "degraded" systems is a good default. Historically, narrower checks have caused problems that are much harder to track down, like #18027, where we changed the default from "docker ps" to "systemctl is-system-running" to prevent crunch-run's temp dir from vanishing mid-run.
Generally, an instance should run an image specifically designed for the purpose of running containers. If it has some services that are optional, the best solution may be to remove them. If there's a real need for a "best effort" systemd unit, there is probably a way to express it so systemd understands it's "up enough" so it doesn't cause "degraded" state. As a last resort, a custom boot probe command would work, but it seems like it would necessarily be site-specific.
I am more inclined to think of this as a documentation issue ("how to change default") and a monitoring/alerting issue ("how to detect and troubleshoot when instances aren't coming up reliably").
Updated by Brett Smith over 1 year ago
Tom Clegg wrote in #note-3:
> Generally, an instance should run an image specifically designed for the purpose of running containers. If it has some services that are optional, the best solution may be to remove them.
I agree this is the ideal but sometimes there are conflicting requirements. The situation that inspired this ticket is a user who must run additional security services as part of company policy. One of those services is failing and leading to degraded state. That's out of Arvados' control.
> If there's a real need for a "best effort" systemd unit…
One thing that I think is good to note here is there doesn't need to be a single unit. (IMO this was a problem with the previous `docker ps` check, it was functionally only checking one service, Docker.) As one idea, you can query multiple units with a single `systemctl status` command, and the exit code will tell you if they all succeeded or if any of them failed. It seems semi-doable to write a command like that that queries all the services Crunch cares about: filesystem and network and Docker and…
There would be challenges to making a command like that a default. Some service names may vary by distro, and some services you may want to query depending on configuration (e.g., Slurm/LSF). But I think it's at least worth the effort to hash out what it would look like first, and then consider what to do with it once we have a better idea of how simple it is or isn't. If we can write a version that's pretty expansive and cross-distro, I'd at least want to talk about it as a possible new default.
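As a starting point for hashing that out, a rough sketch of such a multi-unit probe might look like the following. The variable `REQUIRED_UNITS`, the function name `units_boot_probe`, and the particular units listed are all assumptions for illustration; the real list would depend on distro and configuration, as noted above.

```shell
# Hypothetical unit list: names vary by distro and site configuration.
REQUIRED_UNITS="multi-user.target network-online.target docker.socket"

# Hypothetical probe: query every required unit in a single
# `systemctl status` call and rely on its exit code.
units_boot_probe() {
    # Word splitting on $REQUIRED_UNITS is intentional: one argument per unit.
    # shellcheck disable=SC2086
    systemctl status --no-pager $REQUIRED_UNITS >/dev/null 2>&1
}
```

Unlike `systemctl is-system-running`, this only looks at the units it is told about, so an unrelated failed service would not cause the probe to fail.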
> I am more inclined to think of this as a documentation issue ("how to change default") and a monitoring/alerting issue ("how to detect and troubleshoot when instances aren't coming up reliably").
I am open to this approach too, but if we go this way I think part of the documentation change should be to elevate this issue from "this is something you can change" to "this is something you should definitely consider changing." I think this issue is difficult for new administrators to diagnose when they don't realize this configuration option is here or understand how strict the default check is. Also, new administrators are more likely to be running a generic distro/image that isn't necessarily purpose-built for compute nodes and more likely to have incidental failing services. In other words, an administrator who only knows "my compute nodes seem to boot fine but Crunch thinks they all failed" needs to be able to find that problem described and the solution in the documentation, without knowing the names of `BootProbeCommand`, `systemctl is-system-running`, etc.
Updated by Lucas Di Pentima over 1 year ago
For the record, this is what worked for the customer in question:
BootProbeCommand: 'systemctl status multi-user.target docker.socket'