Idea #20517

open

Impedance mismatch using `systemctl is-system-running` as default BootProbeCommand

Added by Brett Smith over 1 year ago. Updated over 1 year ago.

Status:
New
Priority:
Normal
Assigned To:
-
Category:
Crunch
Target version:
Start date:
Due date:
Story points:
-

Description

`systemctl is-system-running` returns nonzero and prints `degraded` if any service has failed. This is not an uncommon state. It usually doesn't mean that the system is unusable, just that one specific service failed to start for some reason.

However, Crunch uses this as the default `BootProbeCommand` and considers a node unusable when the command returns nonzero. This means Crunch gives up on boots that are usable for its purposes, just not perfect.

We should either:

  • When the `BootProbeCommand` is `systemctl is-system-running`, parse the output line and consider `degraded`, among other states, to be a successful boot (see systemctl(1) for all possible states).
  • Change the default `BootProbeCommand` to something closer to what we want. Probably the next best thing is to check the status of one or more of the special targets we consider requisite, like `multi-user.target` and `network-online.target` (see systemd.special(7)).
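
As a rough illustration of the first option, here is a minimal shell sketch. The function names `boot_state_ok` and `probe_boot` are hypothetical, and the choice to accept exactly `running` and `degraded` is an assumption for discussion, not a decided policy. It relies on `systemctl is-system-running` printing the state on stdout even when it exits nonzero.

```shell
# Hypothetical helper: decide whether a state string printed by
# `systemctl is-system-running` counts as a usable boot.
boot_state_ok() {
    case "$1" in
        running|degraded) return 0 ;;  # assumption: degraded is good enough
        *) return 1 ;;                 # starting, maintenance, empty output, etc.
    esac
}

# Hypothetical probe: capture the printed state, ignore systemctl's own
# exit status, and let boot_state_ok decide the result.
probe_boot() {
    state=$(systemctl is-system-running 2>/dev/null) || true
    boot_state_ok "$state"
}
```

Keeping the classification separate from the `systemctl` call makes the accepted states easy to review and to change per site.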
Actions #2

Updated by Brett Smith over 1 year ago

  • Description updated (diff)
Actions #3

Updated by Tom Clegg over 1 year ago

I'm not convinced running containers on "degraded" systems is a good default. Historically, narrower checks have caused problems that are much harder to track down, like #18027, where we changed the default from "docker ps" to "systemctl is-system-running" to prevent crunch-run's temp dir from vanishing mid-run.

Generally, an instance should run an image specifically designed for the purpose of running containers. If it has some services that are optional, the best solution may be to remove them. If there's a real need for a "best effort" systemd unit, there is probably a way to express it so systemd understands it's "up enough" so it doesn't cause "degraded" state. As a last resort, a custom boot probe command would work, but it seems like it would necessarily be site-specific.

I am more inclined to think of this as a documentation issue ("how to change default") and a monitoring/alerting issue ("how to detect and troubleshoot when instances aren't coming up reliably").

Actions #4

Updated by Brett Smith over 1 year ago

Tom Clegg wrote in #note-3:

> Generally, an instance should run an image specifically designed for the purpose of running containers. If it has some services that are optional, the best solution may be to remove them.

I agree this is the ideal but sometimes there are conflicting requirements. The situation that inspired this ticket is a user who must run additional security services as part of company policy. One of those services is failing and leading to degraded state. That's out of Arvados' control.

> If there's a real need for a "best effort" systemd unit…

One thing that I think is good to note here is there doesn't need to be a single unit. (IMO this was a problem with the previous `docker ps` check: it was functionally only checking one service, Docker.) As one idea, you can query multiple units with a single `systemctl status` command, and the exit code will tell you if they all succeeded or if any of them failed. It seems semi-doable to write such a command that queries all the services Crunch cares about: filesystem and network and Docker and

There would be challenges to making a command like that a default. Some service names may vary by distro, and some services you may want to query depending on configuration (e.g., Slurm/LSF). But I think it's at least worth the effort to hash out what it would look like first, and then consider what to do with it once we have a better idea of how simple it is or isn't. If we can write a version that's pretty expansive and cross-distro, I'd at least want to talk about it as a possible new default.
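
To make the discussion concrete, here is a minimal sketch of what a multi-unit check could look like. It uses a loop over `systemctl is-active --quiet` rather than a single `systemctl status` call, so that every unit is checked individually. The function names are hypothetical, and the unit list in the final comment is only an example; which units belong there is exactly the open question.

```shell
# Hypothetical wrapper around the real check, so it can be swapped or stubbed.
unit_is_active() {
    systemctl is-active --quiet "$1"
}

# Succeed only if every listed unit is active; report the first one that is not.
all_units_active() {
    for unit in "$@"; do
        if ! unit_is_active "$unit"; then
            echo "not active: $unit" >&2
            return 1
        fi
    done
    return 0
}

# Example invocation (unit names are an assumption; they vary by distro and site):
# all_units_active multi-user.target network-online.target docker.socket
```

Printing the first inactive unit to stderr would also help with the troubleshooting side, since the failing dependency is named instead of hidden behind a generic nonzero exit.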

> I am more inclined to think of this as a documentation issue ("how to change default") and a monitoring/alerting issue ("how to detect and troubleshoot when instances aren't coming up reliably").

I am open to this approach too, but if we go this way I think part of the documentation change should be to elevate this issue from "this is something you can change" to "this is something you should definitely consider changing." I think this issue is difficult for new administrators to diagnose when they don't realize this configuration option is here or understand how strict the default check is. Also, new administrators are more likely to be running a generic distro/image that isn't necessarily purpose-built for compute nodes and is more likely to have incidental failing services. In other words, an administrator who only knows "my compute nodes seem to boot fine but Crunch thinks they all failed" needs to be able to find that problem and its solution described in the documentation, without knowing the names `BootProbeCommand`, `systemctl is-system-running`, etc.

Actions #5

Updated by Lucas Di Pentima over 1 year ago

For the record, this is what worked for the customer in question:

BootProbeCommand: 'systemctl status multi-user.target docker.socket'