Idea #21581
open
Crunch saves compute node journals to collections readable only by administrators
Description
Problem:
- Compute nodes and tasks can fail for any number of reasons. You basically need a full system log to diagnose some problems.
- We can't just give users the system log: there's too much sensitive information in there and it's practically impossible to reliably know what needs to be redacted.
- And even if we could, regular users mostly can't act on this information, and it may need to be subject to different retention policies than regular container logs, etc.
Big idea: Crunch occasionally saves the system journal (and other logs?) to a collection that should only be readable by Arvados administrators. Administrators can go back and review these logs to diagnose problems.
Implementation idea:
- crunch-run gains a subcommand to upload the journal to a collection. When you run it, it:
  - Runs journalctl --sync to make sure all entries so far are written to disk
    - TBD: Does this need sudo?
    - The rest of the work should probably continue even if this command fails. Even if it means we can't get all the logs, we might as well capture what we can.
  - Creates a collection from the recursive contents of /var/log/journal
    - TBD: Any other log files we should throw in?
    - The collection should have a property that indicates which container(s) these system logs correspond to. This should be a system property with the arv: prefix that's documented.
    - The collection should have a trashed_at time in the future. TBD: Should this time be configurable? If it's set to zero, should this functionality be disabled?
- crunch-dispatch calls this crunch-run command when specific events occur
  - When a container finishes
  - When the cloud dispatcher decides to terminate a node
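To make the subcommand's steps concrete, here is a rough sketch in Python (crunch-run itself is not written in Python, so this only illustrates the flow). The function names are hypothetical, the actual collection upload is left out, and treating a zero retention as "disabled" is just one possible answer to the TBD above, not a decision.

```python
import os
import subprocess
from datetime import datetime, timedelta, timezone


def sync_journal(command=("journalctl", "--sync")):
    """Ask journald to flush pending entries. Returns True on success.

    Best effort: any failure (missing binary, permission error, timeout)
    is swallowed so the rest of the upload can still run.
    """
    try:
        subprocess.run(command, check=True, capture_output=True, timeout=60)
        return True
    except (OSError, subprocess.SubprocessError):
        return False


def list_journal_files(root="/var/log/journal"):
    """Return sorted paths, relative to root, of every file under root."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.append(os.path.relpath(full, root))
    return sorted(found)


def trash_deadline(retention_seconds, now=None):
    """Return when the collection should be trashed, or None if disabled.

    Assumption (not decided in this ticket): a retention of zero or less
    means journals are not saved at all.
    """
    if retention_seconds <= 0:
        return None
    now = now or datetime.now(timezone.utc)
    return now + timedelta(seconds=retention_seconds)
```

A caller would run sync_journal() first, ignore a False result apart from logging it, and then hand the output of list_journal_files() to whatever creates the collection.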
Setup that needs to happen:
- There needs to be a dedicated Unix account on the compute nodes to run this
  - It should be a member of the systemd-journal group to read the journal
  - It may need passwordless sudo permission to run journalctl --sync
- Permissions can be limited on the Arvados side
  - This can't use the same API token as the container because the permissions are completely different
  - The token could be scoped pretty narrowly: just permission to PUT a collection and GET the owning project and similar related resources
  - It seems like we either need (a) a dedicated user account that just has all these journal collections in its home project, or (b) a configurable UUID of a project where all these journal collections are saved
Background:
- Read a saved journal with journalctl --root=PATH
- We considered setting something up that automatically does this when the node goes down (a service that's WantedBy=shutdown.target?). It has the advantage that it could work even if crunch-dispatch has trouble coordinating with the compute node, but:
  - The upload might take a while and we're not sure if systemd and/or the cloud provider would be patient enough to let it run
  - It would require us to permanently store credentials somewhere, which isn't insurmountable but something we generally avoid doing
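For the first Background item, an administrator could wrap the read-back step in a small helper like the one below. This is only a sketch: the function name and the example path are hypothetical. As far as I know, --root treats PATH as a filesystem root and looks for journals under PATH/var/log/journal, so a downloaded collection may need to be placed at that relative path first (or read with --directory instead).

```python
import shlex


def journalctl_read_command(root, extra_args=("--no-pager",)):
    """Build the argv for reading a saved journal tree rooted at `root`."""
    return ["journalctl", f"--root={root}", *extra_args]


# Hypothetical download location, for illustration only.
command = journalctl_read_command("/tmp/saved-node")
print(shlex.join(command))
```

Returning a list instead of a shell string keeps the command safe to pass to subprocess.run without quoting problems.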
Related issues
Updated by Brett Smith 8 months ago
- Related to Idea #21424: Way to run a diagnostic container that captures all system logs, not just Crunch's added