Idea #21581
Crunch saves compute node journals to collections readable only by administrators
Description
Problem:
- Compute nodes and tasks can fail for any number of reasons. You basically need a full system log to diagnose some problems.
- We can't just give users the system log: there's too much sensitive information in there and it's practically impossible to reliably know what needs to be redacted.
- And even if it weren't, regular users mostly can't act on this information, and it may need to be subject to different retention policies than regular container logs, etc.
Big idea: Crunch occasionally saves the system journal (and other logs?) to a collection that should only be readable by Arvados administrators. Administrators can go back and review these logs to diagnose problems.
Implementation idea:
- crunch-run gains a subcommand to upload the journal to a collection. When you run it, it:
  - Runs journalctl --sync to make sure all entries so far are written to disk
    - TBD: Does this need sudo?
    - The rest of the work should probably continue even if this command fails. Even if it means we can't get all the logs, we might as well capture what we can.
  - Creates a collection from the recursive contents of /var/log/journal
    - TBD: Any other log files we should throw in?
    - The collection should have a property that indicates which container(s) these system logs correspond to. This should be a system property with the arv: prefix that's documented.
    - The collection should have a trashed_at time in the future. TBD: Should this time be configurable? If it's set to zero, should this functionality be disabled?
- crunch-dispatch calls this crunch-run command when specific events occur
  - When a container finishes
  - When the cloud dispatcher decides to terminate a node
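The subcommand steps above could be sketched roughly as follows. This is only an illustration in Python, not the crunch-run implementation: the property name arv:journal_containers is hypothetical (only the arv: prefix comes from this idea), treating a zero lifetime as "disabled" is just one possible answer to the TBD, and the trash_at field name is an assumption about how the collection's trash time would be set.

```python
import datetime
import subprocess


def sync_journal(run=subprocess.run):
    """Ask journald to flush entries to disk; report failure instead of raising."""
    try:
        result = run(["journalctl", "--sync"], check=False)
        return result.returncode == 0
    except OSError:
        # journalctl is missing or not executable: carry on with what is on disk.
        return False


def journal_collection_body(container_uuids, ttl_seconds, now=None):
    """Build metadata for the journal collection.

    Returns None when ttl_seconds is zero or negative, treating that as
    "feature disabled" (an assumption, not a decided behavior).
    """
    if ttl_seconds <= 0:
        return None
    now = now or datetime.datetime.now(datetime.timezone.utc)
    trash_time = now + datetime.timedelta(seconds=ttl_seconds)
    return {
        "properties": {
            # Hypothetical property name; only the "arv:" prefix is from the idea.
            "arv:journal_containers": sorted(container_uuids),
        },
        # Assumed field name for the trash time described in the text.
        "trash_at": trash_time.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
```

A caller would run sync_journal() first, ignore a False result apart from logging it, and then upload /var/log/journal with the returned metadata.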
Setup that needs to happen:
- There needs to be a dedicated Unix account on the compute nodes to run this
  - It should be a member of the systemd-journal group to read the journal
  - It may need sudo permission to run journalctl --sync passwordless
- Permissions can be limited on the Arvados side
  - This can't use the same API token as the container because the permissions are completely different
  - The token could be scoped pretty narrowly: just permission to PUT a collection and GET the owning project and similar related resources
  - It seems like we either need (a) a dedicated user account that just has all these journal collections in its home project, or (b) a configurable UUID of a project where all these journal collections are saved
Background:
- Read a saved journal with journalctl --root=PATH
- We considered setting something up that automatically does this when the node goes down (a service that's WantedBy=shutdown.target?) It has the advantage that it could work even if crunch-dispatch has trouble coordinating with the compute node, but:
  - The upload might take a while and we're not sure if systemd and/or the cloud provider would be patient enough to let it run
  - It would require us to permanently store credentials somewhere, which isn't insurmountable but something we generally avoid doing
Updated by Brett Smith about 1 year ago
- Related to Idea #21424: Way to run a diagnostic container that captures all system logs, not just Crunch's added
Updated by Brett Smith 2 days ago
- Related to Bug #17186: [dispatch] crunch-run logs should be copied to a-d-c logs if crunch-run can't save a log collection added
Updated by Tom Clegg 1 day ago
There is a class of worker node/image problems that prevent crunch-run from completing any Arvados API calls (e.g., network, firewall, and name resolution issues). So we'll still have vanishing-log problems if we rely entirely on crunch-run saving logs to a collection.
If crunch-run can run at all, we know arvados-dispatch-cloud can execute SSH commands on the worker node. So, to cover the most ground, we can have arvados-dispatch-cloud preserve the logs by running journalctl --sync && tar cf - /var/log/journal on the worker and extracting that to a collection. (If we really want to minimize network traffic, we can implement both: crunch-run tries to save a log collection itself, but if that doesn't work, arvados-dispatch-cloud does it.)
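The "implement both" variant could look something like the sketch below, written in Python purely for illustration. The function names and the two callables are hypothetical stand-ins for whatever crunch-run and arvados-dispatch-cloud would actually do. The tar handling assumes the dispatcher receives the archive as a sequential byte stream, such as the stdout of the SSH command.

```python
import tarfile

# Command from the comment above. Note that with "&&", tar is skipped
# entirely if "journalctl --sync" exits non-zero.
REMOTE_COMMAND = "journalctl --sync && tar cf - /var/log/journal"


def preserve_journal(save_via_crunch_run, save_via_dispatcher):
    """Try crunch-run first; fall back to the dispatcher. Returns who saved it."""
    try:
        save_via_crunch_run()
        return "crunch-run"
    except Exception:
        # crunch-run could not reach the API (network, firewall, DNS, ...).
        save_via_dispatcher(REMOTE_COMMAND)
        return "arvados-dispatch-cloud"


def tar_member_names(stream):
    """List regular files in a tar archive read sequentially from a stream."""
    with tarfile.open(fileobj=stream, mode="r|") as archive:
        return [member.name for member in archive if member.isfile()]
```

Reading the archive in stream mode ("r|") avoids seeking, so the dispatcher would not need to buffer the whole journal before writing files into the collection.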