Idea #21581

open

Crunch saves compute node journals to collections readable only by administrators

Added by Brett Smith 7 months ago. Updated 5 months ago.

Status:
New
Priority:
Normal
Assigned To:
-
Category:
Crunch
Target version:
Start date:
Due date:
Story points:
-

Description

Problem:

  • Compute nodes and tasks can fail for any number of reasons, and diagnosing some of those failures requires the full system log.
  • We can't just give users the system log: it contains too much sensitive information, and it's practically impossible to reliably know what needs to be redacted.
  • Even if we could redact it, regular users mostly can't act on this information, and it may need to be subject to different retention policies than regular container logs, etc.

Big idea: Crunch occasionally saves the system journal (and other logs?) to a collection that should only be readable by Arvados administrators. Administrators can go back and review these logs to diagnose problems.

Implementation idea:

  • crunch-run gains a subcommand to upload the journal to a collection. When you run it, it:
    • Runs journalctl --sync to make sure all entries so far are written to disk
      • TBD: Does this need sudo?
      • The rest of the work should probably continue even if this command fails. Even if it means we can't get all the logs, we might as well capture what we can.
    • Creates a collection from the recursive contents of /var/log/journal
      • TBD: Any other log files we should throw in?
      • The collection should have a property that indicates which container(s) these system logs correspond to. This should be a documented system property with the arv: prefix.
      • The collection should have a trash_at time in the future. TBD: Should this time be configurable? If it's set to zero, should this functionality be disabled?
  • crunch-dispatch calls this crunch-run command when specific events occur
    • When a container finishes
    • When the cloud dispatcher decides to terminate a node

Setup that needs to happen:

  • There needs to be a dedicated Unix account on the compute nodes to run this
    • It should be a member of the systemd-journal group to read the journal
    • It may need sudo permission to run journalctl --sync passwordless
  • Permissions can be limited on the Arvados side
    • This can't use the same API token as the container because the permissions are completely different
    • The token could be scoped pretty narrowly: just permission to PUT a collection and GET the owning project and similar related resources
    • It seems like we either need (a) a dedicated user account that just has all these journal collections in its home project, or (b) a configurable UUID of a project where all these journal collections are saved
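
To make the narrow token scoping above more concrete, here is a hedged Python sketch of what a scope list and a scope check could look like. It assumes Arvados token scopes are strings of the form "METHOD /path", that a scope path ending in "/" acts as a prefix match, and that "all" grants everything; those rules should be verified against the API token documentation. The scope list itself is hypothetical and probably incomplete (it does not cover writing the file data, for instance). The ticket says PUT; this sketch assumes a new collection is created with POST, which would also need checking.

```python
# Hypothetical scope list for the journal-upload token.
JOURNAL_TOKEN_SCOPES = [
    "POST /arvados/v1/collections",   # create the journal collection
    "GET /arvados/v1/groups/",        # look up the owning project
    "GET /arvados/v1/users/current",  # identify the token's own user
]


def request_allowed(method, path, scopes=JOURNAL_TOKEN_SCOPES):
    """Return True if any scope permits this request (simplified rules)."""
    for scope in scopes:
        if scope == "all":
            return True
        scope_method, _, scope_path = scope.partition(" ")
        if scope_method != method:
            continue
        if path == scope_path:
            return True
        # Assumed rule: a trailing slash makes the scope a prefix match.
        if scope_path.endswith("/") and path.startswith(scope_path):
            return True
    return False
```

Under these assumptions a stolen journal token could create collections but could not list, read, or delete existing ones, which is the property the setup above is after.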

Background:

  • Read a saved journal with journalctl --root=PATH
  • We considered setting up something that does this automatically when the node goes down (a service that's WantedBy=shutdown.target?). That has the advantage that it could work even if crunch-dispatch has trouble coordinating with the compute node, but:
    • The upload might take a while and we're not sure if systemd and/or the cloud provider would be patient enough to let it run
    • It would require us to permanently store credentials somewhere, which isn't insurmountable but is something we generally avoid doing
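
For the first background point, reading a saved journal back, a small Python sketch of how an administrator's tooling might build the journalctl command line. One detail worth checking: journalctl --root=PATH expects PATH to look like a filesystem root (it reads PATH/var/log/journal), while journalctl --directory=PATH reads journal files directly under PATH. Which one applies depends on how the collection is laid out, which the ticket does not settle, so the layout parameter here is an assumption, as is the function name.

```python
def journalctl_read_command(saved_path, layout="root", extra_args=()):
    """Build an argv list for reading a journal saved at saved_path."""
    if layout == "root":
        # saved_path contains var/log/journal/..., like a filesystem root.
        location = f"--root={saved_path}"
    elif layout == "directory":
        # saved_path directly contains the machine-id directories.
        location = f"--directory={saved_path}"
    else:
        raise ValueError(f"unknown layout: {layout!r}")
    return ["journalctl", location, "--no-pager", *extra_args]
```

The returned list can be passed to subprocess.run; extra_args would carry the usual filters such as --since or --unit.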

Related issues

Related to Arvados - Idea #21424: Way to run a diagnostic container that captures all system logs, not just Crunch's (New)
#1

Updated by Brett Smith 7 months ago

  • Related to Idea #21424: Way to run a diagnostic container that captures all system logs, not just Crunch's added
#2

Updated by Peter Amstutz 5 months ago

  • Target version set to Future