Revision 2/5
Tom Clegg, 04/03/2024 08:08 PM


Troubleshooting aids

Troubleshoot usage problems:
  • Improve error messages (e.g., clients should not crash and dump a stack trace when a server is slow or unresponsive)
Troubleshoot compute nodes/images:
  • Idea #21581: Crunch saves compute node journals to collections readable only by administrators
  • Idea #21424: Way to run a diagnostic container that captures all system logs, not just Crunch's
Troubleshoot Arvados system services:
  • Save a snapshot of internals (goroutines / memory profile) of specified system service(s) to a collection, and provide instructions for viewing it
  • Save the last N minutes of logs from all Arvados services running on this host
  • Turn on debug mode temporarily, without restarting services
Expose config/scaling issues:
  • Scan metrics for recent "near/at capacity" signals
  • Probe for proper nginx/proxy config (e.g., max request body size)

Updated by Tom Clegg 9 months ago · 5 revisions