Feature #4598

closed

[Crunch] [DRAFT] Classify job failures by type, report statistics

Added by Ward Vandewege over 9 years ago. Updated over 9 years ago.

Status: Resolved
Priority: Normal
Assigned To: Tim Pierce
Category: Crunch
Target version:
Story points: 2.0

Description

For a given time interval:
  • Get a list of jobs created in this interval (no need to report the entire list)
  • Report #succeeded, #failed, #unfinished
  • For each failed job, examine the log file to classify failure
    • Find first permanent task failure
    • Find last few log messages from that task
    • Match against a list of telltale regexps like /Cannot destroy container/ and assign failure code like "sys/docker" (this can be very short at first, we'll refine it over time)
  • Report number of jobs for each failure code
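
The classification step could be sketched roughly as follows. This is only an illustration: the name classify_failure and the shape of its input (a list of the last few log lines from the failed task) are assumptions, and the pattern table holds just the one example given above.

```python
import re

# Starter table of (regexp, failure code) pairs, checked in order.
# Only this first entry comes from the description; more would be
# added as the "unknown" class is investigated.
FAILURE_PATTERNS = [
    (re.compile(r"Cannot destroy container"), "sys/docker"),
]


def classify_failure(log_tail):
    """Return the code of the first pattern that matches any line in
    log_tail, or "unknown" if nothing matches."""
    for pattern, code in FAILURE_PATTERNS:
        if any(pattern.search(line) for line in log_tail):
            return code
    return "unknown"
```

Keeping the patterns in an ordered list means more specific regexps can be placed ahead of general ones as the list is refined.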

We'll use (and refine) this by picking the largest class(es) of failure codes (sometimes including "unknown"), modifying the regexp list to get more helpful/specific error codes, fixing bugs, improving docs, etc.

Sample report:


Start          2014/12/22 00:00:00
End            2014/12/23 12:54:00

Overview

  Started        31
  Succeeded      12 (39%)
  Failed         18 (58%)
  In progress     1 ( 3%)

Failures by class

  sys/docker      6 (33%)
  user            5 (28%)
  sys/slurm       4 (22%)
  unknown         3 (17%)

Failures by class (detail)

  sys/docker      6 (33%)
    qr1hi-8i9sb-r6yn2i8160nwlma    Ward Vandewege    diagnostics hash output
    qr1hi-8i9sb-r6yn2i8250nwlmb    Ward Vandewege    output of hasher
    qr1hi-8i9sb-r6yn2i8340nwlmc    Ward Vandewege    diagnostics hash output
    qr1hi-8i9sb-r6yn2i8430nwlmd    Ward Vandewege    output of hasher
    qr1hi-8i9sb-r6yn2i8520nwlme    Ward Vandewege    diagnostics hash output
    qr1hi-8i9sb-r6yn2i8610nwlmf    Ward Vandewege    output of hasher

(and so on for each failure class)

Failures by class should be sorted in descending order of occurrence.
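
As a rough sketch of that ordering, the summary lines might be produced like this. The function name failures_by_class is hypothetical, the column widths are inferred from the sample report above, and breaking ties alphabetically is an assumption the description does not specify.

```python
from collections import Counter


def failures_by_class(codes):
    """Return one summary line per failure code, most frequent first."""
    counts = Counter(codes)
    total = sum(counts.values())
    lines = []
    # Sort by count, descending; ties are broken by code name (an
    # assumption) so that the output order is stable.
    for code, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        pct = round(100 * n / total)
        lines.append(f"  {code:<14}{n:>3} ({pct:>2}%)")
    return lines
```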

Failures by class detail output should look like this for each failure:

  • four spaces
  • uuid (27 bytes)
  • two spaces
  • name of the user running the job (15 chars, truncate if necessary)
  • two spaces
  • name of the job (29 chars, truncate if necessary)

That brings the line length to 79 characters, which is the intention.
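
A minimal sketch of that layout, using the field widths listed above. The function name detail_line is hypothetical, and stripping trailing padding from the last field is an assumption, so 79 is treated here as the maximum line length.

```python
def detail_line(uuid, user_name, job_name):
    """Format one failed-job line: 4 + 27 + 2 + 15 + 2 + 29 = 79 chars max."""
    # In a format spec such as "<15.15", the width pads on the right and
    # the precision truncates, so each field is exactly that many characters.
    line = f"    {uuid:<27.27}  {user_name:<15.15}  {job_name:<29.29}"
    # Assumption: trailing padding on the last field is dropped.
    return line.rstrip()
```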

For each failure class, the list of failed jobs should be sorted by date, ascending.
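
One way the grouping and per-class ordering might look. The dictionary keys "failure_code" and "created_at" are hypothetical, and sorting on creation time is an assumption, since the description says only "by date".

```python
from collections import defaultdict


def group_failed_jobs(jobs):
    """Group failed jobs by failure code, each group oldest first.

    Returns a list of (code, jobs) pairs with the largest class first.
    """
    groups = defaultdict(list)
    for job in jobs:
        groups[job["failure_code"]].append(job)
    for members in groups.values():
        # ISO 8601 timestamp strings sort chronologically as plain text.
        members.sort(key=lambda job: job["created_at"])
    # Largest classes first; ties broken by code name (an assumption).
    return sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))
```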


Subtasks 3 (0 open, 3 closed)

Task #4807: Review 4598-crunch-failure-stats    Resolved    Tim Pierce    12/12/2014
Task #4616: report job failures                 Resolved    Tim Pierce    12/12/2014
Task #4615: identify job failure types          Resolved    Tim Pierce    12/12/2014