Feature #4598
Updated by Ward Vandewege about 10 years ago
For a given time interval:

* Get a list of jobs created in this interval (no need to report the entire list)
* Report #succeeded, #failed, #unfinished
* For each failed job, examine the log file to classify failure
** Find first permanent task failure
** Find last few log messages from that task
** Match against a list of telltale regexps like @/Cannot destroy container/@ and assign failure code like @"sys/docker"@ (this can be very short at first, we'll refine it over time)
* Report number of jobs for each failure code

We'll use (and refine) this by picking the largest class(es) of failure codes (sometimes including "unknown"), modifying the regexp list to get more helpful/specific error codes, fixing bugs, improving docs, etc.

Sample report:

<pre>
Start 2014/12/22 00:00:00
End   2014/12/23 12:54:00

Overview
  Started      31
  Succeeded    12 (39%)
  Failed       18 (58%)
  In progress   1 ( 3%)

Failures by class:
  sys/docker    6 (33%)
  user          5 (28%)
  sys/slurm     4 (22%)
  unknown       3 (17%)

Failures by class (detail):
  sys/docker    6 (33%)
    http://curover.se/qr1hi-8i9sb-r6yn2i8160nwlma
    http://curover.se/qr1hi-8i9sb-r6yn2i8250nwlmb
    http://curover.se/qr1hi-8i9sb-r6yn2i8340nwlmc
    http://curover.se/qr1hi-8i9sb-r6yn2i8430nwlmd
    http://curover.se/qr1hi-8i9sb-r6yn2i8520nwlme
    http://curover.se/qr1hi-8i9sb-r6yn2i8610nwlmf
  (and so on for each failure class)
</pre>
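
A rough sketch of the classify-and-tally steps above, in Python, might look like the following. It is only an illustration under several assumptions: the job records are plain dicts with a hypothetical @state@ field (@"succeeded"@, @"failed"@, @"unfinished"@) and a hypothetical @log_tail@ list that already holds the last few log messages from the first permanently failed task. Only the @/Cannot destroy container/@ pattern comes from this issue; the second pattern is a made-up placeholder. Fetching jobs and locating the failed task in the log are not shown.

```python
import re
from collections import Counter

# Ordered list of (regexp, failure code); the first match wins.
# Only the first entry is taken from the issue text.
FAILURE_PATTERNS = [
    (re.compile(r"Cannot destroy container"), "sys/docker"),
    (re.compile(r"example slurm error"), "sys/slurm"),  # hypothetical placeholder
]


def classify_failure(log_tail):
    """Return the failure code for the first pattern that matches any line."""
    for pattern, code in FAILURE_PATTERNS:
        if any(pattern.search(line) for line in log_tail):
            return code
    return "unknown"


def percent(part, whole):
    """Whole-number percentage, or 0 when there is nothing to divide by."""
    return round(100 * part / whole) if whole else 0


def summarize(jobs):
    """Count jobs per state and failed jobs per failure code."""
    states = Counter(job["state"] for job in jobs)
    classes = Counter(
        classify_failure(job.get("log_tail", []))
        for job in jobs
        if job["state"] == "failed"
    )
    return states, classes


def format_report(jobs):
    """Build the overview and failures-by-class sections as plain text."""
    states, classes = summarize(jobs)
    total = len(jobs)
    failed = states["failed"]
    lines = ["Overview", f"  {'Started':<12} {total:3d}"]
    for state in ("succeeded", "failed", "unfinished"):
        n = states[state]
        lines.append(f"  {state.capitalize():<12} {n:3d} ({percent(n, total):2d}%)")
    lines.append("Failures by class:")
    # most_common() puts the largest failure class first.
    for code, n in classes.most_common():
        lines.append(f"  {code:<12} {n:3d} ({percent(n, failed):2d}%)")
    return "\n".join(lines)
```

Keeping the patterns in one ordered list means refining the classification is a matter of adding or reordering entries, and anything unmatched falls through to @"unknown"@, which is the class to inspect when deciding which regexp to add next.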