Feature #4598

[Crunch] [DRAFT] Classify job failures by type, report statistics

Added by Ward Vandewege over 5 years ago. Updated over 5 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Tim Pierce
Category:
Crunch
Target version:
Start date:
12/12/2014
Due date:
% Done:
100%

Estimated time:
(Total: 0.00 h)
Story points:
2.0

Description

For a given time interval
  • Get a list of jobs created in this interval (no need to report the entire list)
  • Report #succeeded, #failed, #unfinished
  • For each failed job, examine the log file to classify failure
    • Find first permanent task failure
    • Find last few log messages from that task
    • Match against a list of telltale regexps like /Cannot destroy container/ and assign failure code like "sys/docker" (this can be very short at first, we'll refine it over time)
  • Report number of jobs for each failure code

We'll use (and refine) this by picking the largest class(es) of failure codes (sometimes including "unknown"), modifying the regexp list to get more helpful/specific error codes, fixing bugs, improving docs, etc.
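The classification steps above can be sketched roughly as follows. This is only an illustration, not the code that was merged: the helper name `classify_failure`, the `context` parameter, and the assumption that each log line reads `<job uuid> <pid> <task number> <message>` (as in the output pasted later in this ticket) are all hypothetical. The only rule taken from the description is /Cannot destroy container/ mapping to "sys/docker".

```python
import re
from collections import defaultdict, deque

# Rule table: (regexp, failure code). Only this one pairing comes from the
# ticket; further rules would be added as failure classes are refined.
FAILURE_RULES = [
    (re.compile(r"Cannot destroy container"), "sys/docker"),
]

# Assumed log line layout: "<job uuid> <pid> <task number> <message>".
LOG_LINE = re.compile(r"^(\S+) (\d+) (\d+) (.*)$")
PERMANENT_FAILURE = re.compile(r"^failure \(#\d+, permanent\)")


def classify_failure(log_lines, context=5):
    """Return a failure code for one failed job, or "unknown"."""
    # Keep only the last `context` messages seen for each task.
    recent = defaultdict(lambda: deque(maxlen=context))
    for line in log_lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # lines without a task number are ignored in this sketch
        task, message = m.group(3), m.group(4)
        if PERMANENT_FAILURE.match(message):
            # First permanent task failure: inspect that task's last messages.
            for earlier in recent[task]:
                for pattern, code in FAILURE_RULES:
                    if pattern.search(earlier):
                        return code
            return "unknown"
        recent[task].append(message)
    return "unknown"
```

Anything that matches no rule falls through to "unknown", which is the class to watch when deciding which regexps to add next.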

Sample report:


Start          2014/12/22 00:00:00
End            2014/12/23 12:54:00

Overview

  Started        31
  Succeeded      12 (39%)
  Failed         18 (58%)
  In progress     1 ( 3%)

Failures by class

  sys/docker      6 (33%)
  user            5 (28%)
  sys/slurm       4 (22%)
  unknown         3 (17%)

Failures by class (detail)

  sys/docker      6 (33%)
    qr1hi-8i9sb-r6yn2i8160nwlma    Ward Vandewege    diagnostics hash output
    qr1hi-8i9sb-r6yn2i8250nwlmb    Ward Vandewege    output of hasher
    qr1hi-8i9sb-r6yn2i8340nwlmc    Ward Vandewege    diagnostics hash output
    qr1hi-8i9sb-r6yn2i8430nwlmd    Ward Vandewege    output of hasher
    qr1hi-8i9sb-r6yn2i8520nwlme    Ward Vandewege    diagnostics hash output
    qr1hi-8i9sb-r6yn2i8610nwlmf    Ward Vandewege    output of hasher

(and so on for each failure class)

Failures by class should be sorted in descending order of occurrence.

Failures by class detail output should look like this for each failure:

four spaces
uuid (27 bytes)
two spaces
name of the user running the job (15 chars, truncate if necessary)
two spaces
name of the job (29 chars, truncate if necessary)

That brings the line length to 79 characters, which is the intention.

For each failure class, the list of failed jobs should be sorted by date, ascending.
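A minimal sketch of the detail-line layout described above (4 + 27 + 2 + 15 + 2 + 29 = 79 characters). The function names and the dictionary keys `uuid`, `user_name`, `job_name` and `created_at` are assumptions made for illustration; trailing padding is stripped, so 79 is the maximum line length rather than a fixed one.

```python
def format_detail_line(uuid, user_name, job_name):
    """Four spaces, uuid (27), two spaces, user (15), two spaces, job name (29)."""
    # "{:<15.15}" pads to 15 characters and truncates anything longer.
    line = "    {:<27.27}  {:<15.15}  {:<29.29}".format(uuid, user_name, job_name)
    return line.rstrip()


def detail_lines(jobs):
    """Format the failed jobs of one failure class, oldest first."""
    # Assumes created_at is an ISO 8601 UTC string, which sorts correctly as text.
    ordered = sorted(jobs, key=lambda job: job["created_at"])
    return [
        format_detail_line(job["uuid"], job["user_name"], job["job_name"])
        for job in ordered
    ]
```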


Subtasks

Task #4807: Review 4598-crunch-failure-stats (Resolved, Tim Pierce)

Task #4616: report job failures (Resolved, Tim Pierce)

Task #4615: identify job failure types (Resolved, Tim Pierce)

Associated revisions

Revision a32c4f99
Added by Tim Pierce over 5 years ago

Merge branch '4598-crunch-failure-stats'

Fixes #4598.

History

#1 Updated by Tom Clegg over 5 years ago

  • Subject changed from [Crunch] tally kinds of job failures for a period of time to [Crunch] [DRAFT] Classify job failures by type, report statistics
  • Story points changed from 0.5 to 2.0

#2 Updated by Tim Pierce over 5 years ago

  • Assigned To set to Tim Pierce

#3 Updated by Ward Vandewege over 5 years ago

  • Status changed from New to In Progress

#4 Updated by Tim Pierce over 5 years ago

  • Target version changed from 2014-12-10 sprint to 2015-01-07 sprint

#5 Updated by Tim Pierce over 5 years ago

At b03a6a8:

Needs some tweaks to remove noise from the log lines (additional timestamps from Docker, pids, etc). Need feedback from ops to understand what is most useful here.

Calling syntax:

(arv4598)hitchcock:/home/twp/arvados/services/api/script% ./crunch-failure-report.py --help                                                 
usage: crunch-failure-report.py [-h] [--start START] [--end END]
                                [--match MATCH]

Produce a report of Crunch failures within a specified time range

optional arguments:
  -h, --help     show this help message and exit
  --start START  Start date and time
  --end END      End date and time
  --match MATCH  Regular expression to match on Crunch error output lines.

(The default match expression is 'fail')
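For reference, an argument parser producing help text like the above could look roughly like this. It is a sketch, not the script's actual source. The defaults for a missing --start or --end (end = now, start = 24 hours before end) are inferred from runs shown elsewhere in this ticket, and `TIME_FORMAT` and `parse_arguments` are hypothetical names.

```python
import argparse
import datetime

TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # e.g. 2014-12-01T00:00:00Z


def parse_arguments(argv=None):
    parser = argparse.ArgumentParser(
        description="Produce a report of Crunch failures within a specified time range")
    parser.add_argument("--start", help="Start date and time")
    parser.add_argument("--end", help="End date and time")
    parser.add_argument("--match", default="fail",
                        help="Regular expression to match on Crunch error output lines.")
    args = parser.parse_args(argv)

    # Assumed defaults: the report ends now and covers the preceding 24 hours.
    if args.end is None:
        now = datetime.datetime.now(datetime.timezone.utc)
        args.end = now.strftime(TIME_FORMAT)
    if args.start is None:
        end = datetime.datetime.strptime(args.end, TIME_FORMAT)
        args.start = (end - datetime.timedelta(days=1)).strftime(TIME_FORMAT)
    return args
```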

Example output:

(arv4598)hitchcock:/home/twp/arvados/services/api/script% ./crunch-failure-report.py --start=2014-12-01T00:00:00Z --end=2014-12-02T00:00:00Z
qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 428 stderr 2014/12/01 23:57:23 Error response from daemon: Cannot destroy container 3f6132fdbd376e0a5a7e797929cad6f47b516dd4125d06fb44c219f294c26f9d: Driver aufs failed to remove root filesystem 3f6132fdbd376e0a5a7e797929cad6f47b516dd4125d06fb44c219f294c26f9d: rename /tmp/docker/aufs/mnt/3f6132fdbd376e0a5a7e797929cad6f47b516dd4125d06fb44c219f294c26f9d /tmp/docker/aufs/mnt/3f6132fdbd376e0a5a7e797929cad6f47b516dd4125d06fb44c219f294c26f9d-removing: device or resource busy: 1
qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 84 stderr 2014/12/01 23:53:07 Error response from daemon: Cannot destroy container 3f70d7c88276914441dcd9f9e3a7eb254f1ed90f9cef445781543de897664cbb: Driver aufs failed to remove root filesystem 3f70d7c88276914441dcd9f9e3a7eb254f1ed90f9cef445781543de897664cbb: rename /tmp/docker/aufs/mnt/3f70d7c88276914441dcd9f9e3a7eb254f1ed90f9cef445781543de897664cbb /tmp/docker/aufs/mnt/3f70d7c88276914441dcd9f9e3a7eb254f1ed90f9cef445781543de897664cbb-removing: device or resource busy: 1
qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 330 stderr 2014/12/01 23:56:06 Error response from daemon: Cannot destroy container 501d9a5aab5fceebc7494d1a9c5d2c3c7b406bf2b0fb9dab01eebb546dd69150: Driver aufs failed to remove root filesystem 501d9a5aab5fceebc7494d1a9c5d2c3c7b406bf2b0fb9dab01eebb546dd69150: rename /tmp/docker/aufs/mnt/501d9a5aab5fceebc7494d1a9c5d2c3c7b406bf2b0fb9dab01eebb546dd69150 /tmp/docker/aufs/mnt/501d9a5aab5fceebc7494d1a9c5d2c3c7b406bf2b0fb9dab01eebb546dd69150-removing: device or resource busy: 1
qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 86 stderr 2014/12/01 23:53:08 Error response from daemon: Cannot destroy container ee1270f705c77f4e985437abf0c67ea0231e6eb22b70814e8263f6f5ef7e15cd: Driver aufs failed to remove root filesystem ee1270f705c77f4e985437abf0c67ea0231e6eb22b70814e8263f6f5ef7e15cd: rename /tmp/docker/aufs/mnt/ee1270f705c77f4e985437abf0c67ea0231e6eb22b70814e8263f6f5ef7e15cd /tmp/docker/aufs/mnt/ee1270f705c77f4e985437abf0c67ea0231e6eb22b70814e8263f6f5ef7e15cd-removing: device or resource busy: 1
qr1hi-8i9sb-z1sxm0ugggoehqn 15358 0 failure (#1, permanent) after 2 seconds: 1
qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 15 stderr 2014/12/01 23:52:22 Error response from daemon: Cannot destroy container 2fb1254f18bde47ae1ab13a4b2b987ab31d8411cb255fd2c0c6b2a2e76416001: Driver aufs failed to remove root filesystem 2fb1254f18bde47ae1ab13a4b2b987ab31d8411cb255fd2c0c6b2a2e76416001: rename /tmp/docker/aufs/mnt/2fb1254f18bde47ae1ab13a4b2b987ab31d8411cb255fd2c0c6b2a2e76416001 /tmp/docker/aufs/mnt/2fb1254f18bde47ae1ab13a4b2b987ab31d8411cb255fd2c0c6b2a2e76416001-removing: device or resource busy: 1
qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 333 stderr 2014/12/01 23:56:08 Error response from daemon: Cannot destroy container f3154195522711fb1ae30b63fe7e447d831e0ecc83c41c6190cb2b34a8a69977: Driver aufs failed to remove root filesystem f3154195522711fb1ae30b63fe7e447d831e0ecc83c41c6190cb2b34a8a69977: rename /tmp/docker/aufs/mnt/f3154195522711fb1ae30b63fe7e447d831e0ecc83c41c6190cb2b34a8a69977 /tmp/docker/aufs/mnt/f3154195522711fb1ae30b63fe7e447d831e0ecc83c41c6190cb2b34a8a69977-removing: device or resource busy: 1
qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 262 stderr 2014/12/01 23:55:19 Error response from daemon: Cannot destroy container 89babfa33f3632c9f1c8bebffbd06cc9e1ccce3e81518ec9cc7e761899998b40: Driver aufs failed to remove root filesystem 89babfa33f3632c9f1c8bebffbd06cc9e1ccce3e81518ec9cc7e761899998b40: rename /tmp/docker/aufs/mnt/89babfa33f3632c9f1c8bebffbd06cc9e1ccce3e81518ec9cc7e761899998b40 /tmp/docker/aufs/mnt/89babfa33f3632c9f1c8bebffbd06cc9e1ccce3e81518ec9cc7e761899998b40-removing: device or resource busy: 1
qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 418 stderr 2014/12/01 23:57:15 Error response from daemon: Cannot destroy container 2c447e04628f21b53bc6fbe14765e9f82531ee19e106e9a0348957be807553d2: Driver aufs failed to remove root filesystem 2c447e04628f21b53bc6fbe14765e9f82531ee19e106e9a0348957be807553d2: rename /tmp/docker/aufs/mnt/2c447e04628f21b53bc6fbe14765e9f82531ee19e106e9a0348957be807553d2 /tmp/docker/aufs/mnt/2c447e04628f21b53bc6fbe14765e9f82531ee19e106e9a0348957be807553d2-removing: device or resource busy: 1
qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 570 stderr 2014/12/01 23:59:11 Error response from daemon: Cannot destroy container 88c57dcaf80adb0f468ab8b85b6b0da6fa4ab4dcffebca080048dd266180cb01: Driver aufs failed to remove root filesystem 88c57dcaf80adb0f468ab8b85b6b0da6fa4ab4dcffebca080048dd266180cb01: rename /tmp/docker/aufs/mnt/88c57dcaf80adb0f468ab8b85b6b0da6fa4ab4dcffebca080048dd266180cb01 /tmp/docker/aufs/mnt/88c57dcaf80adb0f468ab8b85b6b0da6fa4ab4dcffebca080048dd266180cb01-removing: device or resource busy: 1
qr1hi-8i9sb-n5nrgn84ou6p5g9 15149 0 failure (#1, permanent) after 3 seconds: 1
qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 573 stderr 2014/12/01 23:59:14 Error response from daemon: Cannot destroy container 180e4c1b69a773b5242d886bc5814fb40e06c49ed8554a8e6514e8e3bcfac62d: Driver aufs failed to remove root filesystem 180e4c1b69a773b5242d886bc5814fb40e06c49ed8554a8e6514e8e3bcfac62d: rename /tmp/docker/aufs/mnt/180e4c1b69a773b5242d886bc5814fb40e06c49ed8554a8e6514e8e3bcfac62d /tmp/docker/aufs/mnt/180e4c1b69a773b5242d886bc5814fb40e06c49ed8554a8e6514e8e3bcfac62d-removing: device or resource busy: 1
qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 5 stderr 2014/12/01 23:52:17 Error response from daemon: Cannot destroy container 0a1184c8f8836aad6c49bb2136e916baa5a37b2884a8288cf2526e600decbd34: Driver aufs failed to remove root filesystem 0a1184c8f8836aad6c49bb2136e916baa5a37b2884a8288cf2526e600decbd34: rename /tmp/docker/aufs/mnt/0a1184c8f8836aad6c49bb2136e916baa5a37b2884a8288cf2526e600decbd34 /tmp/docker/aufs/mnt/0a1184c8f8836aad6c49bb2136e916baa5a37b2884a8288cf2526e600decbd34-removing: device or resource busy: 1
qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 411 stderr 2014/12/01 23:57:09 Error response from daemon: Cannot destroy container 18f5a7fdaa17d6bc250d8d0fb4063fead0bab042d587a73cfd0738fb231682a3: Driver aufs failed to remove root filesystem 18f5a7fdaa17d6bc250d8d0fb4063fead0bab042d587a73cfd0738fb231682a3: rename /tmp/docker/aufs/diff/18f5a7fdaa17d6bc250d8d0fb4063fead0bab042d587a73cfd0738fb231682a3 /tmp/docker/aufs/diff/18f5a7fdaa17d6bc250d8d0fb4063fead0bab042d587a73cfd0738fb231682a3-removing: device or resource busy: 1
qr1hi-8i9sb-z1sxm0ugggoehqn 15358  Every node has failed -- giving up on this round: 1
qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 504 stderr 2014/12/01 23:58:22 Error response from daemon: Cannot destroy container dcf455533c1fcf0e8247742fc59e33927585f0a177823fe6280b0f66946c3168: Driver aufs failed to remove root filesystem dcf455533c1fcf0e8247742fc59e33927585f0a177823fe6280b0f66946c3168: rename /tmp/docker/aufs/mnt/dcf455533c1fcf0e8247742fc59e33927585f0a177823fe6280b0f66946c3168 /tmp/docker/aufs/mnt/dcf455533c1fcf0e8247742fc59e33927585f0a177823fe6280b0f66946c3168-removing: device or resource busy: 1
qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 180 stderr 2014/12/01 23:54:20 Error response from daemon: Cannot destroy container ab905462517855f1d8029725f878c736b9d44671cf0e0647bf0dc7009d09982f: Driver aufs failed to remove root filesystem ab905462517855f1d8029725f878c736b9d44671cf0e0647bf0dc7009d09982f: rename /tmp/docker/aufs/mnt/ab905462517855f1d8029725f878c736b9d44671cf0e0647bf0dc7009d09982f /tmp/docker/aufs/mnt/ab905462517855f1d8029725f878c736b9d44671cf0e0647bf0dc7009d09982f-removing: device or resource busy: 1
qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 182 stderr 2014/12/01 23:54:22 Error response from daemon: Cannot destroy container 63e919966e1c7ede25bdf118edc67ebf1b1fc3b05f191b86409ed97ad19aa080: Driver aufs failed to remove root filesystem 63e919966e1c7ede25bdf118edc67ebf1b1fc3b05f191b86409ed97ad19aa080: rename /tmp/docker/aufs/mnt/63e919966e1c7ede25bdf118edc67ebf1b1fc3b05f191b86409ed97ad19aa080 /tmp/docker/aufs/mnt/63e919966e1c7ede25bdf118edc67ebf1b1fc3b05f191b86409ed97ad19aa080-removing: device or resource busy: 1
qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 17 stderr 2014/12/01 23:52:24 Error response from daemon: Cannot destroy container cb5f5d0e2b6b3fc95ac16770ce275d3823eead268479378b84668bbc8680952d: Driver aufs failed to remove root filesystem cb5f5d0e2b6b3fc95ac16770ce275d3823eead268479378b84668bbc8680952d: rename /tmp/docker/aufs/mnt/cb5f5d0e2b6b3fc95ac16770ce275d3823eead268479378b84668bbc8680952d /tmp/docker/aufs/mnt/cb5f5d0e2b6b3fc95ac16770ce275d3823eead268479378b84668bbc8680952d-removing: device or resource busy: 1
qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 8 stderr 2014/12/01 23:52:18 Error response from daemon: Cannot destroy container c336e8e34d7ac6f0eb3bfc6a9ad86249d23d260088b4fd55a57ca3cc23e70336: Driver aufs failed to remove root filesystem c336e8e34d7ac6f0eb3bfc6a9ad86249d23d260088b4fd55a57ca3cc23e70336: rename /tmp/docker/aufs/mnt/c336e8e34d7ac6f0eb3bfc6a9ad86249d23d260088b4fd55a57ca3cc23e70336 /tmp/docker/aufs/mnt/c336e8e34d7ac6f0eb3bfc6a9ad86249d23d260088b4fd55a57ca3cc23e70336-removing: device or resource busy: 1
qr1hi-8i9sb-old6z4llev4hmop 24262 0 failure (#1, permanent) after 5 seconds: 1
qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 501 stderr 2014/12/01 23:58:20 Error response from daemon: Cannot destroy container ba79d4676736571b11b275bc889a4c81e88eb6ad5225804b3245b29ac5d45f72: Driver aufs failed to remove root filesystem ba79d4676736571b11b275bc889a4c81e88eb6ad5225804b3245b29ac5d45f72: rename /tmp/docker/aufs/mnt/ba79d4676736571b11b275bc889a4c81e88eb6ad5225804b3245b29ac5d45f72 /tmp/docker/aufs/mnt/ba79d4676736571b11b275bc889a4c81e88eb6ad5225804b3245b29ac5d45f72-removing: device or resource busy: 1
qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 6 stderr 2014/12/01 23:52:17 Error response from daemon: Cannot destroy container 0e40c89cd408af0e4c7d7263ba48e607568673c760c1fb47d978488072ad19a5: Driver aufs failed to remove root filesystem 0e40c89cd408af0e4c7d7263ba48e607568673c760c1fb47d978488072ad19a5: rename /tmp/docker/aufs/mnt/0e40c89cd408af0e4c7d7263ba48e607568673c760c1fb47d978488072ad19a5 /tmp/docker/aufs/mnt/0e40c89cd408af0e4c7d7263ba48e607568673c760c1fb47d978488072ad19a5-removing: device or resource busy: 1
qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 175 stderr 2014/12/01 23:54:17 Error response from daemon: Cannot destroy container 3a9cd4bdd819e60a62ac8bef8226a8a0d4b47d2f1c75e215088c92602074e072: Driver aufs failed to remove root filesystem 3a9cd4bdd819e60a62ac8bef8226a8a0d4b47d2f1c75e215088c92602074e072: rename /tmp/docker/aufs/mnt/3a9cd4bdd819e60a62ac8bef8226a8a0d4b47d2f1c75e215088c92602074e072 /tmp/docker/aufs/mnt/3a9cd4bdd819e60a62ac8bef8226a8a0d4b47d2f1c75e215088c92602074e072-removing: device or resource busy: 1
qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 264 stderr 2014/12/01 23:55:20 Error response from daemon: Cannot destroy container 62c3a3cf35b7b2dc67f8113d1f14c226536d343717aaa06c041a6b27e43c2142: Driver aufs failed to remove root filesystem 62c3a3cf35b7b2dc67f8113d1f14c226536d343717aaa06c041a6b27e43c2142: rename /tmp/docker/aufs/mnt/62c3a3cf35b7b2dc67f8113d1f14c226536d343717aaa06c041a6b27e43c2142 /tmp/docker/aufs/mnt/62c3a3cf35b7b2dc67f8113d1f14c226536d343717aaa06c041a6b27e43c2142-removing: device or resource busy: 1
qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 341 stderr 2014/12/01 23:56:14 Error response from daemon: Cannot destroy container bbf658e6a1813908150fb32c629bb2b962d1ace02f9a50adb8a49ad544d2c116: Driver aufs failed to remove root filesystem bbf658e6a1813908150fb32c629bb2b962d1ace02f9a50adb8a49ad544d2c116: rename /tmp/docker/aufs/mnt/bbf658e6a1813908150fb32c629bb2b962d1ace02f9a50adb8a49ad544d2c116 /tmp/docker/aufs/mnt/bbf658e6a1813908150fb32c629bb2b962d1ace02f9a50adb8a49ad544d2c116-removing: device or resource busy: 1
qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 419 stderr 2014/12/01 23:57:16 Error response from daemon: Cannot destroy container 1b4afafaf291fc3e85a7b82538e434fc1cf8e65e9c8191c10f6dfb3c1a80e3dc: Driver aufs failed to remove root filesystem 1b4afafaf291fc3e85a7b82538e434fc1cf8e65e9c8191c10f6dfb3c1a80e3dc: rename /tmp/docker/aufs/diff/1b4afafaf291fc3e85a7b82538e434fc1cf8e65e9c8191c10f6dfb3c1a80e3dc /tmp/docker/aufs/diff/1b4afafaf291fc3e85a7b82538e434fc1cf8e65e9c8191c10f6dfb3c1a80e3dc-removing: device or resource busy: 1
qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 346 stderr 2014/12/01 23:56:18 Error response from daemon: Cannot destroy container 7c8f8781aad17514281aa256b716b2ac86e242dace29ccae1b5856da4b7777a0: Driver aufs failed to remove root filesystem 7c8f8781aad17514281aa256b716b2ac86e242dace29ccae1b5856da4b7777a0: rename /tmp/docker/aufs/mnt/7c8f8781aad17514281aa256b716b2ac86e242dace29ccae1b5856da4b7777a0 /tmp/docker/aufs/mnt/7c8f8781aad17514281aa256b716b2ac86e242dace29ccae1b5856da4b7777a0-removing: device or resource busy: 1
qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 565 stderr 2014/12/01 23:59:07 Error response from daemon: Cannot destroy container cd23c143636afac6f4d6d8d89a46181d73d9076af1cd9765c08de7f773a079a5: Driver aufs failed to remove root filesystem cd23c143636afac6f4d6d8d89a46181d73d9076af1cd9765c08de7f773a079a5: rename /tmp/docker/aufs/mnt/cd23c143636afac6f4d6d8d89a46181d73d9076af1cd9765c08de7f773a079a5 /tmp/docker/aufs/mnt/cd23c143636afac6f4d6d8d89a46181d73d9076af1cd9765c08de7f773a079a5-removing: device or resource busy: 1
qr1hi-8i9sb-old6z4llev4hmop 24262  Every node has failed -- giving up on this round: 1
qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 409 stderr 2014/12/01 23:57:08 Error response from daemon: Cannot destroy container 39c618053ba35908cdbbffaa655d9f26d22d8cfbbb531c1a800d413f1310e2b6: Driver aufs failed to remove root filesystem 39c618053ba35908cdbbffaa655d9f26d22d8cfbbb531c1a800d413f1310e2b6: rename /tmp/docker/aufs/mnt/39c618053ba35908cdbbffaa655d9f26d22d8cfbbb531c1a800d413f1310e2b6 /tmp/docker/aufs/mnt/39c618053ba35908cdbbffaa655d9f26d22d8cfbbb531c1a800d413f1310e2b6-removing: device or resource busy: 1
qr1hi-8i9sb-wepti5u701eu818 13190  Every node has failed -- giving up on this round: 1
qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 335 stderr 2014/12/01 23:56:09 Error response from daemon: Cannot destroy container f759d811ce8436cfbce70f8b598ae6e4bdd71f2b1e4948dcc5904f5ed06bfe6b: Driver aufs failed to remove root filesystem f759d811ce8436cfbce70f8b598ae6e4bdd71f2b1e4948dcc5904f5ed06bfe6b: rename /tmp/docker/aufs/mnt/f759d811ce8436cfbce70f8b598ae6e4bdd71f2b1e4948dcc5904f5ed06bfe6b /tmp/docker/aufs/mnt/f759d811ce8436cfbce70f8b598ae6e4bdd71f2b1e4948dcc5904f5ed06bfe6b-removing: device or resource busy: 1
qr1hi-8i9sb-n5nrgn84ou6p5g9 15149  Every node has failed -- giving up on this round: 1
qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 171 stderr 2014/12/01 23:54:15 Error response from daemon: Cannot destroy container 9c2c8a06c116cb0c538d0d486e934141c20ff504af30d5a74f6d5ad51ff7a731: Driver aufs failed to remove root filesystem 9c2c8a06c116cb0c538d0d486e934141c20ff504af30d5a74f6d5ad51ff7a731: rename /tmp/docker/aufs/mnt/9c2c8a06c116cb0c538d0d486e934141c20ff504af30d5a74f6d5ad51ff7a731 /tmp/docker/aufs/mnt/9c2c8a06c116cb0c538d0d486e934141c20ff504af30d5a74f6d5ad51ff7a731-removing: device or resource busy: 1
qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 349 stderr 2014/12/01 23:56:21 Error response from daemon: Cannot destroy container 62c3953588846c35bb227b1fa5f7bd0297ed01e56d5a6cff4e628c9219a226a8: Driver aufs failed to remove root filesystem 62c3953588846c35bb227b1fa5f7bd0297ed01e56d5a6cff4e628c9219a226a8: rename /tmp/docker/aufs/mnt/62c3953588846c35bb227b1fa5f7bd0297ed01e56d5a6cff4e628c9219a226a8 /tmp/docker/aufs/mnt/62c3953588846c35bb227b1fa5f7bd0297ed01e56d5a6cff4e628c9219a226a8-removing: device or resource busy: 1
qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 102 stderr 2014/12/01 23:53:21 Error response from daemon: Cannot destroy container af5abb30979ca60d8a6d59f0f267823d42d238517a7ceab135802e53ac7f105c: Driver aufs failed to remove root filesystem af5abb30979ca60d8a6d59f0f267823d42d238517a7ceab135802e53ac7f105c: rename /tmp/docker/aufs/mnt/af5abb30979ca60d8a6d59f0f267823d42d238517a7ceab135802e53ac7f105c /tmp/docker/aufs/mnt/af5abb30979ca60d8a6d59f0f267823d42d238517a7ceab135802e53ac7f105c-removing: device or resource busy: 1
qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 267 stderr 2014/12/01 23:55:22 Error response from daemon: Cannot destroy container 172a2d79622c923fb698ea49ebacec38d58fc45f0a20c95577ca6b2dc6d4cf5b: Driver aufs failed to remove root filesystem 172a2d79622c923fb698ea49ebacec38d58fc45f0a20c95577ca6b2dc6d4cf5b: rename /tmp/docker/aufs/mnt/172a2d79622c923fb698ea49ebacec38d58fc45f0a20c95577ca6b2dc6d4cf5b /tmp/docker/aufs/mnt/172a2d79622c923fb698ea49ebacec38d58fc45f0a20c95577ca6b2dc6d4cf5b-removing: device or resource busy: 1
qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 97 stderr 2014/12/01 23:53:17 Error response from daemon: Cannot destroy container 1f755c53daac0519ab899ca6d5f442148d9e73c612a1a95871e9dafd94a3f635: Driver aufs failed to remove root filesystem 1f755c53daac0519ab899ca6d5f442148d9e73c612a1a95871e9dafd94a3f635: rename /tmp/docker/aufs/mnt/1f755c53daac0519ab899ca6d5f442148d9e73c612a1a95871e9dafd94a3f635 /tmp/docker/aufs/mnt/1f755c53daac0519ab899ca6d5f442148d9e73c612a1a95871e9dafd94a3f635-removing: device or resource busy: 1
qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 254 stderr 2014/12/01 23:55:12 Error response from daemon: Cannot destroy container 6a5ab4431228cd18cf4962adff3499f5a563128ce0cf9ac26cb9911dbb0fb245: Driver aufs failed to remove root filesystem 6a5ab4431228cd18cf4962adff3499f5a563128ce0cf9ac26cb9911dbb0fb245: rename /tmp/docker/aufs/mnt/6a5ab4431228cd18cf4962adff3499f5a563128ce0cf9ac26cb9911dbb0fb245 /tmp/docker/aufs/mnt/6a5ab4431228cd18cf4962adff3499f5a563128ce0cf9ac26cb9911dbb0fb245-removing: device or resource busy: 1
qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 99 stderr 2014/12/01 23:53:19 Error response from daemon: Cannot destroy container a932c47d486008f2d1d1598e34e1eaebbb1f3058182e82c4f0aa0e641975c7d5: Driver aufs failed to remove root filesystem a932c47d486008f2d1d1598e34e1eaebbb1f3058182e82c4f0aa0e641975c7d5: rename /tmp/docker/aufs/mnt/a932c47d486008f2d1d1598e34e1eaebbb1f3058182e82c4f0aa0e641975c7d5 /tmp/docker/aufs/mnt/a932c47d486008f2d1d1598e34e1eaebbb1f3058182e82c4f0aa0e641975c7d5-removing: device or resource busy: 1
qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 582 stderr 2014/12/01 23:59:22 Error response from daemon: Cannot destroy container 97070e17997645a9d5cc39c942f1de9ca1f8f43ea19f66c59ef859e45fc543de: Driver aufs failed to remove root filesystem 97070e17997645a9d5cc39c942f1de9ca1f8f43ea19f66c59ef859e45fc543de: rename /tmp/docker/aufs/diff/97070e17997645a9d5cc39c942f1de9ca1f8f43ea19f66c59ef859e45fc543de /tmp/docker/aufs/diff/97070e17997645a9d5cc39c942f1de9ca1f8f43ea19f66c59ef859e45fc543de-removing: device or resource busy: 1
qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 256 stderr 2014/12/01 23:55:14 Error response from daemon: Cannot destroy container a06676671e3203bd355b6597aa65ba062da22ce30e40397cde889651d5b4fce1: Driver aufs failed to remove root filesystem a06676671e3203bd355b6597aa65ba062da22ce30e40397cde889651d5b4fce1: rename /tmp/docker/aufs/mnt/a06676671e3203bd355b6597aa65ba062da22ce30e40397cde889651d5b4fce1 /tmp/docker/aufs/mnt/a06676671e3203bd355b6597aa65ba062da22ce30e40397cde889651d5b4fce1-removing: device or resource busy: 1
qr1hi-8i9sb-5wqnicleczswhxb 28048 0 failure (#1, permanent) after 2 seconds: 1
qr1hi-8i9sb-5wqnicleczswhxb 28048  Every node has failed -- giving up on this round: 1
qr1hi-8i9sb-wepti5u701eu818 13190 0 failure (#1, permanent) after 1 seconds: 1
qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 19 stderr 2014/12/01 23:52:25 Error response from daemon: Cannot destroy container 8ed72be62441620ae4d36053ef3c6da8801878785dfeae997a4bb34f17439503: Driver aufs failed to remove root filesystem 8ed72be62441620ae4d36053ef3c6da8801878785dfeae997a4bb34f17439503: rename /tmp/docker/aufs/mnt/8ed72be62441620ae4d36053ef3c6da8801878785dfeae997a4bb34f17439503 /tmp/docker/aufs/mnt/8ed72be62441620ae4d36053ef3c6da8801878785dfeae997a4bb34f17439503-removing: device or resource busy: 1

#6 Updated by Tom Clegg over 5 years ago

  • Description updated (diff)

#7 Updated by Tom Clegg over 5 years ago

  • Category set to Crunch

#8 Updated by Ward Vandewege over 5 years ago

  • Description updated (diff)

#9 Updated by Tim Pierce over 5 years ago

At commit d71422fa, this still needs work to recognize more failure reasons and to improve the formatting, but the basics are here:

(pysdk)hitchcock:/home/twp/arvados/services/api/script% python crunch-failure-report.py --help
usage: crunch-failure-report.py [-h] [--start START] [--end END]

Produce a report of Crunch failures within a specified time range

optional arguments:
  -h, --help     show this help message and exit
  --start START  Start date and time
  --end END      End date and time

(pysdk)hitchcock:/home/twp/arvados/services/api/script% python crunch-failure-report.py --start=2014-12-29T22:00:00Z
Start: 2014-12-29T22:00:00Z
End:   2014-12-30T19:41:48Z

Overview

  Started:       27
  Successful:    16 (59%)
  Failed:        10 (37%)
  In progress:    1 ( 4%)

Failures by class

  unknown   10 (100%)
  User not found on host    1 (10%)

Failures by class (detail):

  unknown   10 (100%)
    http://crvr.se/qr1hi-8i9sb-h8izsi22winuktc
    http://crvr.se/qr1hi-8i9sb-lulm7egeadtak9p
    http://crvr.se/qr1hi-8i9sb-ffrgm644kycjm2d
    http://crvr.se/qr1hi-8i9sb-o039nu58e3pn7c6
    http://crvr.se/qr1hi-8i9sb-4tbab0tch69bl5h
    http://crvr.se/qr1hi-8i9sb-jkg3xb3kvcrkvde
    http://crvr.se/qr1hi-8i9sb-mhluou03soxe8f5
    http://crvr.se/qr1hi-8i9sb-5tbv8nsti6vdmic
    http://crvr.se/qr1hi-8i9sb-f38dnwpkyasuw0e
    http://crvr.se/qr1hi-8i9sb-2ba81yr8tnxbtx2

  User not found on host    1 (10%)
    http://crvr.se/qr1hi-8i9sb-h8izsi22winuktc

#10 Updated by Tim Pierce over 5 years ago

Now at 1ee492f3: this version distinguishes different failure types and uses short names like "sys/docker".

Ward, I think you had some strong desires for the output formatting of this report and ordering of failures which I didn't see captured in the story. Let me know what the specifics are here.

#11 Updated by Ward Vandewege over 5 years ago

  • Description updated (diff)

#12 Updated by Ward Vandewege over 5 years ago

  • Description updated (diff)

#13 Updated by Ward Vandewege over 5 years ago

Reviewing 1ee492f33846d35b4ead20fbdbbc3b496719bd86

a) I've updated the report description with a bit more detail about the desired failure by class (detail) with regard to sorting, and a few extra output fields. And I dropped the url part of the uuid output.

b) I ran the report on qr1hi for the past week, and it worked. Two problems:

- sorting by descending number of occurrences is not working in the failures by class sections
- unknown is reported as 100%, which is wrong; it should be 94%

Failures by class

  sys/docker 4 ( 4%)
  unknown 101 (100%)
  crunch/node 1 ( 1%)

c) I ran the report on qr1hi for the past month, and it bombed out:

$ python crunch-failure-report.py --start '2014-12-01T00:00:00Z' --end '2014-12-31T23:59:59Z'
Traceback (most recent call last):
  File "crunch-failure-report.py", line 169, in <module>
    sys.exit(main())
  File "crunch-failure-report.py", line 99, in main
    logs = job_logs(api, job)
  File "crunch-failure-report.py", line 69, in job_logs
    log_collection = arvados.CollectionReader(job['log'], api)
  File "/usr/local/lib/python2.7/dist-packages/arvados/collection.py", line 127, in __init__
    "Argument to CollectionReader must be a manifest or a collection UUID")
arvados.errors.ArgumentError: Argument to CollectionReader must be a manifest or a collection UUID
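The ticket does not state the cause, but one plausible reading of this traceback is that some job records have an empty `log` field, so `job['log']` is not a usable locator. Under that assumption, a guard along these lines would let the report skip such jobs instead of crashing. `split_jobs_by_log` is a hypothetical helper and nothing here calls the Arvados SDK.

```python
def split_jobs_by_log(jobs):
    """Separate jobs that have a log locator from those that do not.

    Assumes each job is a dict and that a missing, None or empty "log"
    value means there is no log collection to read.
    """
    with_log, without_log = [], []
    for job in jobs:
        if job.get("log"):
            with_log.append(job)
        else:
            without_log.append(job)
    return with_log, without_log
```

Jobs in the second list could then be counted as "unknown" without attempting to open a log.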

d) formatting details:

Looking at the run output below, please fix these things:

- no colons after Started/Successful/...

- fix horizontal alignment under failures by class, also allow for longer failure class names

- no colon after "Failures by class (detail)"

Start: 2014-12-16T00:00:00Z
End:   2014-12-31T23:59:59Z

Overview

  Started:      372
  Successful:   250 (67%)
  Failed:       101 (27%)
  In progress:   21 ( 6%)

Failures by class

  sys/docker    4 ( 4%)
  unknown  101 (100%)
  crunch/node    1 ( 1%)

Failures by class (detail):

  sys/docker    4 ( 4%)
    http://crvr.se/qr1hi-8i9sb-1lh08gfnmcoqx9r
    http://crvr.se/qr1hi-8i9sb-s6t07tokevpjg28
    http://crvr.se/qr1hi-8i9sb-ch7olbbw1nh7u8r
    http://crvr.se/qr1hi-8i9sb-v949by9i93ivf2f

  unknown  101 (100%)
    http://crvr.se/qr1hi-8i9sb-0vhlqfabrjycw7o
    http://crvr.se/qr1hi-8i9sb-2x534aipm08ki75
    http://crvr.se/qr1hi-8i9sb-xyqnaboeq0ji440
    http://crvr.se/qr1hi-8i9sb-khrdlhdwenanqgw
    http://crvr.se/qr1hi-8i9sb-by9rqblusoydhm7
    http://crvr.se/qr1hi-8i9sb-2099gt31wvzmzm0
    http://crvr.se/qr1hi-8i9sb-melwo91k6vvjaow
    http://crvr.se/qr1hi-8i9sb-mkoxrkxhpxlbn63
    http://crvr.se/qr1hi-8i9sb-482dzijmokax2xi
    http://crvr.se/qr1hi-8i9sb-tzosb2voqbsvwtw
    http://crvr.se/qr1hi-8i9sb-v949by9i93ivf2f
    http://crvr.se/qr1hi-8i9sb-61pkx5azm8b5tj5
    http://crvr.se/qr1hi-8i9sb-5swvl2s3b2t1gg9
    http://crvr.se/qr1hi-8i9sb-z4jvf722g7b27m7
    http://crvr.se/qr1hi-8i9sb-dkuzbfnkx1o5mc9
    http://crvr.se/qr1hi-8i9sb-6iknziykdpuchx8
    http://crvr.se/qr1hi-8i9sb-1da5h4bem5lz3pq
    http://crvr.se/qr1hi-8i9sb-honjl0u5dozf94p
    http://crvr.se/qr1hi-8i9sb-vsnbi9p7zk2hhhu
    http://crvr.se/qr1hi-8i9sb-df4rmrwh7xfmr07
    http://crvr.se/qr1hi-8i9sb-01q7xqilnbz9au5
    http://crvr.se/qr1hi-8i9sb-6t21aff3ui9cs9l
    http://crvr.se/qr1hi-8i9sb-f82832fcep7c0o3
    http://crvr.se/qr1hi-8i9sb-3daeal2024jg5pr
...

#14 Updated by Ward Vandewege over 5 years ago

  • Description updated (diff)

#15 Updated by Tim Pierce over 5 years ago

Fixes at e5dce69:

Bugs fixed:
  • Correct counting and percentage calculation of job failures.
    • Jobs were getting categorized as both "unknown" and as a specific failure type.
  • Crashes fixed: should not raise any unhandled exceptions.
Formatting fixes:
  • Itemized failures are now sorted in descending order by failure type
  • Better horizontal alignment
  • Modified formatting to account for updated description.
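To illustrate the counting fix: if every failed job carries exactly one failure code, the per-class counts add up to the number of failed jobs and the percentages stay consistent. A rough sketch follows; `failure_summary` and its input shape are assumptions, not the script's real interface.

```python
from collections import Counter


def failure_summary(job_classes):
    """Summarize failures per class, most common first.

    job_classes maps each failed job's uuid to exactly one failure code,
    so no job can be counted under both "unknown" and a specific class.
    Returns a list of (code, count, percent) tuples.
    """
    counts = Counter(job_classes.values())
    total = sum(counts.values())
    # Descending by count; ties broken alphabetically for stable output.
    rows = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(code, n, round(100.0 * n / total)) for code, n in rows]
```

With the counts from the sample report in the description (6, 5, 4 and 3 failures out of 18), this gives 33%, 28%, 22% and 17%.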

#16 Updated by Ward Vandewege over 5 years ago

Review comments:

Sorting is still not working in failures by class and failures by class (detail). For example, from the 2014-12 report on qr1hi:

Failures by class

  sys/docker                  34 ( 14%)
  unknown                    204 ( 85%)
  crunch/node                  1 (  0%)

Another bug (?):

Start: 2014-12-01T00:00:00Z
End:   2015-01-01T00:00:00Z

Overview

  Started                    750
  Successful                 473 ( 63%)
  Failed                     239 ( 32%)
  In progress                 38 (  5%)

It seems hard to believe there would be 38 jobs still in progress that started in December 2014.

#17 Updated by Tim Pierce over 5 years ago

Fixed at a686dcb:

The script failed to take into account jobs in a Queued or Cancelled state. It now reports on all five job states explicitly (Complete, Failed, Queued, Cancelled, Running).
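As a sketch of what reporting all five states means for the overview section (the function name, the column widths, and the choice to omit states with zero jobs are assumptions for illustration):

```python
from collections import Counter

JOB_STATES = ["Complete", "Failed", "Queued", "Cancelled", "Running"]


def overview_lines(jobs):
    """Build the Overview lines from a list of job dicts with a "state" key."""
    counts = Counter(job["state"] for job in jobs)
    started = len(jobs)
    lines = ["  {:<24}{:>6}".format("Started", started)]
    for state in JOB_STATES:
        n = counts[state]
        if n == 0:
            continue  # assumed: states with no jobs are left out
        percent = round(100.0 * n / started)
        lines.append("  {:<24}{:>6} ({:>3}%)".format(state, n, percent))
    return lines
```

Because every job is in exactly one of the five states, the per-state counts sum to the Started figure.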

The code to sort the failure states has been corrected.

#18 Updated by Ward Vandewege over 5 years ago

Review comments:

$ ./crunch-failure-report.py 
Start: 2015-01-05T19:25:33Z
End:   2015-01-06T19:25:33Z

Overview

  Started                     59
  Complete                    30 ( 51%)
  Failed                      29 ( 49%)

Traceback (most recent call last):
  File "./crunch-failure-report.py", line 219, in <module>
    sys.exit(main())
  File "./crunch-failure-report.py", line 203, in main
    job_name = job_pipeline_name(api, job_info['uuid'])
  File "./crunch-failure-report.py", line 107, in job_pipeline_name
    job_pipeline_names[job_uuid] = _lookup_pipeline_name(api, job_uuid)
  File "./crunch-failure-report.py", line 101, in _lookup_pipeline_name
    pt = api.pipeline_templates().get(uuid=pi['pipeline_template_uuid']).execute()
  File "/usr/local/lib/python2.7/dist-packages/apiclient/discovery.py", line 583, in method
    raise TypeError('Missing required parameter "%s"' % name)
TypeError: Missing required parameter "uuid" 

Assuming that the problem above is not a sdk version problem, please do not return this branch for review before you can

a) run ./crunch-failure-report.py without arguments on qr1hi

b) run ./crunch-failure-report.py for the entire month of 2014-12 on qr1hi

The report for the month of December looks good.

#19 Updated by Tim Pierce over 5 years ago

Updated at 4b9208f2b with more comprehensive exception handling. (If a failed pipeline instance has no pipeline_template_uuid, it will be rendered with a blank pipeline name in the failure details.)
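The blank-name fallback can be sketched like this. `pipeline_name` and the `fetch_template` callable are hypothetical: `fetch_template` stands in for the `api.pipeline_templates().get(uuid=...).execute()` call seen in the traceback above, and catching every exception is a simplification for this sketch.

```python
def pipeline_name(pipeline_instance, fetch_template):
    """Return the pipeline template's name, or "" if it cannot be determined."""
    template_uuid = pipeline_instance.get("pipeline_template_uuid")
    if not template_uuid:
        # No template recorded for this instance: render a blank name.
        return ""
    try:
        template = fetch_template(template_uuid)
    except Exception:
        # Lookup failed (e.g. template deleted or unreadable): blank name.
        return ""
    return template.get("name", "")
```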

Ward, I'm sorry for failing to check each push more comprehensively. I have tested this version with:

  • no arguments (this defaulted to --start=2015-01-05T20:59:51Z --end=2015-01-06T20:59:51Z)
  • --start=2014-12-01T00:00:00Z --end=2015-01-01T00:00:00Z
  • --start=2014-11-01T00:00:00Z --end=2014-12-01T00:00:00Z
  • --start=2014-11-01T00:00:00Z and no --end parameter (for the maximum possible coverage of testing all job states and all known failure types at once).

In each case I checked that the error types are sorted consistently from most common to least common, that the job counts and error counts add up consistently, and that the percentages reported match the job counts listed.

#20 Updated by Ward Vandewege over 5 years ago

Looking good now. I think this can be merged. Thanks!

#21 Updated by Tim Pierce over 5 years ago

I made one last change: renaming crunch-failure-report.py to crunch_failure_report.py, which at least permits importing it and therefore (someday) tests to be added.

Merged and pushed. Thanks for your feedback and patience.

#22 Updated by Tim Pierce over 5 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 67 to 100

Applied in changeset arvados|commit:a32c4f9997a0c8941b62668c5e59941985359c05.
