Feature #4598
closed
[Crunch] [DRAFT] Classify job failures by type, report statistics
Description
- Get a list of jobs created in this interval (no need to report the entire list)
- Report #succeeded, #failed, #unfinished
- For each failed job, examine the log file to classify failure
- Find first permanent task failure
- Find last few log messages from that task
- Match against a list of telltale regexps like /Cannot destroy container/ and assign a failure code like "sys/docker" (this list can be very short at first, we'll refine it over time)
- Report number of jobs for each failure code
We'll use (and refine) this by picking the largest class(es) of failure codes (sometimes including "unknown"), modifying the regexp list to get more helpful/specific error codes, fixing bugs, improving docs, etc.
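The classification step described above can be sketched as follows. This is a minimal illustration, not the script's actual code: the names `FAILURE_RULES`, `classify_failure` and `tally_failures` are hypothetical, and only the /Cannot destroy container/ pattern and the "sys/docker" code come from this description.

```python
import re

# Hypothetical rule list: (compiled regexp, failure code). Only this first
# pattern and code appear in the description; the list is meant to grow.
FAILURE_RULES = [
    (re.compile(r"Cannot destroy container"), "sys/docker"),
]


def classify_failure(log_lines, rules=FAILURE_RULES):
    """Return the failure code for the first log line that matches a rule.

    Falls back to "unknown" when no line matches any rule.
    """
    for line in log_lines:
        for pattern, code in rules:
            if pattern.search(line):
                return code
    return "unknown"


def tally_failures(jobs_log_lines):
    """Count failed jobs per failure code.

    jobs_log_lines maps a job UUID to the last few log lines of that
    job's first permanently failed task (an assumed input shape).
    """
    counts = {}
    for lines in jobs_log_lines.values():
        code = classify_failure(lines)
        counts[code] = counts.get(code, 0) + 1
    return counts
```

Because each job receives exactly one code, the per-code counts always sum to the number of failed jobs.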
Sample report:
Start  2014/12/22 00:00:00
End    2014/12/23 12:54:00

Overview
  Started       31
  Succeeded     12 (39%)
  Failed        18 (58%)
  In progress    1 ( 3%)

Failures by class
  sys/docker     6 (33%)
  user           5 (28%)
  sys/slurm      4 (22%)
  unknown        3 (17%)

Failures by class (detail)
  sys/docker     6 (33%)
    qr1hi-8i9sb-r6yn2i8160nwlma  Ward Vandewege   diagnostics hash output
    qr1hi-8i9sb-r6yn2i8250nwlmb  Ward Vandewege   output of hasher
    qr1hi-8i9sb-r6yn2i8340nwlmc  Ward Vandewege   diagnostics hash output
    qr1hi-8i9sb-r6yn2i8430nwlmd  Ward Vandewege   output of hasher
    qr1hi-8i9sb-r6yn2i8520nwlme  Ward Vandewege   diagnostics hash output
    qr1hi-8i9sb-r6yn2i8610nwlmf  Ward Vandewege   output of hasher
  (and so on for each failure class)
Failures by class should be sorted in descending order of occurrence.
Failures by class detail output should look like this for each failure:
- four spaces
- uuid (27 bytes)
- two spaces
- name of the user running the job (15 chars, truncate if necessary)
- two spaces
- name of the job (29 chars, truncate if necessary)
That brings the line length to 79 characters, which is the intention.
For each failure class, the list of failed jobs should be sorted by date, ascending.
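The detail-line layout and ordering above could be produced along these lines. This is a sketch under stated assumptions: the function names and the dict keys `uuid`, `user`, `name` and `created_at` are hypothetical, and dates are assumed to be ISO 8601 strings in one timezone so that they sort correctly as text.

```python
def format_detail_line(uuid, user_name, job_name):
    """Build one 79-column detail line: 4 + 27 + 2 + 15 + 2 + 29.

    The "<W.W" format spec pads each field to W columns and truncates
    anything longer than W characters.
    """
    return "    {:<27.27}  {:<15.15}  {:<29.29}".format(uuid, user_name, job_name)


def detail_lines(failed_jobs):
    """Return formatted lines for one failure class, oldest job first.

    failed_jobs is an iterable of dicts with the hypothetical keys
    'uuid', 'user', 'name' and 'created_at'.
    """
    ordered = sorted(failed_jobs, key=lambda job: job["created_at"])
    return [format_detail_line(j["uuid"], j["user"], j["name"]) for j in ordered]
```

Padding the last field as well keeps every line at exactly 79 characters; a caller that prefers no trailing whitespace can strip it on output.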
Updated by Tom Clegg about 10 years ago
- Subject changed from [Crunch] tally kinds of job failures for a period of time to [Crunch] [DRAFT] Classify job failures by type, report statistics
- Story points changed from 0.5 to 2.0
Updated by Ward Vandewege almost 10 years ago
- Status changed from New to In Progress
Updated by Tim Pierce almost 10 years ago
- Target version changed from 2014-12-10 sprint to 2015-01-07 sprint
Updated by Tim Pierce almost 10 years ago
At b03a6a8:
Needs some tweaks to remove noise from the log lines (additional timestamps from Docker, pids, etc). Need feedback from ops to understand what is most useful here.
Calling syntax:
(arv4598)hitchcock:/home/twp/arvados/services/api/script% ./crunch-failure-report.py --help
usage: crunch-failure-report.py [-h] [--start START] [--end END] [--match MATCH]

Produce a report of Crunch failures within a specified time range

optional arguments:
  -h, --help     show this help message and exit
  --start START  Start date and time
  --end END      End date and time
  --match MATCH  Regular expression to match on Crunch error output lines.
(The default match expression is 'fail')
Example output:
(arv4598)hitchcock:/home/twp/arvados/services/api/script% ./crunch-failure-report.py --start=2014-12-01T00:00:00Z --end=2014-12-02T00:00:00Z qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 428 stderr 2014/12/01 23:57:23 Error response from daemon: Cannot destroy container 3f6132fdbd376e0a5a7e797929cad6f47b516dd4125d06fb44c219f294c26f9d: Driver aufs failed to remove root filesystem 3f6132fdbd376e0a5a7e797929cad6f47b516dd4125d06fb44c219f294c26f9d: rename /tmp/docker/aufs/mnt/3f6132fdbd376e0a5a7e797929cad6f47b516dd4125d06fb44c219f294c26f9d /tmp/docker/aufs/mnt/3f6132fdbd376e0a5a7e797929cad6f47b516dd4125d06fb44c219f294c26f9d-removing: device or resource busy: 1 qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 84 stderr 2014/12/01 23:53:07 Error response from daemon: Cannot destroy container 3f70d7c88276914441dcd9f9e3a7eb254f1ed90f9cef445781543de897664cbb: Driver aufs failed to remove root filesystem 3f70d7c88276914441dcd9f9e3a7eb254f1ed90f9cef445781543de897664cbb: rename /tmp/docker/aufs/mnt/3f70d7c88276914441dcd9f9e3a7eb254f1ed90f9cef445781543de897664cbb /tmp/docker/aufs/mnt/3f70d7c88276914441dcd9f9e3a7eb254f1ed90f9cef445781543de897664cbb-removing: device or resource busy: 1 qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 330 stderr 2014/12/01 23:56:06 Error response from daemon: Cannot destroy container 501d9a5aab5fceebc7494d1a9c5d2c3c7b406bf2b0fb9dab01eebb546dd69150: Driver aufs failed to remove root filesystem 501d9a5aab5fceebc7494d1a9c5d2c3c7b406bf2b0fb9dab01eebb546dd69150: rename /tmp/docker/aufs/mnt/501d9a5aab5fceebc7494d1a9c5d2c3c7b406bf2b0fb9dab01eebb546dd69150 /tmp/docker/aufs/mnt/501d9a5aab5fceebc7494d1a9c5d2c3c7b406bf2b0fb9dab01eebb546dd69150-removing: device or resource busy: 1 qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 86 stderr 2014/12/01 23:53:08 Error response from daemon: Cannot destroy container ee1270f705c77f4e985437abf0c67ea0231e6eb22b70814e8263f6f5ef7e15cd: Driver aufs failed to remove root filesystem ee1270f705c77f4e985437abf0c67ea0231e6eb22b70814e8263f6f5ef7e15cd: rename 
/tmp/docker/aufs/mnt/ee1270f705c77f4e985437abf0c67ea0231e6eb22b70814e8263f6f5ef7e15cd /tmp/docker/aufs/mnt/ee1270f705c77f4e985437abf0c67ea0231e6eb22b70814e8263f6f5ef7e15cd-removing: device or resource busy: 1 qr1hi-8i9sb-z1sxm0ugggoehqn 15358 0 failure (#1, permanent) after 2 seconds: 1 qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 15 stderr 2014/12/01 23:52:22 Error response from daemon: Cannot destroy container 2fb1254f18bde47ae1ab13a4b2b987ab31d8411cb255fd2c0c6b2a2e76416001: Driver aufs failed to remove root filesystem 2fb1254f18bde47ae1ab13a4b2b987ab31d8411cb255fd2c0c6b2a2e76416001: rename /tmp/docker/aufs/mnt/2fb1254f18bde47ae1ab13a4b2b987ab31d8411cb255fd2c0c6b2a2e76416001 /tmp/docker/aufs/mnt/2fb1254f18bde47ae1ab13a4b2b987ab31d8411cb255fd2c0c6b2a2e76416001-removing: device or resource busy: 1 qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 333 stderr 2014/12/01 23:56:08 Error response from daemon: Cannot destroy container f3154195522711fb1ae30b63fe7e447d831e0ecc83c41c6190cb2b34a8a69977: Driver aufs failed to remove root filesystem f3154195522711fb1ae30b63fe7e447d831e0ecc83c41c6190cb2b34a8a69977: rename /tmp/docker/aufs/mnt/f3154195522711fb1ae30b63fe7e447d831e0ecc83c41c6190cb2b34a8a69977 /tmp/docker/aufs/mnt/f3154195522711fb1ae30b63fe7e447d831e0ecc83c41c6190cb2b34a8a69977-removing: device or resource busy: 1 qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 262 stderr 2014/12/01 23:55:19 Error response from daemon: Cannot destroy container 89babfa33f3632c9f1c8bebffbd06cc9e1ccce3e81518ec9cc7e761899998b40: Driver aufs failed to remove root filesystem 89babfa33f3632c9f1c8bebffbd06cc9e1ccce3e81518ec9cc7e761899998b40: rename /tmp/docker/aufs/mnt/89babfa33f3632c9f1c8bebffbd06cc9e1ccce3e81518ec9cc7e761899998b40 /tmp/docker/aufs/mnt/89babfa33f3632c9f1c8bebffbd06cc9e1ccce3e81518ec9cc7e761899998b40-removing: device or resource busy: 1 qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 418 stderr 2014/12/01 23:57:15 Error response from daemon: Cannot destroy container 
2c447e04628f21b53bc6fbe14765e9f82531ee19e106e9a0348957be807553d2: Driver aufs failed to remove root filesystem 2c447e04628f21b53bc6fbe14765e9f82531ee19e106e9a0348957be807553d2: rename /tmp/docker/aufs/mnt/2c447e04628f21b53bc6fbe14765e9f82531ee19e106e9a0348957be807553d2 /tmp/docker/aufs/mnt/2c447e04628f21b53bc6fbe14765e9f82531ee19e106e9a0348957be807553d2-removing: device or resource busy: 1 qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 570 stderr 2014/12/01 23:59:11 Error response from daemon: Cannot destroy container 88c57dcaf80adb0f468ab8b85b6b0da6fa4ab4dcffebca080048dd266180cb01: Driver aufs failed to remove root filesystem 88c57dcaf80adb0f468ab8b85b6b0da6fa4ab4dcffebca080048dd266180cb01: rename /tmp/docker/aufs/mnt/88c57dcaf80adb0f468ab8b85b6b0da6fa4ab4dcffebca080048dd266180cb01 /tmp/docker/aufs/mnt/88c57dcaf80adb0f468ab8b85b6b0da6fa4ab4dcffebca080048dd266180cb01-removing: device or resource busy: 1 qr1hi-8i9sb-n5nrgn84ou6p5g9 15149 0 failure (#1, permanent) after 3 seconds: 1 qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 573 stderr 2014/12/01 23:59:14 Error response from daemon: Cannot destroy container 180e4c1b69a773b5242d886bc5814fb40e06c49ed8554a8e6514e8e3bcfac62d: Driver aufs failed to remove root filesystem 180e4c1b69a773b5242d886bc5814fb40e06c49ed8554a8e6514e8e3bcfac62d: rename /tmp/docker/aufs/mnt/180e4c1b69a773b5242d886bc5814fb40e06c49ed8554a8e6514e8e3bcfac62d /tmp/docker/aufs/mnt/180e4c1b69a773b5242d886bc5814fb40e06c49ed8554a8e6514e8e3bcfac62d-removing: device or resource busy: 1 qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 5 stderr 2014/12/01 23:52:17 Error response from daemon: Cannot destroy container 0a1184c8f8836aad6c49bb2136e916baa5a37b2884a8288cf2526e600decbd34: Driver aufs failed to remove root filesystem 0a1184c8f8836aad6c49bb2136e916baa5a37b2884a8288cf2526e600decbd34: rename /tmp/docker/aufs/mnt/0a1184c8f8836aad6c49bb2136e916baa5a37b2884a8288cf2526e600decbd34 /tmp/docker/aufs/mnt/0a1184c8f8836aad6c49bb2136e916baa5a37b2884a8288cf2526e600decbd34-removing: device or resource 
busy: 1 qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 411 stderr 2014/12/01 23:57:09 Error response from daemon: Cannot destroy container 18f5a7fdaa17d6bc250d8d0fb4063fead0bab042d587a73cfd0738fb231682a3: Driver aufs failed to remove root filesystem 18f5a7fdaa17d6bc250d8d0fb4063fead0bab042d587a73cfd0738fb231682a3: rename /tmp/docker/aufs/diff/18f5a7fdaa17d6bc250d8d0fb4063fead0bab042d587a73cfd0738fb231682a3 /tmp/docker/aufs/diff/18f5a7fdaa17d6bc250d8d0fb4063fead0bab042d587a73cfd0738fb231682a3-removing: device or resource busy: 1 qr1hi-8i9sb-z1sxm0ugggoehqn 15358 Every node has failed -- giving up on this round: 1 qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 504 stderr 2014/12/01 23:58:22 Error response from daemon: Cannot destroy container dcf455533c1fcf0e8247742fc59e33927585f0a177823fe6280b0f66946c3168: Driver aufs failed to remove root filesystem dcf455533c1fcf0e8247742fc59e33927585f0a177823fe6280b0f66946c3168: rename /tmp/docker/aufs/mnt/dcf455533c1fcf0e8247742fc59e33927585f0a177823fe6280b0f66946c3168 /tmp/docker/aufs/mnt/dcf455533c1fcf0e8247742fc59e33927585f0a177823fe6280b0f66946c3168-removing: device or resource busy: 1 qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 180 stderr 2014/12/01 23:54:20 Error response from daemon: Cannot destroy container ab905462517855f1d8029725f878c736b9d44671cf0e0647bf0dc7009d09982f: Driver aufs failed to remove root filesystem ab905462517855f1d8029725f878c736b9d44671cf0e0647bf0dc7009d09982f: rename /tmp/docker/aufs/mnt/ab905462517855f1d8029725f878c736b9d44671cf0e0647bf0dc7009d09982f /tmp/docker/aufs/mnt/ab905462517855f1d8029725f878c736b9d44671cf0e0647bf0dc7009d09982f-removing: device or resource busy: 1 qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 182 stderr 2014/12/01 23:54:22 Error response from daemon: Cannot destroy container 63e919966e1c7ede25bdf118edc67ebf1b1fc3b05f191b86409ed97ad19aa080: Driver aufs failed to remove root filesystem 63e919966e1c7ede25bdf118edc67ebf1b1fc3b05f191b86409ed97ad19aa080: rename 
/tmp/docker/aufs/mnt/63e919966e1c7ede25bdf118edc67ebf1b1fc3b05f191b86409ed97ad19aa080 /tmp/docker/aufs/mnt/63e919966e1c7ede25bdf118edc67ebf1b1fc3b05f191b86409ed97ad19aa080-removing: device or resource busy: 1 qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 17 stderr 2014/12/01 23:52:24 Error response from daemon: Cannot destroy container cb5f5d0e2b6b3fc95ac16770ce275d3823eead268479378b84668bbc8680952d: Driver aufs failed to remove root filesystem cb5f5d0e2b6b3fc95ac16770ce275d3823eead268479378b84668bbc8680952d: rename /tmp/docker/aufs/mnt/cb5f5d0e2b6b3fc95ac16770ce275d3823eead268479378b84668bbc8680952d /tmp/docker/aufs/mnt/cb5f5d0e2b6b3fc95ac16770ce275d3823eead268479378b84668bbc8680952d-removing: device or resource busy: 1 qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 8 stderr 2014/12/01 23:52:18 Error response from daemon: Cannot destroy container c336e8e34d7ac6f0eb3bfc6a9ad86249d23d260088b4fd55a57ca3cc23e70336: Driver aufs failed to remove root filesystem c336e8e34d7ac6f0eb3bfc6a9ad86249d23d260088b4fd55a57ca3cc23e70336: rename /tmp/docker/aufs/mnt/c336e8e34d7ac6f0eb3bfc6a9ad86249d23d260088b4fd55a57ca3cc23e70336 /tmp/docker/aufs/mnt/c336e8e34d7ac6f0eb3bfc6a9ad86249d23d260088b4fd55a57ca3cc23e70336-removing: device or resource busy: 1 qr1hi-8i9sb-old6z4llev4hmop 24262 0 failure (#1, permanent) after 5 seconds: 1 qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 501 stderr 2014/12/01 23:58:20 Error response from daemon: Cannot destroy container ba79d4676736571b11b275bc889a4c81e88eb6ad5225804b3245b29ac5d45f72: Driver aufs failed to remove root filesystem ba79d4676736571b11b275bc889a4c81e88eb6ad5225804b3245b29ac5d45f72: rename /tmp/docker/aufs/mnt/ba79d4676736571b11b275bc889a4c81e88eb6ad5225804b3245b29ac5d45f72 /tmp/docker/aufs/mnt/ba79d4676736571b11b275bc889a4c81e88eb6ad5225804b3245b29ac5d45f72-removing: device or resource busy: 1 qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 6 stderr 2014/12/01 23:52:17 Error response from daemon: Cannot destroy container 0e40c89cd408af0e4c7d7263ba48e607568673c760c1fb47d978488072ad19a5: 
Driver aufs failed to remove root filesystem 0e40c89cd408af0e4c7d7263ba48e607568673c760c1fb47d978488072ad19a5: rename /tmp/docker/aufs/mnt/0e40c89cd408af0e4c7d7263ba48e607568673c760c1fb47d978488072ad19a5 /tmp/docker/aufs/mnt/0e40c89cd408af0e4c7d7263ba48e607568673c760c1fb47d978488072ad19a5-removing: device or resource busy: 1 qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 175 stderr 2014/12/01 23:54:17 Error response from daemon: Cannot destroy container 3a9cd4bdd819e60a62ac8bef8226a8a0d4b47d2f1c75e215088c92602074e072: Driver aufs failed to remove root filesystem 3a9cd4bdd819e60a62ac8bef8226a8a0d4b47d2f1c75e215088c92602074e072: rename /tmp/docker/aufs/mnt/3a9cd4bdd819e60a62ac8bef8226a8a0d4b47d2f1c75e215088c92602074e072 /tmp/docker/aufs/mnt/3a9cd4bdd819e60a62ac8bef8226a8a0d4b47d2f1c75e215088c92602074e072-removing: device or resource busy: 1 qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 264 stderr 2014/12/01 23:55:20 Error response from daemon: Cannot destroy container 62c3a3cf35b7b2dc67f8113d1f14c226536d343717aaa06c041a6b27e43c2142: Driver aufs failed to remove root filesystem 62c3a3cf35b7b2dc67f8113d1f14c226536d343717aaa06c041a6b27e43c2142: rename /tmp/docker/aufs/mnt/62c3a3cf35b7b2dc67f8113d1f14c226536d343717aaa06c041a6b27e43c2142 /tmp/docker/aufs/mnt/62c3a3cf35b7b2dc67f8113d1f14c226536d343717aaa06c041a6b27e43c2142-removing: device or resource busy: 1 qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 341 stderr 2014/12/01 23:56:14 Error response from daemon: Cannot destroy container bbf658e6a1813908150fb32c629bb2b962d1ace02f9a50adb8a49ad544d2c116: Driver aufs failed to remove root filesystem bbf658e6a1813908150fb32c629bb2b962d1ace02f9a50adb8a49ad544d2c116: rename /tmp/docker/aufs/mnt/bbf658e6a1813908150fb32c629bb2b962d1ace02f9a50adb8a49ad544d2c116 /tmp/docker/aufs/mnt/bbf658e6a1813908150fb32c629bb2b962d1ace02f9a50adb8a49ad544d2c116-removing: device or resource busy: 1 qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 419 stderr 2014/12/01 23:57:16 Error response from daemon: Cannot destroy container 
1b4afafaf291fc3e85a7b82538e434fc1cf8e65e9c8191c10f6dfb3c1a80e3dc: Driver aufs failed to remove root filesystem 1b4afafaf291fc3e85a7b82538e434fc1cf8e65e9c8191c10f6dfb3c1a80e3dc: rename /tmp/docker/aufs/diff/1b4afafaf291fc3e85a7b82538e434fc1cf8e65e9c8191c10f6dfb3c1a80e3dc /tmp/docker/aufs/diff/1b4afafaf291fc3e85a7b82538e434fc1cf8e65e9c8191c10f6dfb3c1a80e3dc-removing: device or resource busy: 1 qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 346 stderr 2014/12/01 23:56:18 Error response from daemon: Cannot destroy container 7c8f8781aad17514281aa256b716b2ac86e242dace29ccae1b5856da4b7777a0: Driver aufs failed to remove root filesystem 7c8f8781aad17514281aa256b716b2ac86e242dace29ccae1b5856da4b7777a0: rename /tmp/docker/aufs/mnt/7c8f8781aad17514281aa256b716b2ac86e242dace29ccae1b5856da4b7777a0 /tmp/docker/aufs/mnt/7c8f8781aad17514281aa256b716b2ac86e242dace29ccae1b5856da4b7777a0-removing: device or resource busy: 1 qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 565 stderr 2014/12/01 23:59:07 Error response from daemon: Cannot destroy container cd23c143636afac6f4d6d8d89a46181d73d9076af1cd9765c08de7f773a079a5: Driver aufs failed to remove root filesystem cd23c143636afac6f4d6d8d89a46181d73d9076af1cd9765c08de7f773a079a5: rename /tmp/docker/aufs/mnt/cd23c143636afac6f4d6d8d89a46181d73d9076af1cd9765c08de7f773a079a5 /tmp/docker/aufs/mnt/cd23c143636afac6f4d6d8d89a46181d73d9076af1cd9765c08de7f773a079a5-removing: device or resource busy: 1 qr1hi-8i9sb-old6z4llev4hmop 24262 Every node has failed -- giving up on this round: 1 qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 409 stderr 2014/12/01 23:57:08 Error response from daemon: Cannot destroy container 39c618053ba35908cdbbffaa655d9f26d22d8cfbbb531c1a800d413f1310e2b6: Driver aufs failed to remove root filesystem 39c618053ba35908cdbbffaa655d9f26d22d8cfbbb531c1a800d413f1310e2b6: rename /tmp/docker/aufs/mnt/39c618053ba35908cdbbffaa655d9f26d22d8cfbbb531c1a800d413f1310e2b6 /tmp/docker/aufs/mnt/39c618053ba35908cdbbffaa655d9f26d22d8cfbbb531c1a800d413f1310e2b6-removing: device or 
resource busy: 1 qr1hi-8i9sb-wepti5u701eu818 13190 Every node has failed -- giving up on this round: 1 qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 335 stderr 2014/12/01 23:56:09 Error response from daemon: Cannot destroy container f759d811ce8436cfbce70f8b598ae6e4bdd71f2b1e4948dcc5904f5ed06bfe6b: Driver aufs failed to remove root filesystem f759d811ce8436cfbce70f8b598ae6e4bdd71f2b1e4948dcc5904f5ed06bfe6b: rename /tmp/docker/aufs/mnt/f759d811ce8436cfbce70f8b598ae6e4bdd71f2b1e4948dcc5904f5ed06bfe6b /tmp/docker/aufs/mnt/f759d811ce8436cfbce70f8b598ae6e4bdd71f2b1e4948dcc5904f5ed06bfe6b-removing: device or resource busy: 1 qr1hi-8i9sb-n5nrgn84ou6p5g9 15149 Every node has failed -- giving up on this round: 1 qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 171 stderr 2014/12/01 23:54:15 Error response from daemon: Cannot destroy container 9c2c8a06c116cb0c538d0d486e934141c20ff504af30d5a74f6d5ad51ff7a731: Driver aufs failed to remove root filesystem 9c2c8a06c116cb0c538d0d486e934141c20ff504af30d5a74f6d5ad51ff7a731: rename /tmp/docker/aufs/mnt/9c2c8a06c116cb0c538d0d486e934141c20ff504af30d5a74f6d5ad51ff7a731 /tmp/docker/aufs/mnt/9c2c8a06c116cb0c538d0d486e934141c20ff504af30d5a74f6d5ad51ff7a731-removing: device or resource busy: 1 qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 349 stderr 2014/12/01 23:56:21 Error response from daemon: Cannot destroy container 62c3953588846c35bb227b1fa5f7bd0297ed01e56d5a6cff4e628c9219a226a8: Driver aufs failed to remove root filesystem 62c3953588846c35bb227b1fa5f7bd0297ed01e56d5a6cff4e628c9219a226a8: rename /tmp/docker/aufs/mnt/62c3953588846c35bb227b1fa5f7bd0297ed01e56d5a6cff4e628c9219a226a8 /tmp/docker/aufs/mnt/62c3953588846c35bb227b1fa5f7bd0297ed01e56d5a6cff4e628c9219a226a8-removing: device or resource busy: 1 qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 102 stderr 2014/12/01 23:53:21 Error response from daemon: Cannot destroy container af5abb30979ca60d8a6d59f0f267823d42d238517a7ceab135802e53ac7f105c: Driver aufs failed to remove root filesystem 
af5abb30979ca60d8a6d59f0f267823d42d238517a7ceab135802e53ac7f105c: rename /tmp/docker/aufs/mnt/af5abb30979ca60d8a6d59f0f267823d42d238517a7ceab135802e53ac7f105c /tmp/docker/aufs/mnt/af5abb30979ca60d8a6d59f0f267823d42d238517a7ceab135802e53ac7f105c-removing: device or resource busy: 1 qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 267 stderr 2014/12/01 23:55:22 Error response from daemon: Cannot destroy container 172a2d79622c923fb698ea49ebacec38d58fc45f0a20c95577ca6b2dc6d4cf5b: Driver aufs failed to remove root filesystem 172a2d79622c923fb698ea49ebacec38d58fc45f0a20c95577ca6b2dc6d4cf5b: rename /tmp/docker/aufs/mnt/172a2d79622c923fb698ea49ebacec38d58fc45f0a20c95577ca6b2dc6d4cf5b /tmp/docker/aufs/mnt/172a2d79622c923fb698ea49ebacec38d58fc45f0a20c95577ca6b2dc6d4cf5b-removing: device or resource busy: 1 qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 97 stderr 2014/12/01 23:53:17 Error response from daemon: Cannot destroy container 1f755c53daac0519ab899ca6d5f442148d9e73c612a1a95871e9dafd94a3f635: Driver aufs failed to remove root filesystem 1f755c53daac0519ab899ca6d5f442148d9e73c612a1a95871e9dafd94a3f635: rename /tmp/docker/aufs/mnt/1f755c53daac0519ab899ca6d5f442148d9e73c612a1a95871e9dafd94a3f635 /tmp/docker/aufs/mnt/1f755c53daac0519ab899ca6d5f442148d9e73c612a1a95871e9dafd94a3f635-removing: device or resource busy: 1 qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 254 stderr 2014/12/01 23:55:12 Error response from daemon: Cannot destroy container 6a5ab4431228cd18cf4962adff3499f5a563128ce0cf9ac26cb9911dbb0fb245: Driver aufs failed to remove root filesystem 6a5ab4431228cd18cf4962adff3499f5a563128ce0cf9ac26cb9911dbb0fb245: rename /tmp/docker/aufs/mnt/6a5ab4431228cd18cf4962adff3499f5a563128ce0cf9ac26cb9911dbb0fb245 /tmp/docker/aufs/mnt/6a5ab4431228cd18cf4962adff3499f5a563128ce0cf9ac26cb9911dbb0fb245-removing: device or resource busy: 1 qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 99 stderr 2014/12/01 23:53:19 Error response from daemon: Cannot destroy container a932c47d486008f2d1d1598e34e1eaebbb1f3058182e82c4f0aa0e641975c7d5: 
Driver aufs failed to remove root filesystem a932c47d486008f2d1d1598e34e1eaebbb1f3058182e82c4f0aa0e641975c7d5: rename /tmp/docker/aufs/mnt/a932c47d486008f2d1d1598e34e1eaebbb1f3058182e82c4f0aa0e641975c7d5 /tmp/docker/aufs/mnt/a932c47d486008f2d1d1598e34e1eaebbb1f3058182e82c4f0aa0e641975c7d5-removing: device or resource busy: 1 qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 582 stderr 2014/12/01 23:59:22 Error response from daemon: Cannot destroy container 97070e17997645a9d5cc39c942f1de9ca1f8f43ea19f66c59ef859e45fc543de: Driver aufs failed to remove root filesystem 97070e17997645a9d5cc39c942f1de9ca1f8f43ea19f66c59ef859e45fc543de: rename /tmp/docker/aufs/diff/97070e17997645a9d5cc39c942f1de9ca1f8f43ea19f66c59ef859e45fc543de /tmp/docker/aufs/diff/97070e17997645a9d5cc39c942f1de9ca1f8f43ea19f66c59ef859e45fc543de-removing: device or resource busy: 1 qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 256 stderr 2014/12/01 23:55:14 Error response from daemon: Cannot destroy container a06676671e3203bd355b6597aa65ba062da22ce30e40397cde889651d5b4fce1: Driver aufs failed to remove root filesystem a06676671e3203bd355b6597aa65ba062da22ce30e40397cde889651d5b4fce1: rename /tmp/docker/aufs/mnt/a06676671e3203bd355b6597aa65ba062da22ce30e40397cde889651d5b4fce1 /tmp/docker/aufs/mnt/a06676671e3203bd355b6597aa65ba062da22ce30e40397cde889651d5b4fce1-removing: device or resource busy: 1 qr1hi-8i9sb-5wqnicleczswhxb 28048 0 failure (#1, permanent) after 2 seconds: 1 qr1hi-8i9sb-5wqnicleczswhxb 28048 Every node has failed -- giving up on this round: 1 qr1hi-8i9sb-wepti5u701eu818 13190 0 failure (#1, permanent) after 1 seconds: 1 qr1hi-8i9sb-3xg2nfzxx6vf6zh 18681 19 stderr 2014/12/01 23:52:25 Error response from daemon: Cannot destroy container 8ed72be62441620ae4d36053ef3c6da8801878785dfeae997a4bb34f17439503: Driver aufs failed to remove root filesystem 8ed72be62441620ae4d36053ef3c6da8801878785dfeae997a4bb34f17439503: rename /tmp/docker/aufs/mnt/8ed72be62441620ae4d36053ef3c6da8801878785dfeae997a4bb34f17439503 
/tmp/docker/aufs/mnt/8ed72be62441620ae4d36053ef3c6da8801878785dfeae997a4bb34f17439503-removing: device or resource busy: 1
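In the output above almost every line is counted once, because each one carries a different container ID and Docker timestamp. One possible way to remove that noise before tallying is sketched here; the regexps and the names `normalize` and `tally` are assumptions for illustration, not what the script does.

```python
import re
from collections import Counter

# Assumed noise patterns: 64-character hex container IDs and the
# "YYYY/MM/DD HH:MM:SS " timestamps that Docker prepends to its messages.
HEX_ID = re.compile(r"\b[0-9a-f]{64}\b")
DOCKER_TIMESTAMP = re.compile(r"\b\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} ")


def normalize(line):
    """Replace container IDs with a placeholder and drop Docker timestamps."""
    line = HEX_ID.sub("<container>", line)
    return DOCKER_TIMESTAMP.sub("", line)


def tally(lines):
    """Count log lines after normalization, so near-duplicates collapse."""
    return Counter(normalize(line) for line in lines)
```

With this kind of normalization the repeated "Cannot destroy container" messages would collapse into a single entry with a meaningful count.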
Updated by Tim Pierce almost 10 years ago
At commit d71422fa, this still needs some work to provide more failure reasons and better formatting, but the basics are here:
(pysdk)hitchcock:/home/twp/arvados/services/api/script% python crunch-failure-report.py --help
usage: crunch-failure-report.py [-h] [--start START] [--end END]

Produce a report of Crunch failures within a specified time range

optional arguments:
  -h, --help     show this help message and exit
  --start START  Start date and time
  --end END      End date and time

(pysdk)hitchcock:/home/twp/arvados/services/api/script% python crunch-failure-report.py --start=2014-12-29T22:00:00Z
Start: 2014-12-29T22:00:00Z
End: 2014-12-30T19:41:48Z

Overview
  Started: 27
  Successful: 16 (59%)
  Failed: 10 (37%)
  In progress: 1 ( 4%)

Failures by class
  unknown 10 (100%)
  User not found on host 1 (10%)

Failures by class (detail):
  unknown 10 (100%)
    http://crvr.se/qr1hi-8i9sb-h8izsi22winuktc
    http://crvr.se/qr1hi-8i9sb-lulm7egeadtak9p
    http://crvr.se/qr1hi-8i9sb-ffrgm644kycjm2d
    http://crvr.se/qr1hi-8i9sb-o039nu58e3pn7c6
    http://crvr.se/qr1hi-8i9sb-4tbab0tch69bl5h
    http://crvr.se/qr1hi-8i9sb-jkg3xb3kvcrkvde
    http://crvr.se/qr1hi-8i9sb-mhluou03soxe8f5
    http://crvr.se/qr1hi-8i9sb-5tbv8nsti6vdmic
    http://crvr.se/qr1hi-8i9sb-f38dnwpkyasuw0e
    http://crvr.se/qr1hi-8i9sb-2ba81yr8tnxbtx2
  User not found on host 1 (10%)
    http://crvr.se/qr1hi-8i9sb-h8izsi22winuktc
Updated by Tim Pierce almost 10 years ago
Now at 1ee492f3: this version knows about different failure types and about short names like "sys/docker".
Ward, I think you had some strong desires for the output formatting of this report and ordering of failures which I didn't see captured in the story. Let me know what the specifics are here.
Updated by Ward Vandewege almost 10 years ago
Reviewing 1ee492f33846d35b4ead20fbdbbc3b496719bd86
a) I've updated the report description with a bit more detail about the desired "failures by class (detail)" output with regard to sorting, and a few extra output fields. I also dropped the URL part of the uuid output.
b) I ran the report on qr1hi for the past week, and it worked. Two problems:
- sorting by descending number of occurrences is not working in the failures by class sections
- unknown is reported as 100%, which is wrong; it should be 94%
Failures by class
  sys/docker 4 ( 4%)
  unknown 101 (100%)
  crunch/node 1 ( 1%)
c) I ran the report on qr1hi for the past month, and it bombed out:
$ python crunch-failure-report.py --start '2014-12-01T00:00:00Z' --end '2014-12-31T23:59:59Z'
Traceback (most recent call last):
  File "crunch-failure-report.py", line 169, in <module>
    sys.exit(main())
  File "crunch-failure-report.py", line 99, in main
    logs = job_logs(api, job)
  File "crunch-failure-report.py", line 69, in job_logs
    log_collection = arvados.CollectionReader(job['log'], api)
  File "/usr/local/lib/python2.7/dist-packages/arvados/collection.py", line 127, in __init__
    "Argument to CollectionReader must be a manifest or a collection UUID")
arvados.errors.ArgumentError: Argument to CollectionReader must be a manifest or a collection UUID
d) formatting details:
Looking at the run output below, please fix these things:
- no colons after Started/Successful/...
- fix horizontal alignment under failures by class, also allow for longer failure class names
- no colon after "Failures by class (detail)"
Start: 2014-12-16T00:00:00Z
End: 2014-12-31T23:59:59Z

Overview
  Started: 372
  Successful: 250 (67%)
  Failed: 101 (27%)
  In progress: 21 ( 6%)

Failures by class
  sys/docker 4 ( 4%)
  unknown 101 (100%)
  crunch/node 1 ( 1%)

Failures by class (detail):
  sys/docker 4 ( 4%)
    http://crvr.se/qr1hi-8i9sb-1lh08gfnmcoqx9r
    http://crvr.se/qr1hi-8i9sb-s6t07tokevpjg28
    http://crvr.se/qr1hi-8i9sb-ch7olbbw1nh7u8r
    http://crvr.se/qr1hi-8i9sb-v949by9i93ivf2f
  unknown 101 (100%)
    http://crvr.se/qr1hi-8i9sb-0vhlqfabrjycw7o
    http://crvr.se/qr1hi-8i9sb-2x534aipm08ki75
    http://crvr.se/qr1hi-8i9sb-xyqnaboeq0ji440
    http://crvr.se/qr1hi-8i9sb-khrdlhdwenanqgw
    http://crvr.se/qr1hi-8i9sb-by9rqblusoydhm7
    http://crvr.se/qr1hi-8i9sb-2099gt31wvzmzm0
    http://crvr.se/qr1hi-8i9sb-melwo91k6vvjaow
    http://crvr.se/qr1hi-8i9sb-mkoxrkxhpxlbn63
    http://crvr.se/qr1hi-8i9sb-482dzijmokax2xi
    http://crvr.se/qr1hi-8i9sb-tzosb2voqbsvwtw
    http://crvr.se/qr1hi-8i9sb-v949by9i93ivf2f
    http://crvr.se/qr1hi-8i9sb-61pkx5azm8b5tj5
    http://crvr.se/qr1hi-8i9sb-5swvl2s3b2t1gg9
    http://crvr.se/qr1hi-8i9sb-z4jvf722g7b27m7
    http://crvr.se/qr1hi-8i9sb-dkuzbfnkx1o5mc9
    http://crvr.se/qr1hi-8i9sb-6iknziykdpuchx8
    http://crvr.se/qr1hi-8i9sb-1da5h4bem5lz3pq
    http://crvr.se/qr1hi-8i9sb-honjl0u5dozf94p
    http://crvr.se/qr1hi-8i9sb-vsnbi9p7zk2hhhu
    http://crvr.se/qr1hi-8i9sb-df4rmrwh7xfmr07
    http://crvr.se/qr1hi-8i9sb-01q7xqilnbz9au5
    http://crvr.se/qr1hi-8i9sb-6t21aff3ui9cs9l
    http://crvr.se/qr1hi-8i9sb-f82832fcep7c0o3
    http://crvr.se/qr1hi-8i9sb-3daeal2024jg5pr
    ...
Updated by Tim Pierce almost 10 years ago
Fixes at e5dce69:
Bugs fixed:
- Correct counting and percentage calculation of job failures.
- Jobs were getting categorized as both "unknown" and as a specific failure type.
- Crashes fixed: should not raise any unhandled exceptions.
- Itemized failures are now sorted in descending order by failure type
- Better horizontal alignment
- Modified formatting to account for updated description.
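The counting and sorting fixes listed above amount to giving each failed job exactly one failure code, computing percentages against the number of failed jobs, and ordering classes by count. A minimal sketch, assuming a hypothetical `job_codes` mapping from job UUID to its single code (the function name is also an assumption):

```python
from collections import Counter


def failure_summary(job_codes):
    """Summarize failures as (code, count, percent), most common first.

    job_codes maps each failed job UUID to exactly one failure code, so a
    job can never be counted as both "unknown" and a specific class.
    """
    total = len(job_codes)
    counts = Counter(job_codes.values())
    # Sort by count descending, then by code name so ties are deterministic.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(code, n, round(100.0 * n / total)) for code, n in ordered]
```

Since every job contributes to one class only, the counts sum to the number of failed jobs; the rounded percentages may still add up to 99 or 101.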
Updated by Ward Vandewege almost 10 years ago
Review comments:
Sorting is still not working in failures by class and failures by class (detail). For example, from the 2014-12 report on qr1hi:
Failures by class
  sys/docker 34 ( 14%)
  unknown 204 ( 85%)
  crunch/node 1 ( 0%)
Another bug (?):
Start: 2014-12-01T00:00:00Z
End: 2015-01-01T00:00:00Z

Overview
  Started 750
  Successful 473 ( 63%)
  Failed 239 ( 32%)
  In progress 38 ( 5%)
It seems hard to believe there would be 38 jobs still in progress that started in December 2014.
Updated by Tim Pierce almost 10 years ago
Fixed at a686dcb:
The script failed to take into account jobs in a Queued or Cancelled state. It now reports on all five job states explicitly (Complete, Failed, Queued, Cancelled, Running).
The code to sort the failure states has been corrected.
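Reporting each of the five job states explicitly might look like the sketch below. The `state` key and the `overview` function are assumptions for illustration; whether states with a zero count are printed is left to the caller.

```python
from collections import Counter

# The five job states named in the comment above.
JOB_STATES = ("Complete", "Failed", "Queued", "Cancelled", "Running")


def overview(jobs):
    """Return (label, count, percent) rows for the Overview section.

    jobs is a list of dicts with a hypothetical 'state' key. The first
    row is the total number of jobs started and carries no percentage.
    """
    counts = Counter(job["state"] for job in jobs)
    total = len(jobs)
    rows = [("Started", total, None)]
    for state in JOB_STATES:
        n = counts.get(state, 0)
        pct = round(100.0 * n / total) if total else 0
        rows.append((state, n, pct))
    return rows
```

Tallying every state separately avoids lumping queued or cancelled jobs into "In progress", which is what inflated that figure in the December report.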
Updated by Ward Vandewege almost 10 years ago
Review comments:
$ ./crunch-failure-report.py
Start: 2015-01-05T19:25:33Z
End: 2015-01-06T19:25:33Z

Overview
  Started 59
  Complete 30 ( 51%)
  Failed 29 ( 49%)
Traceback (most recent call last):
  File "./crunch-failure-report.py", line 219, in <module>
    sys.exit(main())
  File "./crunch-failure-report.py", line 203, in main
    job_name = job_pipeline_name(api, job_info['uuid'])
  File "./crunch-failure-report.py", line 107, in job_pipeline_name
    job_pipeline_names[job_uuid] = _lookup_pipeline_name(api, job_uuid)
  File "./crunch-failure-report.py", line 101, in _lookup_pipeline_name
    pt = api.pipeline_templates().get(uuid=pi['pipeline_template_uuid']).execute()
  File "/usr/local/lib/python2.7/dist-packages/apiclient/discovery.py", line 583, in method
    raise TypeError('Missing required parameter "%s"' % name)
TypeError: Missing required parameter "uuid"
Assuming that the problem above is not an SDK version problem, please do not return this branch for review before you can
a) run ./crunch-failure-report.py without arguments on qr1hi
b) run ./crunch-failure-report.py for the entire month of 2014-12 on qr1hi
The report for the month of December looks good.
Updated by Tim Pierce almost 10 years ago
Updated at 4b9208f2b with more comprehensive exception handling. (If a failed pipeline instance has no pipeline_template_uuid, it will be rendered with a blank pipeline name in the failure details.)
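The blank-name fallback described above can be illustrated without the SDK. In this sketch `fetch_template` is a hypothetical callable standing in for the pipeline template API lookup, and `pipeline_name` is not the script's actual function name.

```python
def pipeline_name(pipeline_instance, fetch_template):
    """Return the template name for a pipeline instance, or '' if unavailable.

    fetch_template is a hypothetical callable that takes a template UUID
    and returns a dict with a 'name' key; it stands in for the API call.
    """
    template_uuid = pipeline_instance.get("pipeline_template_uuid")
    if not template_uuid:
        # No template recorded: render a blank name instead of calling the API.
        return ""
    try:
        return fetch_template(template_uuid).get("name", "")
    except Exception:
        # Any lookup failure leaves the name blank rather than aborting the report.
        return ""
```

Checking for a missing UUID before the lookup avoids the "Missing required parameter" TypeError shown in the earlier traceback.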
Ward, I'm sorry for failing to check each push more comprehensively. I have tested this version with:
- no arguments (this defaulted to --start=2015-01-05T20:59:51Z --end=2015-01-06T20:59:51Z)
- --start=2014-12-01T00:00:00Z --end=2015-01-01T00:00:00Z
- --start=2014-11-01T00:00:00Z --end=2014-12-01T00:00:00Z
- --start=2014-11-01T00:00:00Z and no --end parameter (for the maximum possible coverage of testing all job states and all known failure types at once).
In each case I checked that the error types are sorted consistently from most common to least common, that the job counts and error counts add up consistently, and that the percentages reported match the job counts listed.
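The default time window exercised in these test runs (the last 24 hours when no arguments are given, and "now" when --end is omitted) could be derived as sketched here. `parse_window` and its `now` parameter are hypothetical, and starting 24 hours before --end when only --end is given is an assumption.

```python
import argparse
from datetime import datetime, timedelta, timezone

TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_window(argv, now=None):
    """Return (start, end) timestamp strings for the report window.

    now can be injected for testing; it defaults to the current UTC time.
    """
    parser = argparse.ArgumentParser(
        description="Produce a report of Crunch failures within a specified time range")
    parser.add_argument("--start", help="Start date and time")
    parser.add_argument("--end", help="End date and time")
    args = parser.parse_args(argv)

    now = now or datetime.now(timezone.utc)
    end = args.end or now.strftime(TIME_FORMAT)
    # Assumption: without --start, the window begins 24 hours before the end.
    default_start = datetime.strptime(end, TIME_FORMAT) - timedelta(days=1)
    start = args.start or default_start.strftime(TIME_FORMAT)
    return start, end
```

Passing a fixed `now` makes the defaulting logic testable without depending on the wall clock.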
Updated by Ward Vandewege almost 10 years ago
Looking good now. I think this can be merged. Thanks!
Updated by Tim Pierce almost 10 years ago
I made one last change: renaming crunch-failure-report.py to crunch_failure_report.py, which at least permits importing it and therefore (someday) tests to be added.
Merged and pushed. Thanks for your feedback and patience.
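The reason for the rename is that a module name used in a plain `import` statement has to be a valid Python identifier, and a hyphen is not allowed in one. A small check makes this concrete; `importable_name` is a hypothetical helper, not part of the script.

```python
import keyword


def importable_name(filename):
    """Return True if the file's stem can be used in a plain import statement."""
    stem = filename[:-3] if filename.endswith(".py") else filename
    # A module name must be a valid identifier and must not be a keyword.
    return stem.isidentifier() and not keyword.iskeyword(stem)
```

Here `importable_name("crunch-failure-report.py")` is false while `importable_name("crunch_failure_report.py")` is true, so only the renamed file can be pulled into a test module with `import crunch_failure_report`.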
Updated by Tim Pierce almost 10 years ago
- Status changed from In Progress to Resolved
- % Done changed from 67 to 100
Applied in changeset arvados|commit:a32c4f9997a0c8941b62668c5e59941985359c05.