Bug #12891

closed

[crunch2] log collection not saved for cancelled job

Added by Bryan Cosca over 6 years ago. Updated about 6 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Story points: -

Description

Bug 1

In some cases, after a container is cancelled, the container's log field and the container request's log_uuid are both null. This makes it impossible for crunchstat-summary to analyze the portion of the job which did run. Example: su92l-xvhdp-4j98m0zgu9xst51

The intended/expected behavior is for the log to be saved in Keep, and accessible via container_request.log_uuid, just as it is when the container exits non-zero.

Bug 2

crunchstat-summary should be able to do some analysis on containers that haven't saved a log to Keep (because they're still running, or because of a bug like this). It can do this with crunch1 jobs.

Bug 1 analysis

crunch-run tries to save a log file after the container ends, regardless of final state, but sometimes this doesn't work.

crunch-dispatch-slurm cancels the slurm job as soon as it notices the container priority is zero. crunch-run catches SIGTERM and tries to write the buffered output and logs. According to sample logs, if writing the partial output/logs takes longer than 30-40 seconds, crunch-run never finishes the job, because slurm kills crunch-run with SIGKILL when the slurm KillWait timer (30s here) runs out.

crunch-dispatch-slurm is using scancel to notify crunch-run that it should terminate the container and cancel gracefully. The KillWait behavior is not desirable. Instead of using plain "scancel", which gets impatient and kills crunch-run forcefully, we want "scancel --signal=TERM", which merely sends the given signal to crunch-run.


Subtasks 1 (0 open, 1 closed)

Task #12972: Review 12891-log-on-cancel (Resolved, Tom Clegg, 01/22/2018)

Related issues

Related to Arvados - Bug #12916: [Documentation] container_requests methods (Resolved)
Has duplicate Arvados - Bug #12893: [Crunch2] Logs should be saved to disk when container is cancelled (Duplicate)
