Bug #12891

[crunch2] log collection not saved for cancelled job

Added by Bryan Cosca almost 4 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
01/22/2018
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
-

Description

Bug 1

In some cases, after a container is cancelled, the container's log field and the container request's log_uuid are both null. This makes it impossible for crunchstat-summary to analyze the portion of the job which did run. Example: su92l-xvhdp-4j98m0zgu9xst51

The intended/expected behavior is for the log to be saved in Keep, and accessible via container_request.log_uuid, just as it is when the container exits non-zero.

Bug 2

crunchstat-summary should be able to do some analysis on containers that haven't saved a log to Keep (because they're still running, or because of a bug like this). It can do this with crunch1 jobs.

Bug 1 analysis

crunch-run tries to save a log file after the container ends, regardless of final state, but sometimes this doesn't work.

crunch-dispatch-slurm cancels the slurm job as soon as it notices the container priority is zero. crunch-run catches SIGTERM and tries to write the buffered output and logs, but (according to sample logs) gives up 30-40 seconds later if writing the partial output/logs takes longer than that -- because slurm kills crunch-run with SIGKILL when the slurm KillWait timer (30s here) runs out.

crunch-dispatch-slurm uses scancel to tell crunch-run to stop the container and shut down gracefully. The KillWait behavior is not desirable: instead of plain "scancel", which gets impatient and kills crunch-run forcefully, we want "scancel --signal=TERM", which merely sends the given signal to crunch-run.


Subtasks

Task #12972: Review 12891-log-on-cancel (Resolved, Tom Clegg)


Related issues

Related to Arvados - Bug #12916: [Documentation] container_requests methods (Resolved)

Has duplicate Arvados - Bug #12893: [Crunch2] Logs should be saved to disk when container is cancelled (Duplicate)

Associated revisions

Revision b50ea346
Added by Tom Clegg over 3 years ago

Merge branch '12891-log-on-cancel'

refs #12891

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

History

#1 Updated by Tom Morris almost 4 years ago

  • Subject changed from crunchstat-summary should be able to read partial container/container request logs to [crunch2] log collection not updated for cancelled jobs
  • Description updated (diff)
  • Target version set to To Be Groomed

#2 Updated by Tom Morris almost 4 years ago

  • Has duplicate Bug #12893: [Crunch2] Logs should be saved to disk when container is cancelled added

#3 Updated by Tom Morris almost 4 years ago

  • Tracker changed from Feature to Bug
  • Description updated (diff)

#4 Updated by Sarah Zaranek almost 4 years ago

Workaround provided by Peter:

arv log list --order "created_at asc" \
  --filters '[["object_uuid", "=", "su92l-dz642-fvn104rkw6hq1u0"], ["event_type", "=", "crunch-run"]]' \
  | jq -r '.items[].properties.text'

The same works with event_type "stdout", "stderr", "crunchstat", and "arv-mount".

#5 Updated by Tom Clegg almost 4 years ago

  • Description updated (diff)

#7 Updated by Tom Morris almost 4 years ago

  • Target version changed from To Be Groomed to 2018-01-31 Sprint

#8 Updated by Tom Clegg almost 4 years ago

  • Related to Bug #12916: [Documentation] container_requests methods added

#9 Updated by Tom Clegg almost 4 years ago

  • Assigned To set to Tom Clegg

#10 Updated by Tom Clegg almost 4 years ago

  • Status changed from New to In Progress
  • crunch-dispatch-slurm cancels the slurm job as soon as it notices the container is cancelled. crunch-run catches SIGTERM and tries to write the buffered output and logs, but (according to sample logs) seems to give up 30-40 seconds later without actually writing them.

(this should be "...notices the container has priority=0")

As for "seems to give up 30-40 seconds later": that matches KillWait=30 in su92l's slurm config.

"To cancel a job, invoke scancel without --signal option. This will send first a SIGCONT to all steps to eventually wake them up followed by a SIGTERM, then wait the KillWait duration defined in the slurm.conf file and finally if they have not terminated send a SIGKILL. This gives time for the running job/step(s) to clean up." -- slurm docs

(Note "time to clean up" is not necessarily enough time to clean up.)

  • even if crunch-run gets that far, it seems apiserver would refuse to update the output or log field of a container whose state is Cancelled.

(It would be helpful to check server logs to determine whether crunch-run is in fact getting as far as calling collections#create and containers#update.)

In the reported example, crunch-run did not get this far.

There are a few other oddities here:
  • crunch-run ignores everything after the first signal, even if there was an error trying to stop the container in response to the first signal
  • crunch-run uses ContainerStop with a 10 nanosecond timeout, which means "send TERM; if still alive in 10ns then send KILL". This was probably intended to be 10 seconds, which is the default value for the "docker container stop" CLI command. But we pass 10ns to the docker client library, which rounds it down to "t=0" for the docker API call. (Does docker take this to mean "no timeout" or "timeout immediately"?)
    • crunch-run doesn't log anything when ContainerStop succeeds so we don't know how long that docker API call took (it might have taken 0s, 10s, 30s, ...).
    • crunch-run didn't log either "Container exited with code: ..." or "container wait: ...", which seems to mean ContainerWait() didn't return within KillWait=30 seconds of the first scancel.
    • https://github.com/moby/moby/issues/34879 says short timeouts (like 0) can cause "docker stop" to take ~60 seconds, even though the container stops quickly. The stop() function holds the cStateLock until ContainerStop() returns, although even that doesn't seem to explain why crunch-run didn't say "Container exited with code: ..." within 30 seconds.

...but the basic problem is that crunch-run has a slurm-config-determined amount of time to write logs and output, after which it's killed with no chance to report anything. For most programs this at least ensures jobs can be killed even if they're ignoring signals, but not in this case, because crunch-run isn't the only thing that needs to be killed. If crunch-run is still alive because it's trying to kill the docker container, killing crunch-run seems final from slurm's perspective, but it doesn't account for the docker container, which can then stay alive/running for a long time without being visible to slurm or Arvados.

According to experiments, calling scancel --signal=TERM on a queued job waits for the job to start, then sends TERM. This seems to mean the appropriate choice of scancel command depends on whether the slurm job is running:
  • "scancel --name=$uuid --signal=TERM --state=running" ...if the job is running. "Notify crunch-run that we want to stop/cancel the container."
  • "scancel --name=$uuid --signal=TERM --state=suspended" ...if the job is suspended. Unless/until we're ready to implement "suspend" by pausing the container, we should probably be ignoring SIGSTOP (slurm will send a STOP signal but currently that will just pause crunch-run, not the container).
  • "scancel --name=$uuid --state=pending" ...if the slurm job is not running yet (and again with --state=suspended for good measure?). "Tell slurm not to bother starting our job".

#11 Updated by Tom Clegg almost 4 years ago

  • Subject changed from [crunch2] log collection not updated for cancelled jobs to [crunch2] log collection not saved for cancelled job
  • Description updated (diff)

#12 Updated by Tom Clegg almost 4 years ago

The docker stop implementation indicates that timeout=0 means "send SIGTERM, then immediately send SIGKILL, then wait forever for the container to stop".

#13 Updated by Tom Clegg over 3 years ago

12891-log-on-cancel @ 8758f3ddc81b9ef9aaab111eb331e452c4ec7de5

Test run 9tee4-xvhdp-20xohwwgqv4e382:

2018-01-22T17:32:05.003623945Z Attaching container streams
2018-01-22T17:32:05.327258245Z Starting Docker container id '4de05e8f838fec6595a4d12ab10da215d25ff5b2e9bcb01982bbb5544f14a91c'
2018-01-22T17:32:05.849889035Z Waiting for container to finish
2018-01-22T17:32:05.627904905Z notice: reading stats from /sys/fs/cgroup/cpuacct//slurm/uid_0/job_5726/step_4294967294/4de05e8f838fec6595a4d12ab10da215d25ff5b2e9bcb01982bbb5544f14a91c/cgroup.procs
2018-01-22T17:32:05.628018890Z notice: reading stats from /sys/fs/cgroup/memory//slurm/uid_0/job_5726/step_4294967294/4de05e8f838fec6595a4d12ab10da215d25ff5b2e9bcb01982bbb5544f14a91c/memory.stat
2018-01-22T17:32:05.628304205Z mem 0 cache 0 swap 0 pgmajfault 2277376 rss
2018-01-22T17:32:05.628359477Z notice: reading stats from /sys/fs/cgroup/cpuacct//slurm/uid_0/job_5726/step_4294967294/4de05e8f838fec6595a4d12ab10da215d25ff5b2e9bcb01982bbb5544f14a91c/cpuacct.stat
2018-01-22T17:32:05.628472527Z notice: reading stats from /sys/fs/cgroup/cpuset//slurm/uid_0/job_5726/step_4294967294/4de05e8f838fec6595a4d12ab10da215d25ff5b2e9bcb01982bbb5544f14a91c/cpuset.cpus
2018-01-22T17:32:05.628520158Z cpu 0.0100 user 0.0000 sys 20 cpus
2018-01-22T17:32:06.989658455Z caught signal: terminated
2018-01-22T17:32:06.989729840Z removing container
2018-01-22T17:32:07.450846030Z Container exited with code: 137
2018-01-22T17:32:08.031838047Z Cancelled
2018-01-22T17:32:08.326451146Z caught signal: terminated
2018-01-22T17:32:08.734333019Z Running [arv-mount --unmount-timeout=8 --unmount /tmp/keep425513246]
2018-01-22T17:32:09.737506706Z crunch-run finished

As this log shows:
  • crunch-dispatch-slurm sends SIGTERM to crunch-run
  • crunch-run hears (and logs) multiple SIGTERM, not just the first one
  • force-removing the container results in "Container exited with code: 137" (aka SIGKILL)
  • we can see the timing between "removing container" and "container exited" (0.46s here)
  • crunch-run cleans up (arv-mount) and writes a log (9tee4-4zz18-eqa242udx9za0lk) after killing the container
Tested but not shown:
  • cancel a container after submitting to slurm, but before slurm is ready to run it
Not tested, but seems convincing:
  • now that we don't get impatient and send SIGKILL to crunch-run, it can stay alive longer than 30 seconds if it takes that long to write a log collection

#15 Updated by Peter Amstutz over 3 years ago

-        // Suppress multiple calls to stop()
-        runner.cStarted = false

The reason that code was there is this:

    go func() {
        <-runner.ArvMountExit
        if runner.cStarted {
            runner.CrunchLog.Printf("arv-mount exited while container is still running.  Stopping container.")
            runner.stop()
        }
    }()

If we don't set cStarted to false (or otherwise adjust the logic) it will intermittently print out an error and call stop() when arv-mount terminates, even during normal shutdown.

#16 Updated by Tom Clegg over 3 years ago

Aha. Eliminated cStarted entirely, by adding "arv-mount exited first" as a 3rd possibility in the 2-way select.

Fixed some possible races in the test suite while updating to use "ContainerWait() was called" instead of "cStarted is true" as the trigger for the "stop container" tests.

12891-log-on-cancel @ e6fa5ac95cff84a7b00dd8ab32349d88473400da

#17 Updated by Peter Amstutz over 3 years ago

Tom Clegg wrote:

Aha. Eliminated cStarted entirely, by adding "arv-mount exited first" as a 3rd possibility in the 2-way select.

Fixed some possible races in the test suite while updating to use "ContainerWait() was called" instead of "cStarted is true" as the trigger for the "stop container" tests.

12891-log-on-cancel @ e6fa5ac95cff84a7b00dd8ab32349d88473400da

This LGTM.

#18 Updated by Tom Clegg over 3 years ago

  • Status changed from In Progress to Resolved
