Bug #4889

[Crunch] Logging with a progress bar deadlocks on output

Added by Bryan Cosca over 5 years ago. Updated about 5 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Tim Pierce
Category:
Crunch
Target version:
Start date:
12/30/2014
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
1.0

Description

I ran the cufflinks tool (http://cole-trapnell-lab.github.io/cufflinks/manual/), which logs its own progress by drawing a loading bar and updating it roughly every millisecond(?). I ran this job: qr1hi-8i9sb-ajgkcn4cuk4cntm and its last log entry was at the point just before the loading bar would come up. The job then ran ad infinitum and I had to cancel it. (The job ran in 20 minutes here: qr1hi-8i9sb-pmu3ffh19dny625 using a --quiet flag.)

Possible solution: use Popen with communicate() instead of subprocess.call. https://docs.python.org/2/library/subprocess.html#subprocess.call "Do not use stdout=PIPE or stderr=PIPE with this function as that can deadlock based on the child process output volume. Use Popen with the communicate() method when you need pipes."

logging_addendum_2.gif (247 KB) logging_addendum_2.gif Bryan Cosca, 12/30/2014 04:40 PM
logging_addendum_3.gif (245 KB) logging_addendum_3.gif Bryan Cosca, 12/30/2014 04:47 PM

Subtasks

Task #5042: Review 4889-crunch-deadlock (Resolved, Tim Pierce)

Associated revisions

Revision 38e27663
Added by Tim Pierce about 5 years ago

Merge branch '4889-crunch-deadlock'

Fixes #4889.

History

#1 Updated by Bryan Cosca over 5 years ago

#2 Updated by Bryan Cosca over 5 years ago

  • Subject changed from Logging with a progress bar to Cufflinks: Logging with a progress bar

#3 Updated by Tim Pierce over 5 years ago

  • Subject changed from Cufflinks: Logging with a progress bar to [Crunch] Logging with a progress bar deadlocks on output
  • Description updated (diff)
  • Category set to Crunch
  • Target version set to Bug Triage

#4 Updated by Tim Pierce over 5 years ago

  • Target version changed from Bug Triage to 2015-01-28 Sprint

#5 Updated by Tim Pierce over 5 years ago

  • Assigned To set to Tim Pierce
  • Story points set to 1.0

#6 Updated by Tim Pierce about 5 years ago

  • Description updated (diff)

#7 Updated by Tim Pierce about 5 years ago

  • Description updated (diff)

#8 Updated by Tim Pierce about 5 years ago

Note that subprocess.Popen.communicate() is really only intended to communicate with a single subprocess, e.g.

pipe = subprocess.Popen(['/bin/foo', '-a', '--lots-of-output'],
                        stdin=subprocess.PIPE,
                        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
(stdoutdata, stderrdata) = pipe.communicate()

The communicate() method sends whatever input data is necessary to the subprocess for the pipe and then collects output and error data, returning only when the process has terminated. If the output of this pipe goes to another pipe, you're out of luck -- communicate() doesn't really know how to manage multiple pipes at once. See http://stackoverflow.com/questions/6341451/piping-together-several-subprocesses

(However, if the first process in the pipeline doesn't require anything on standard input, we may be able to get away with just calling .communicate() on the last process in the pipeline. Let's try.)

One of the answers to that Stack Overflow question suggests using the pipes module (https://docs.python.org/2/library/pipes.html), which includes this reassuring note:

Because the module uses /bin/sh command lines, a POSIX or compatible shell for os.system() and os.popen() is required.

#9 Updated by Tom Clegg about 5 years ago

Hm, are we using communicate() someplace where this progress bar could be going? source:crunch_scripts/run-command#L350 seems to be using Popen already. And if we were running commands with communicate(), we wouldn't be accustomed to seeing their stderr until they exited, right?

The crunchstat messages show CPU usage dropping off suddenly at about 22 minutes in, followed by the disappearance of the task's docker container ("reading stats from /sys/fs/cgroup/memory/memory.stat"). Is it possible that docker died of line buffering? Or that docker died of unrelated causes, and re-running the job with the progress bar would have worked just as well? It also seems possible it's crunchstat's fault (see CopyPipeToChan), but the container disappearance seems like a clue worth investigating. "Lots of stderr with no \n" seems like an easy crunch script to write & add to our Pathological Scripts repository.

In the meantime, another workaround suggestion: cufflinks has a --verbose flag, for "log-friendly verbose processing". Possibly more appropriate than --quiet here.

#10 Updated by Tim Pierce about 5 years ago

Have not been able to reproduce this problem by running a script generating 70MB of output without a newline as a local job:

#! /usr/bin/env python

# Test #4889: "Logging with a progress bar deadlocks on output" 

import sys

for i in range(0,70000000):
    sys.stdout.write('x')

... run with this job definition:

{
  "repository": "",
  "script": "test_4889.py",
  "script_version": "/home/twp/test4889",
  "script_parameters": {},
  "runtime_constraints": {
    "docker_image": "bcosc/lobster" 
  }
}

When this script is run as a local job, the job terminates within a few minutes (with a "failed" status) and does not deadlock.

I have attempted to rerun the original failing job on qr1hi, but reproducing that will require fixing #5006 first.

#11 Updated by Tim Pierce about 5 years ago

The original job continues to deadlock in production, e.g. qr1hi-8i9sb-6ja93cyxd7mjm1r

I will try reproducing with a garbage-generating dummy script run on a compute node and not locally.

#12 Updated by Tim Pierce about 5 years ago

Job qr1hi-8i9sb-7b7seh3jgk8zoe1 doesn't deadlock either (a Python script generating 16 MiB of output, no newlines, in the same Docker image as the original job).

#13 Updated by Tim Pierce about 5 years ago

  • Status changed from New to In Progress

#14 Updated by Tim Pierce about 5 years ago

Successfully reproduced in qr1hi-d1hrv-dftjw06d6rkjc2w. The problem is that the process deadlocks on the stderr buffer, not on stdout:

(pysdk)hitchcock:/home/twp/twp-qr1hi/crunch_scripts% git show 57c73203:crunch_scripts/test_4889.py
#! /usr/bin/env python

# Test #4889: "Logging with a progress bar deadlocks on output" 
#
# Theory: the crunch job deadlocks because it waits for the job
# to terminate before reading output, and certain jobs fill up
# the output line buffer by producing huge amounts of data
# without a newline.

import sys

for i in range(0,16000000):
  sys.stderr.write('x')

#15 Updated by Tim Pierce about 5 years ago

And this looks likely to be the culprit: source:services/crunchstat/crunchstat.go@5870334b03237f04df6de35193f78b838f7ac7dc#L38

crunchstat collects stderr output from the container with CopyPipeToChan, which reads stderr data with a bufio.Scanner that defaults to reading records line-by-line.

The documentation for bufio.NewScanner claims that it will stop scanning upon encountering "a token too large for the buffer," which suggests that it should actually terminate in this case rather than hang. I'll see if I can reproduce this on the command line, or whether I can eliminate the issue by using a bufio.Reader instead.

#16 Updated by Tim Pierce about 5 years ago

Ready for review at 4c731f759.

It's a workaround, not a comprehensive fix: close the input pipe before returning, and if bufio.Scanner reports a non-nil error after Scan() stops, report that error to the log channel. Includes a unit test.

#17 Updated by Tom Clegg about 5 years ago

This fixes the "stderr goes silent" problem and (probably?) kills the child by closing the pipe, which is definitely better than deadlocking the child.

I think the error message should be prefixed with "crunchstat: " like crunchstat's other logs, so it's more obvious (to someone who isn't already familiar with crunchstat internals) who's saying it. Suggest:
  • ...("crunchstat: line-buffering error: %s", s.Err())

Other than that, lgtm

#18 Updated by Anonymous about 5 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:38e27663cf656f0c9c443a2715f249afe39a8bfb.
