Bug #11170

Stale squeue processes on c97qk caused by "crunch-dispatch --jobs"

Added by Ward Vandewege 5 months ago. Updated 4 months ago.

Status: Resolved
Priority: Normal
Assignee: Lucas Di Pentima
Category: -
Target version: 2017-03-29 sprint
Start date: 03/23/2017
Due date:
% Done: 100%
Story points: -
Remaining (hours): 0.00 hour
Velocity based estimate: -

Description

There were a total of 4694 defunct (zombie) squeue processes, all children of the "crunch-dispatch.rb --jobs" process, representing a significant resource leak.

# ps auxwf 

...
root      1339  0.0  0.0    196    32 ?        Ss    2016   0:52 runsvdir -P /etc/service log: ......................................................................................................
.....................................................................................................................................................................................................
................................................................................................
root      1357  0.0  0.0    176    32 ?        Ss    2016   0:00  \_ runsv crunch-dispatch-jobs-0
root      1433  0.0  0.0    192    48 ?        S     2016   0:57  |   \_ svlogd -tt /etc/sv/crunch-dispatch-jobs-0/log/main
root     46325  7.4  1.9 461112 138060 ?       Sl   Feb25 177:18  |   \_ ./script/crunch-dispatch.rb --jobs                                                                                          

root     46919  0.0  0.0      0     0 ?        Z    Feb25   0:00  |       \_ [squeue] <defunct>
root     47929  0.0  0.0      0     0 ?        Z    Feb25   0:00  |       \_ [squeue] <defunct>
root     48991  0.0  0.0      0     0 ?        Z    Feb25   0:00  |       \_ [squeue] <defunct>
root     49948  0.0  0.0      0     0 ?        Z    Feb25   0:00  |       \_ [squeue] <defunct>
root     51172  0.0  0.0      0     0 ?        Z    Feb25   0:00  |       \_ [squeue] <defunct>
root     52131  0.0  0.0      0     0 ?        Z    Feb25   0:00  |       \_ [squeue] <defunct>
root     53174  0.0  0.0      0     0 ?        Z    Feb25   0:00  |       \_ [squeue] <defunct>
...
root      3988  0.0  0.0      0     0 ?        Z    18:52   0:00  |       \_ [squeue] <defunct>
root      5015  0.0  0.0      0     0 ?        Z    18:53   0:00  |       \_ [squeue] <defunct>
root      6157  0.0  0.0      0     0 ?        Z    18:54   0:00  |       \_ [squeue] <defunct>
root      7388  0.0  0.0      0     0 ?        Z    18:55   0:00  |       \_ [squeue] <defunct>
root      8527  0.0  0.0      0     0 ?        Z    18:56   0:00  |       \_ [squeue] <defunct>
root      9515  0.0  0.0      0     0 ?        Z    18:57   0:00  |       \_ [squeue] <defunct>
root     10627  0.0  0.0      0     0 ?        Z    18:58   0:00  |       \_ [squeue] <defunct>
root     11657  0.0  0.0      0     0 ?        Z    18:59   0:00  |       \_ [squeue] <defunct>
root     12996  0.0  0.0      0     0 ?        Z    19:00   0:00  |       \_ [squeue] <defunct>
root     14366  0.0  0.0      0     0 ?        Z    19:02   0:00  |       \_ [squeue] <defunct>
root     14676  0.0  0.0  10468  2192 pts/0    S+   19:02   0:00          \_ grep --color=auto squeu
c97qk:~# ps auxwf |grep squeu |wc
   4695   65730  450725
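
The wc count above is 4695 because it includes the grep process itself; the number of zombies is 4694. A minimal Ruby sketch of a count that avoids that off-by-one, assuming `ps aux`-style columns where the eighth whitespace-separated field is the process state. `count_defunct` is a hypothetical helper for illustration, not part of crunch-dispatch:

```ruby
# Hypothetical helper: count defunct (zombie) processes with a given name
# in `ps aux`-style output. Assumes the 8th column is STAT and that zombies
# are shown as "[name] <defunct>", as in the listing above.
def count_defunct(ps_output, name)
  ps_output.each_line.count do |line|
    fields = line.split
    fields.length >= 11 &&
      fields[7].start_with?('Z') &&
      line.include?("[#{name}] <defunct>")
  end
end

# Example against live output (assumes a procps-style ps on PATH):
#   puts count_defunct(`ps auxwf`, 'squeue')
```

Filtering on the state column rather than on the command text is what keeps the grep line out of the total.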


Subtasks

Task #11269: Review 11170-stale-squeue-procs (Status: Resolved, Assignee: Peter Amstutz)

Associated revisions

Revision 83203f5c
Added by Lucas Di Pentima 4 months ago

Merge branch '11170-stale-squeue-procs'
Closes #11170

History

#1 Updated by Ward Vandewege 5 months ago

  • Description updated (diff)

#2 Updated by Ward Vandewege 5 months ago

  • Subject changed from stale squeue processes on c97qk to stale squeue processes on c97qk caused by crunch-dispatch --jobs

#3 Updated by Tom Morris 4 months ago

  • Project changed from OPS to Arvados
  • Subject changed from stale squeue processes on c97qk caused by crunch-dispatch --jobs to Stale squeue processes on c97qk caused by "crunch-dispatch --jobs"
  • Description updated (diff)
  • Target version set to 2017-03-29 sprint

#4 Updated by Lucas Di Pentima 4 months ago

  • Assignee set to Lucas Di Pentima

#5 Updated by Lucas Di Pentima 4 months ago

  • Status changed from New to In Progress

#6 Updated by Lucas Di Pentima 4 months ago

Updated at branch 11170-stale-squeue-procs - f31475d
Test run: https://ci.curoverse.com/job/developer-run-tests/195/

Used Process::detach on both File.popen(...) calls so that each child's exit status gets collected by a separate thread on completion.
Ref: https://ruby-doc.org/core-2.1.1/Process.html#method-c-detach
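
For illustration, a minimal sketch of the Process.detach approach described in this comment. It is not the code from the branch: `read_lines_detached` is a hypothetical name, and `echo` stands in for the real squeue invocation.

```ruby
# Sketch only: `cmd` stands in for the real squeue command line.
def read_lines_detached(cmd)
  io = IO.popen(cmd)
  # Process.detach starts a thread that waits on the child, so the child
  # is reaped when it exits instead of lingering as <defunct>.
  reaper = Process.detach(io.pid)
  lines = io.readlines.map(&:strip)
  [lines, reaper]
end

lines, reaper = read_lines_detached(['echo', 'hello'])
status = reaper.value # Process::Status of the finished child
```

This reaps the child, but the pipe itself is never closed explicitly, so its file descriptor stays open until the IO object is garbage collected.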

#7 Updated by Peter Amstutz 4 months ago

squeue_jobs and scancel should use the block form of IO.popen() so that it is closed automatically. See stdout_s
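
A minimal sketch of the block form suggested here, assuming a helper named `read_lines` (hypothetical) and using `echo` in place of squeue. With a block, IO.popen closes the pipe and waits for the child when the block exits, and returns the block's value:

```ruby
# Sketch only: `cmd` stands in for the real squeue/scancel command line.
def read_lines(cmd)
  # Block form: the pipe is closed and the child reaped when the block
  # returns, even if the block raises.
  IO.popen(cmd) do |io|
    io.readlines.map(&:strip)
  end
end

names = read_lines(['echo', 'job-a'])
exit_status = $? # set by IO.popen's implicit close
```

Because the close also waits for the child, no separate Process.detach thread is needed in this form.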

#9 Updated by Lucas Di Pentima 4 months ago

New updates at 077878d
Test run: https://ci.curoverse.com/job/developer-run-tests/197/

I've updated the tests so they stub the IO class instead of File.

#10 Updated by Peter Amstutz 4 months ago

Can we get

      p = IO.popen(['squeue', '-a', '-h', '-o', '%j'])
      begin
        l = p.readlines.map {|line| line.strip}
      ensure
        p.close
      end

#11 Updated by Lucas Di Pentima 4 months ago

Done: 2741b54

#12 Updated by Peter Amstutz 4 months ago

Lucas Di Pentima wrote:

Done: 2741b54

LGTM

#13 Updated by Lucas Di Pentima 4 months ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:83203f5c739ee0b0199e76babccb60e832a0de8e.
