Bug #7263

[Crunch] crunch-job does not cancel jobs consistently

Added by Peter Amstutz about 6 years ago. Updated almost 6 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Start date: 01/23/2016
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: 0.5

Description

Original bug report

c97qk-8i9sb-rjmjzrz54gf61os on c97qk was cancelled, but crunch-job doesn't notice and lets the job continue running.

We also saw this recently with a job on su92l.

Diagnosis I

The main loop of crunch-job has this code:

    my $gotsome
      = readfrompipes ()
      + reapchildren ();
    if (!$gotsome)
    {
      check_refresh_wanted();
      check_squeue();
      update_progress_stats();
      select (undef, undef, undef, 0.1);
    }

check_refresh_wanted is the function that notices when the job has been canceled and signals the child processes accordingly. If the cancel comes in while we're in this loop, we're unlikely to notice it as long as we keep receiving data from the child processes, which seems pretty likely now that we have crunchstat in the mix.

This diagnosis is not a 100% sure thing, because the pipes we're reading from are opened in nonblocking mode. So they would have to be very busy to keep avoiding this branch. But it's at least still possible, particularly since this problem seems to strike jobs that have many parallel tasks.

Diagnosis II

See Fix II below.

Fix I

Suggest checking refresh_trigger (and the other things) every ~two seconds, even during times when readfrompipes() and reapchildren() are keeping us 100% busy.

-    if (!$gotsome) {
+    if (!$gotsome || $latest_refresh + 2 < scalar time) {

After this is merged and deployed, test whether you can successfully cancel massively parallel jobs. If it works consistently, I think we can mark this ticket resolved. Otherwise, we'll need to do another pass on this.
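The effect of Fix I can be sketched outside of Perl. Below is a minimal Python model of the patched main loop, assuming `read_from_pipes`, `reap_children`, and `check_refresh_wanted` are caller-supplied stand-ins rather than crunch-job's real functions: even when the workers keep us 100% busy, the refresh check still fires roughly every `refresh_interval` seconds.

```python
import time

def main_loop(read_from_pipes, reap_children, check_refresh_wanted,
              refresh_interval=2.0, run_for=0.05):
    """Model of the fixed crunch-job main loop. The callables are
    hypothetical stand-ins; each returns a truthy value when it made
    progress. Returns how many refresh checks ran."""
    latest_refresh = 0.0
    refreshes = 0
    deadline = time.time() + run_for
    while time.time() < deadline:
        got_some = read_from_pipes() + reap_children()
        # The fix: refresh not only when idle, but also whenever the
        # last refresh is more than refresh_interval seconds old.
        if not got_some or latest_refresh + refresh_interval < time.time():
            check_refresh_wanted()
            refreshes += 1
            latest_refresh = time.time()
            if not got_some:
                time.sleep(0.1)   # idle: back off briefly, as before
    return refreshes
```

The point of the time-based guard is that cancel detection no longer depends on the loop ever going idle; a constant stream of task output can only delay a refresh by at most `refresh_interval` seconds.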

Fix II

In source:sdk/cli/bin/crunch-job, do not stay in this loop forever:
  •     while (0 < sysread ($reader{$job}, $buf, 8192))
        {
          print STDERR $buf if $ENV{CRUNCH_DEBUG};
          $jobstep[$job]->{stderr_at} = time;
          $jobstep[$job]->{stderr} .= $buf;
          preprocess_stderr ($job);
          if (length ($jobstep[$job]->{stderr}) > 16384)
          {
            substr ($jobstep[$job]->{stderr}, 0, 8192) = "";
          }
          $gotsome = 1;
        }
    
  • Consider raising 8192 -- perhaps 65536.
  • Consider changing that "while" to an "if".
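The concern behind these two suggestions can be illustrated with a hedged Python model (not crunch-job's actual code): with "if"-style draining, each main-loop pass reads at most one chunk from a task's nonblocking stderr pipe, so a very chatty task can no longer pin the loop, and the buffer trimming mirrors the substr trick above. The 65536 chunk size follows the suggestion above; the function name is hypothetical.

```python
import os

def drain_one_chunk(fd, buffers, job, chunk_size=65536, keep=16384, trim=8192):
    """Read at most ONE chunk from a task's nonblocking stderr pipe per
    main-loop pass ("if" instead of "while"), so cancel checks are not
    starved. Once a buffer exceeds `keep` bytes, drop the oldest `trim`
    bytes, mirroring crunch-job's substr(..., 0, 8192) = "".
    Returns True if any data was read (the $gotsome signal)."""
    try:
        buf = os.read(fd, chunk_size)
    except BlockingIOError:       # nothing ready on this nonblocking fd
        return False
    if not buf:                   # EOF: writer closed the pipe
        return False
    buffers[job] = buffers.get(job, b"") + buf
    if len(buffers[job]) > keep:
        buffers[job] = buffers[job][trim:]
    return True
```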

Subtasks

Task #8261: Review 7263-better-busy-behavior (Resolved, Peter Amstutz)

Task #8459: Review last commit 0f7709af on wtsi-hgi:hgi/7263-even-better-busy-behavior (Resolved, Tom Clegg)

Associated revisions

Revision 01fbbc4a (diff)
Added by Brett Smith about 6 years ago

7263: crunch-job checks for refreshes every two seconds.

This avoids the possibility that a constant stream of data from tasks
can prevent the job from being canceled. Refs #7263.

Revision 74bfe6e5
Added by Tom Clegg almost 6 years ago

Merge branch '7263-better-busy-behavior' refs #7263

Revision ec51cd72
Added by Tom Clegg almost 6 years ago

Merge branch '7263-batch-task-lookup' closes #7263

History

#1 Updated by Peter Amstutz about 6 years ago

  • Description updated (diff)

#2 Updated by Brett Smith about 6 years ago

  • Subject changed from [Crunch] crunch-job not cancelling job to [Crunch] crunch-job does not cancel jobs specifically
  • Description updated (diff)

#3 Updated by Brett Smith about 6 years ago

  • Category set to Crunch
  • Target version set to Arvados Future Sprints

#4 Updated by Brett Smith about 6 years ago

  • Subject changed from [Crunch] crunch-job does not cancel jobs specifically to [Crunch] crunch-job does not cancel jobs consistently

#5 Updated by Tom Clegg about 6 years ago

Suggest checking refresh_trigger (and the other things) every ~two seconds, even during times when readfrompipes() and reapchildren() are keeping us 100% busy.

-    if (!$gotsome) {
+    if (!$gotsome || $latest_refresh + 2 < scalar time) {

(I'm not confident this will fix the issue at hand, but it's a good idea anyway.)

#6 Updated by Brett Smith about 6 years ago

  • Description updated (diff)
  • Status changed from New to In Progress
  • Story points set to 0.5

#7 Updated by Brett Smith about 6 years ago

  • Status changed from In Progress to Feedback
  • Assigned To set to Brett Smith

#8 Updated by Brett Smith about 6 years ago

FWIW I canceled a few jobs recently and it worked as expected. I'd want to test with a job that has lots of parallel tasks before declaring victory though.

#9 Updated by Peter Amstutz about 6 years ago

I should note that I just updated the API server Gemfile pin on arvados-cli, which previously was 3 months out of date. So the specific fix made for this ticket may or may not be in production yet.

#10 Updated by Brett Smith about 6 years ago

Peter Amstutz wrote:

I should note that I just updated the API server Gemfile pin on arvados-cli, which previously was 3 months out of date. So the specific fix made for this ticket may or may not be in production yet.

Our clusters run crunch-dispatch with CRUNCH_JOB_BIN set to point to a copy of crunch-job from the arvados-src package. So it uses the version deployed through that package, regardless of what the API server's Gemfile says.

#11 Updated by Brett Smith about 6 years ago

  • Target version deleted (Arvados Future Sprints)

#12 Updated by Brett Smith about 6 years ago

  • Status changed from Feedback to New

We have a report from a user that a job on their cluster failed to cancel. Similar circumstances: no trouble canceling jobs with few tasks, but failed to cancel a job that had more tasks than available cores.

#13 Updated by Joshua Randall about 6 years ago

Actually, the number of tasks (400) was exactly equal to the number of cores on the allocated nodes (10 nodes with 40 cores each). I'm not running any compute on the head node, so all 40 cores there are available for the API server, crunch-dispatch, etc.

I manually changed the crunchstat reporting period from 10s to 300s and ran it again, and this time it was able to cancel the job (although it took ~4 minutes for the cancellation to complete and free up the nodes).

#14 Updated by Brett Smith about 6 years ago

  • Target version set to Arvados Future Sprints

#15 Updated by Tom Clegg almost 6 years ago

  • Description updated (diff)

#16 Updated by Nico César almost 6 years ago

Just what happened today as an example:

The API server got rebooted (and with it the job dispatcher), but the slurm controller kept running on manage.wx7k5, so the state ended up as:

compute0.wx7k5:/home/nico# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up   infinite     34 drain* compute[2-35]
compute*     up   infinite    220  down* compute[36-255]
compute*     up   infinite      2  alloc compute[0-1]
crypto       up   infinite     34 drain* compute[2-35]
crypto       up   infinite    220  down* compute[36-255]
crypto       up   infinite      2  alloc compute[0-1]
compute0.wx7k5:/home/nico# squeue_long 
  JOBID PARTITION NAME     USER ST       TIME  NODES NODELIST(REASON)
    868   compute wx7k5-8i9sb-258pea56x155uqa   crunch  R   21:14:04      2 compute[0-1]

but no docker container running:

compute0.wx7k5:/home/nico# docker ps -a
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES

the solution was

wx7k5:~# scancel 868

Should we have a way to do an scancel from Workbench?

#17 Updated by Peter Amstutz almost 6 years ago

  • Target version changed from Arvados Future Sprints to 2016-02-03 Sprint

#18 Updated by Peter Amstutz almost 6 years ago

  • Assigned To changed from Brett Smith to Peter Amstutz

#19 Updated by Peter Amstutz almost 6 years ago

  • Target version changed from 2016-02-03 Sprint to Arvados Future Sprints

#20 Updated by Tom Clegg almost 6 years ago

  • Assigned To changed from Peter Amstutz to Tom Clegg

#21 Updated by Brett Smith almost 6 years ago

  • Target version changed from Arvados Future Sprints to 2016-02-17 Sprint

0.5 points to implement fix II.

#22 Updated by Joshua Randall almost 6 years ago

I've merged tomclegg's 7263-better-busy-behavior branch with the substantial performance changes I've made to our crunch-job to address this issue.

The fixes in my branch are:

readfrompipes:
- Incorporates the "fix II" from this issue
- Also changes the loop to use `can_read` from IO::Select, so that we only loop over descriptors that are actually ready for reading rather than wasting time calling sysread on a large number of descriptors that are not. This is a big performance improvement with large numbers of tasks, both when a small number of tasks produce most of the stderr and in the non-busy case.

reapchildren:
- Changes reapchildren to keep reaping children until none are ready to be reaped, instead of reaping at most one child per call.
- Fixes a bug where reapchildren did not check WIFEXITED, which could cause it to mistakenly mark tasks as exited when they had only stopped (possible on some platforms; not actually observed, but checking is best practice).

check_refresh_wanted:
- Fixes broken handling of "%proc" values, which were being used directly as "$jobstep" when the code actually needs "$_->{jobstep}" to get "$jobstep" from a "%proc" value.

https://github.com/wtsi-hgi/arvados/compare/hgi/7263-better-busy-behavior...wtsi-hgi:hgi/7263-even-better-busy-behavior
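The IO::Select change described above can be sketched in Python with the stdlib select module (a rough model, not the branch's actual code; `read_one` is a hypothetical per-descriptor callback): only descriptors the OS reports as ready get a read attempt, instead of every task's pipe on every pass.

```python
import select

def read_ready(readers, read_one, timeout=0.1):
    """Model of the IO::Select-based readfrompipes: ask the kernel which
    pipe descriptors are ready, and call read_one() only on those,
    rather than attempting a sysread on each of a large number of
    descriptors every pass. Returns how many reads made progress."""
    ready, _, _ = select.select(readers, [], [], timeout)
    got_some = 0
    for fd in ready:
        got_some += bool(read_one(fd))
    return got_some
```

With hundreds of tasks but only a few producing output, this turns each pass from O(tasks) failed read attempts into O(ready descriptors) successful ones.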

#23 Updated by Peter Amstutz almost 6 years ago

7263-better-busy-behavior and f9d8685 look good to me.

#24 Updated by Brett Smith almost 6 years ago

  • Status changed from New to In Progress

#25 Updated by Brett Smith almost 6 years ago

  • Target version changed from 2016-02-17 Sprint to 2016-03-02 sprint

#26 Updated by Brett Smith almost 6 years ago

  • Assigned To deleted (Tom Clegg)

#27 Updated by Tom Clegg almost 6 years ago

  • Assigned To set to Tom Clegg

#28 Updated by Tom Clegg almost 6 years ago

7263-better-busy-behavior LGTM. I've merged it into 8099-babysit-all-srun (#8099) which touched some of the same bits (including renaming $jobstep to $jobstepidx throughout to avoid some of the confusion that probably made us write these bugs in the first place) so I could try it all together on a test cluster.

#29 Updated by Tom Clegg almost 6 years ago

Now just one new commit 0f7709af on wtsi-hgi:hgi/7263-even-better-busy-behavior to review/merge, see #8575. The rest of that branch was merged into #8099, and from there into master.

#30 Updated by Tom Clegg almost 6 years ago

  • Status changed from In Progress to Resolved

#31 Updated by Tom Clegg almost 6 years ago

  • Status changed from Resolved to In Progress

#32 Updated by Tom Clegg almost 6 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 50 to 100

Applied in changeset arvados|commit:ec51cd72eabf0f0bbebf7cd979d1a23a1319a416.
