Bug #7263
closed[Crunch] crunch-job does not cancel jobs consistently
Description
Original bug report¶
c97qk-8i9sb-rjmjzrz54gf61os on c97qk was cancelled, but crunch-job doesn't notice and lets the job continue running.
We also saw this recently with a job on su92l.
Diagnosis I¶
The main loop of crunch-job has this code:
my $gotsome = readfrompipes () + reapchildren (); if (!$gotsome) { check_refresh_wanted(); check_squeue(); update_progress_stats(); select (undef, undef, undef, 0.1); }
check_refresh_wanted
is the function that will notice if the job has been canceled and the child processes signaled accordingly. If the cancel comes in while we're in this loop, we're unlikely to notice it as long as we keep receiving data from the child processes—which seems pretty likely now that we have crunchstat in the mix.
This diagnosis is not a 100% sure thing, because the pipes we're reading from are opened in nonblocking mode. So they would have to be very busy to keep avoiding this branch. But it's at least still possible, particularly since this problem seems to strike jobs that have many parallel tasks.
Diagnosis II¶
See Fix II below.
Fix I¶
Suggest checking refresh_trigger (and the other things) every ~two seconds, even during times when readfrompipes()
and reapchildren()
are keeping us 100% busy.
- if (!$gotsome) {
+ if (!$gotsome || $latest_refresh + 2 < scalar time) {
After this is merged and deployed, test whether you can successfully cancel massively parallel jobs. If it works consistently, I think we can mark this ticket resolved. Otherwise, we'll need to do another pass on this.
Fix II¶
In source:sdk/cli/bin/crunch-job, do not stay in this loop forever:while (0 < sysread ($reader{$job}, $buf, 8192)) { print STDERR $buf if $ENV{CRUNCH_DEBUG}; $jobstep[$job]->{stderr_at} = time; $jobstep[$job]->{stderr} .= $buf; preprocess_stderr ($job); if (length ($jobstep[$job]->{stderr}) > 16384) { substr ($jobstep[$job]->{stderr}, 0, 8192) = ""; } $gotsome = 1; }
- Consider raising 8192 -- perhaps 65536.
- Consider changing that "while" to an "if".
Updated by Brett Smith about 9 years ago
- Subject changed from [Crunch] crunch-job not cancelling job to [Crunch] crunch-job does not cancel jobs specifically
- Description updated (diff)
Updated by Brett Smith about 9 years ago
- Category set to Crunch
- Target version set to Arvados Future Sprints
Updated by Brett Smith about 9 years ago
- Subject changed from [Crunch] crunch-job does not cancel jobs specifically to [Crunch] crunch-job does not cancel jobs consistently
Updated by Tom Clegg about 9 years ago
Suggest checking refresh_trigger (and the other things) every ~two seconds, even during times when readfrompipes()
and reapchildren()
are keeping us 100% busy.
- if (!$gotsome) {
+ if (!$gotsome || $latest_refresh + 2 < scalar time) {
(I'm not confident this will fix the issue at hand, but it's a good idea anyway.)
Updated by Brett Smith about 9 years ago
- Description updated (diff)
- Status changed from New to In Progress
- Story points set to 0.5
Updated by Brett Smith about 9 years ago
- Status changed from In Progress to Feedback
- Assigned To set to Brett Smith
Updated by Brett Smith about 9 years ago
FWIW I canceled a few jobs recently and it worked as expected. I'd want to test with a job that has lots of parallel tasks before declaring victory though.
Updated by Peter Amstutz about 9 years ago
I should note that I just updated the API server Gemfile pin on arvados-cli, which previously was 3 months out of date. So the specific fix made for this ticket may or may not be in production yet.
Updated by Brett Smith about 9 years ago
Peter Amstutz wrote:
I should note that I just updated the API server Gemfile pin on arvados-cli, which previously was 3 months out of date. So the specific fix made for this ticket may or may not be in production yet.
Our clusters run crunch-dispatch with CRUNCH_JOB_BIN set to point to a copy of crunch-job from the arvados-src package. So it uses the version deployed through that package, regardless of what the API server's Gemfile says.
Updated by Brett Smith about 9 years ago
- Target version deleted (
Arvados Future Sprints)
Updated by Brett Smith almost 9 years ago
- Status changed from Feedback to New
We have a report from a user that a job on their cluster failed to cancel. Similar circumstances: no trouble canceling jobs with few tasks, but failed to cancel a job that had more tasks than available cores.
Updated by Joshua Randall almost 9 years ago
Actually, the number of tasks (400) was exactly equal to the number of cores on the allocated nodes (10 nodes with 40 cores each). I'm not running any compute on the head node, so all 40 cores there are available for the API server, crunch-dispatch, etc.
I manually changed the period for crunchstat from 10s to 300s and ran it again and this time it was able to cancel the job (although it took ~4m to completely finish being cancelled and free up the nodes).
Updated by Brett Smith almost 9 years ago
- Target version set to Arvados Future Sprints
Updated by Nico César almost 9 years ago
Just what happened today as an example:
API server got rebooted (so job dispacher) but slurm controller is running in manage.wx7k5 then the state ended up as:
compute0.wx7k5:/home/nico# sinfo PARTITION AVAIL TIMELIMIT NODES STATE NODELIST compute* up infinite 34 drain* compute[2-35] compute* up infinite 220 down* compute[36-255] compute* up infinite 2 alloc compute[0-1] crypto up infinite 34 drain* compute[2-35] crypto up infinite 220 down* compute[36-255] crypto up infinite 2 alloc compute[0-1] compute0.wx7k5:/home/nico# squeue_long JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 868 compute wx7k5-8i9sb-258pea56x155uqa crunch R 21:14:04 2 compute[0-1]
but no docker container running:
compute0.wx7k5:/home/nico# docker ps -a CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
the solution was
wx7k5:~# scancel 868
should we have a way to do an scancel from workbench?
Updated by Peter Amstutz almost 9 years ago
- Target version changed from Arvados Future Sprints to 2016-02-03 Sprint
Updated by Peter Amstutz almost 9 years ago
- Assigned To changed from Brett Smith to Peter Amstutz
Updated by Peter Amstutz almost 9 years ago
- Target version changed from 2016-02-03 Sprint to Arvados Future Sprints
Updated by Tom Clegg almost 9 years ago
- Assigned To changed from Peter Amstutz to Tom Clegg
Updated by Brett Smith almost 9 years ago
- Target version changed from Arvados Future Sprints to 2016-02-17 Sprint
0.5 points to implement fix II.
Updated by Joshua Randall almost 9 years ago
I've merged tomclegg's 7263-better-busy-behavior with the substantial changes I've made to our crunch-job for performance improvements to address this issue.
The fixes in my branch are:
readfrompipes:
- Incorporates the "fix II" from this issue
- Also changes the loop to use `can_read` from IO::Select so that that we only loop over descriptors that are actually ready for reading rather than wasting time calling sysread on a large number of descriptors that are not ready for reading - this is a big performance improvement with large numbers of tasks in a situation where a small number of tasks are producing most of the stderr and also in the non-busy situation.
reapchildren:
- changes behavior of reapchildren from reaping a single child to actually reaping children (i.e. continues to reap children until there are no children ready to be reaped).
- fixes bug where reapchildren did not check WIFEXITED, which could cause reapchildren to mistakenly mark tasks as exited that had only stopped (on some platforms; not actually observed, but this is best practice)
check_refresh_wanted:
- Fixes the broken handling of "%proc" values as "$jobstep"when actually one needs to access "$_->{jobstep}"to get "$jobstep" from "%proc" values.
Updated by Peter Amstutz almost 9 years ago
7263-better-busy-behavior and f9d8685 looks good to me.
Updated by Brett Smith almost 9 years ago
- Status changed from New to In Progress
Updated by Brett Smith almost 9 years ago
- Target version changed from 2016-02-17 Sprint to 2016-03-02 sprint
Updated by Tom Clegg over 8 years ago
7263-better-busy-behavior LGTM. I've merged it into 8099-babysit-all-srun (#8099) which touched some of the same bits (including renaming $jobstep
to $jobstepidx
throughout to avoid some of the confusion that probably made us write these bugs in the first place) so I could try it all together on a test cluster.
Updated by Tom Clegg over 8 years ago
Now just one new commit 0f7709af on wtsi-hgi:hgi/7263-even-better-busy-behavior to review/merge, see #8575. The rest of that branch was merged into #8099, and from there into master.
Updated by Tom Clegg over 8 years ago
- Status changed from In Progress to Resolved
Updated by Tom Clegg over 8 years ago
- Status changed from Resolved to In Progress
Updated by Tom Clegg over 8 years ago
- Status changed from In Progress to Resolved
- % Done changed from 50 to 100
Applied in changeset arvados|commit:ec51cd72eabf0f0bbebf7cd979d1a23a1319a416.