Story #7965

[API] Script to run on reboot that cleans transient state

Added by Brett Smith almost 5 years ago. Updated almost 5 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: API
Target version:
Start date: 12/08/2015
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: 1.0

Description

Background

If the API server crashes or reboots while jobs are running, then when it starts up again the database still shows those jobs in the Running state, but no crunch-dispatch or crunch-job process is paying attention to them.

Depending on the way slurm is set up, there might even be slurm jobs/steps still running -- but if the relevant crunch-job process does not exist, they can't succeed. (Presumably slurm will notice this eventually and clean it up by itself, but we might as well do it sooner.)

Unlike jobs, pipeline instances don't have in-process state so we shouldn't need to do anything special for them.

Proposed fix

Add a script in source:services/api/script/fail-jobs.rb that cleans up jobs that have been left in the Running state. It should use ActiveRecord etc. directly rather than making API calls, so that it can be run before the API server is started.

Cleanup includes:
  • If using slurm, run squeue to find any remaining jobs/steps named after the job UUID, and scancel them
  • Change state to Failed
  • Log a message in the logs table stating that the job was interrupted and failed due to a server reboot.
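The slurm part of the cleanup can be sketched as follows. This is only an illustration, not the code that was merged: the helper names are hypothetical, and it assumes the squeue output was produced with something like squeue --noheader --format='%i %j' (ID first, then name) and that slurm jobs/steps are named exactly after the job UUID.

```ruby
# Hypothetical helper: given the job UUID and the text printed by
#   squeue --noheader --format='%i %j'
# return the slurm IDs whose name equals the job UUID.
def slurm_ids_for(job_uuid, squeue_output)
  squeue_output.each_line.map do |line|
    id, name = line.strip.split(' ', 2)
    id if name == job_uuid
  end.compact
end

# Hypothetical helper: build (but do not run) one scancel command per ID.
# The caller could pass each array to Kernel#system.
def scancel_commands(ids)
  ids.map { |id| ['scancel', id] }
end
```

Keeping the parsing separate from the actual squeue/scancel invocation makes it possible to exercise this logic on a machine where slurm is not installed.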
The script should accept an optional --before argument. Its value is either a timestamp or the special value "reboot", which means the last boot time as reported by grep btime /proc/stat | cut -d" " -f2 (or an equivalent way of determining it). If this argument is given, jobs that were started after the given timestamp are exempt from cleanup. Examples:
  • --before reboot
  • --before 'Tue Nov 24 10:27:17 EST 2015'
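A minimal sketch of how the --before value might be interpreted, using only the Ruby standard library. The function names are hypothetical and this is not necessarily how fail-jobs.rb does it; it assumes the btime line in /proc/stat holds the boot time in seconds since the epoch, which is what the grep/cut pipeline above extracts.

```ruby
require 'time'

# Extract the boot time from /proc/stat-style text.
def boot_time(proc_stat_text)
  line = proc_stat_text.each_line.find { |l| l.start_with?('btime ') }
  raise ArgumentError, 'no btime line found' unless line
  Time.at(Integer(line.split[1]))
end

# Turn the --before value into a Time. The optional second argument lets
# callers supply /proc/stat contents directly (useful for testing).
def parse_before(value, proc_stat_text = nil)
  if value == 'reboot'
    boot_time(proc_stat_text || File.read('/proc/stat'))
  else
    Time.parse(value)
  end
end

# A job is exempt from cleanup if --before was given and the job
# started after that time.
def exempt?(started_at, before)
  !before.nil? && started_at > before
end
```

With no --before argument (before is nil), no job is exempt, which matches the default behaviour of cleaning up every job left in the Running state.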

Extra

Revoke the API token used by the job. Currently (IIRC) the job↔jobtoken mapping exists only in crunch-dispatch's memory, so this might require using the "properties" field of API tokens to store the job UUID when initially creating the per-job token, and then searching unexpired tokens (created since the job was created) for the relevant UUID.
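The token search described above might look roughly like this. It is a sketch under stated assumptions: a plain Struct stands in for the API token model, and the 'job_uuid' properties key is a hypothetical name, since the ticket does not say what key would be used.

```ruby
# Stand-in for an API token record (hypothetical; not the real model).
StubToken = Struct.new(:properties, :created_at, :expires_at)

# Select unexpired tokens created since the job was created whose
# properties mention the job UUID under the assumed 'job_uuid' key.
def tokens_for_job(tokens, job_uuid, job_created_at, now = Time.now)
  tokens.select do |t|
    unexpired = t.expires_at.nil? || t.expires_at > now
    unexpired &&
      t.created_at >= job_created_at &&
      t.properties['job_uuid'] == job_uuid
  end
end
```

Each token returned would then be revoked, for example by setting its expiry to the current time.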

The script should accept an optional --retry flag. If it is given, jobs should be moved back to the Queued state instead of being marked Failed. This involves bypassing validation with job.save(validate: false); it is probably best to call save(validate: false) after changing only the state, so that validations still get a chance to check any other changes we make. Before putting the job back in the queue, scrub fields like progress/log/output and delete all related JobTasks, to avoid races if crunch-dispatch is already running. Note that this depends on the token revocation feature: otherwise, a process started by the previous attempt might still be running somewhere, and still be able to update the job record after we have scrubbed and restarted it.
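The ordering that --retry calls for (scrub first, change state last) can be illustrated with a plain Struct standing in for the ActiveRecord Job model. The field names are assumptions taken from the "progress/log/output" wording above, and the real script would persist these changes through ActiveRecord rather than mutate an in-memory object.

```ruby
# Stand-in for the Job model (hypothetical; field names are assumed).
StubJob = Struct.new(:uuid, :state, :progress, :log, :output, :tasks)

# Scrub leftovers from the interrupted attempt, then requeue.
def requeue(job)
  job.tasks = []        # stands in for deleting the related JobTasks
  job.progress = nil
  job.log = nil
  job.output = nil
  # In the real script these changes would be saved with normal
  # validation first; only the state change below would then be
  # saved with save(validate: false).
  job.state = 'Queued'  # done last, so the job is already clean when queued
  job
end
```

Changing the state as the final step means that a crunch-dispatch process that is already running never sees a Queued job that still carries output or tasks from the earlier attempt.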


Subtasks

Task #8053: refactor "cancel stale jobs" (Resolved, Tom Clegg)

Task #8035: review 7965-fail-abandoned-jobs (Resolved, Peter Amstutz)

Associated revisions

Revision 378e6e0c
Added by Tom Clegg almost 5 years ago

Merge branch '7965-fail-abandoned-jobs' closes #7965

History

#1 Updated by Brett Smith almost 5 years ago

  • Description updated (diff)

#2 Updated by Brett Smith almost 5 years ago

  • Description updated (diff)

#3 Updated by Tom Clegg almost 5 years ago

  • Description updated (diff)
  • Category set to API

#4 Updated by Tom Clegg almost 5 years ago

  • Description updated (diff)

#5 Updated by Brett Smith almost 5 years ago

  • Story points set to 1.0

#6 Updated by Brett Smith almost 5 years ago

  • Target version set to Arvados Future Sprints

#7 Updated by Brett Smith almost 5 years ago

  • Target version changed from Arvados Future Sprints to 2016-01-06 sprint

#8 Updated by Tom Clegg almost 5 years ago

  • Assigned To set to Tom Clegg

#9 Updated by Tom Clegg almost 5 years ago

  • Status changed from New to In Progress

#10 Updated by Tom Clegg almost 5 years ago

7965-fail-abandoned-jobs @ 5f35230

https://ci.curoverse.com/job/developer-test-job/76/ failed a websocket test but I'm assuming that's just a flaky test. :(

#11 Updated by Tom Clegg almost 5 years ago

  • Description updated (diff)

#12 Updated by Peter Amstutz almost 5 years ago

Looks good to me.

#13 Updated by Tom Clegg almost 5 years ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados|commit:378e6e0cd313541c395893e832e82a85856d5105.
