Idea #7965

closed

[API] Script to run on reboot that cleans transient state

Added by Brett Smith over 8 years ago. Updated over 8 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: API
Target version:
Start date: 12/08/2015
Due date:
Story points: 1.0

Description

Background

If the API server crashes or reboots while jobs are running, then when it starts up again the database still shows those jobs in Running state, but no crunch-dispatch or crunch-job process is paying attention to them.

Depending on the way slurm is set up, there might even be slurm jobs/steps still running -- but if the relevant crunch-job process does not exist, they can't succeed. (Presumably slurm will notice this eventually and clean it up by itself, but we might as well do it sooner.)

Unlike jobs, pipeline instances don't have in-process state so we shouldn't need to do anything special for them.

Proposed fix

Add a script at source:services/api/script/fail-jobs.rb that cleans up jobs left in Running state. It should use ActiveRecord etc. directly rather than making API calls, so that it can be run before the API server starts.

Cleanup includes:
  • If using slurm, run squeue to find any remaining jobs/steps named after the job UUID, and scancel them
  • Change state to Failed
  • Log a message in the logs table stating that the job was interrupted and failed due to a server reboot.
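
The squeue/scancel step in the list above could be sketched roughly as follows. This is only an illustration, not the script that was eventually committed: the helper names (slurm_ids_for, scancel_command) are hypothetical, and it assumes squeue is invoked as squeue --all --noheader --format='%i %j' so that each output line is "<slurm id> <job name>". Listing job steps as well would need a separate squeue call and is left out here.

```ruby
# Hypothetical helper: given the text printed by
#   squeue --all --noheader --format='%i %j'
# return the Slurm IDs of entries whose name contains the Arvados job UUID.
def slurm_ids_for(job_uuid, squeue_output)
  squeue_output.each_line.map do |line|
    slurm_id, name = line.strip.split(' ', 2)
    slurm_id if name && name.include?(job_uuid)
  end.compact
end

# Hypothetical helper: build the argv for scancel. Returned as an array so
# the caller can pass it to system(*argv) without going through a shell.
def scancel_command(slurm_ids)
  ['scancel', *slurm_ids]
end
```

A caller would feed the real squeue output to slurm_ids_for and, if the result is non-empty, run system(*scancel_command(ids)).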

The script should accept an optional --before argument, taking either a timestamp or the special value "reboot", which stands for the last boot time as reported by grep btime /proc/stat | cut -d" " -f2 or an equivalent. If this argument is used, jobs that were started after the given timestamp will be exempt from cleanup.
  • --before reboot
  • --before 'Tue Nov 24 10:27:17 EST 2015'
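
One possible way to handle the --before argument, as a sketch only. The function names (boot_time, parse_before, parse_options, eligible_for_cleanup?) are made up for illustration, and the last one assumes the job's start time is available as a Time.

```ruby
require 'optparse'
require 'time'

# Extract the boot time from /proc/stat-style text. The "btime" line holds
# the boot time in seconds since the epoch.
def boot_time(proc_stat_text)
  line = proc_stat_text.each_line.find { |l| l.start_with?('btime ') }
  raise ArgumentError, 'no btime line found' unless line
  Time.at(line.split[1].to_i)
end

# Turn the --before value into a Time: "reboot" means the last boot time,
# anything else is handed to Time.parse.
def parse_before(value, proc_stat_text = nil)
  if value == 'reboot'
    boot_time(proc_stat_text || File.read('/proc/stat'))
  else
    Time.parse(value)
  end
end

# Parse the command line; :before stays nil when the flag is absent.
def parse_options(argv)
  opts = { before: nil }
  OptionParser.new do |o|
    o.on('--before TIME') { |v| opts[:before] = parse_before(v) }
  end.parse!(argv.dup)
  opts
end

# Jobs started after the cutoff are exempt; with no cutoff, every job
# still in Running state is eligible.
def eligible_for_cleanup?(started_at, before)
  before.nil? || started_at < before
end
```

Reading /proc/stat directly avoids shelling out to grep and cut, at the cost of being Linux-specific, which the original one-liner already is.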

Extra

Revoke the API token used by the job. Currently (IIRC) the job↔jobtoken mapping exists only in crunch-dispatch's memory, so this might require using the "properties" field of API tokens to store the job UUID when initially creating the per-job token, and then searching unexpired tokens (created since the job was created) for the relevant UUID.

The script should accept an optional --retry flag. If it is given, jobs should be moved back to Queued state instead of being marked Failed. This involves bypassing validation with job.save(validate: false); it's probably best to call save(validate: false) after changing only the state, so validations still get a chance to check any other changes we're making. Scrub fields like progress/log/output, and delete all related JobTasks, before putting the job back in the queue, to avoid races if crunch-dispatch is already running. Note that this depends on the token revocation feature: otherwise a process started by the previous attempt might still be running somewhere, and still be able to update the job record after we have scrubbed and restarted it.
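
The ordering described for --retry might be sketched like this. It is an assumption-laden illustration: requeue is a hypothetical name, job is expected to behave like an ActiveRecord model (save! and save(validate: false)), the scrubbed attribute names are guesses based on the wording above, and deleting JobTasks is delegated to a caller-supplied callable rather than spelled out.

```ruby
# Hypothetical sketch of the --retry path.
#   job          - object with ActiveRecord-like save!/save and attribute writers
#   delete_tasks - callable that removes the job's related JobTask records
#   scrub_fields - attributes to clear (assumed names, adjust to the real model)
def requeue(job, delete_tasks:, scrub_fields: %i[log output])
  # 1. Scrub leftovers from the interrupted attempt and save with
  #    validations enabled, so they can still reject a bad change.
  scrub_fields.each { |field| job.public_send("#{field}=", nil) }
  job.save!

  # 2. Remove the old tasks before the job becomes visible in the queue.
  delete_tasks.call(job)

  # 3. Change only the state, and bypass validation for this one save.
  job.state = 'Queued'
  job.save(validate: false)
end
```

Keeping the unvalidated save as the last step, touching only the state, limits how much the script can change without the model's checks.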


Subtasks 2 (0 open, 2 closed)

Task #8053: refactor "cancel stale jobs" | Resolved | Tom Clegg | 12/08/2015
Task #8035: review 7965-fail-abandoned-jobs | Resolved | Peter Amstutz | 12/08/2015
Actions #1

Updated by Brett Smith over 8 years ago

  • Description updated (diff)
Actions #2

Updated by Brett Smith over 8 years ago

  • Description updated (diff)
Actions #3

Updated by Tom Clegg over 8 years ago

  • Description updated (diff)
  • Category set to API
Actions #4

Updated by Tom Clegg over 8 years ago

  • Description updated (diff)
Actions #5

Updated by Brett Smith over 8 years ago

  • Story points set to 1.0
Actions #6

Updated by Brett Smith over 8 years ago

  • Target version set to Arvados Future Sprints
Actions #7

Updated by Brett Smith over 8 years ago

  • Target version changed from Arvados Future Sprints to 2016-01-06 sprint
Actions #8

Updated by Tom Clegg over 8 years ago

  • Assigned To set to Tom Clegg
Actions #9

Updated by Tom Clegg over 8 years ago

  • Status changed from New to In Progress
Actions #10

Updated by Tom Clegg over 8 years ago

7965-fail-abandoned-jobs @ 5f35230

https://ci.curoverse.com/job/developer-test-job/76/ failed a websocket test but I'm assuming that's just a flaky test. :(

Actions #11

Updated by Tom Clegg over 8 years ago

  • Description updated (diff)
Actions #12

Updated by Peter Amstutz over 8 years ago

Looks good to me.

Actions #13

Updated by Tom Clegg over 8 years ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados|commit:378e6e0cd313541c395893e832e82a85856d5105.
