Idea #7965
closed
[API] Script to run on reboot that cleans transient state
Description
Background
If the API server crashes or reboots while jobs are running, then when it starts up again the database indicates the jobs are still in Running state, but no crunch-dispatch or crunch-job process is paying attention to them.
Depending on the way slurm is set up, there might even be slurm jobs/steps still running -- but if the relevant crunch-job process does not exist, they can't succeed. (Presumably slurm will notice this eventually and clean it up by itself, but we might as well do it sooner.)
Unlike jobs, pipeline instances don't have in-process state so we shouldn't need to do anything special for them.
Proposed fix
Add a script in source:services/api/script/fail-jobs.rb that cleans up jobs that have been left in Running state. It should use ActiveRecord etc. directly rather than making API calls, so that it can be run before starting the API server.
Cleanup includes:
- If using slurm, run squeue to find any remaining jobs/steps named after the job UUID, and scancel them
- Change state to Failed
- Log a message in the logs table stating that the job was interrupted and failed due to a server reboot.
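The slurm part of the cleanup steps above could be sketched roughly as follows. This is only an illustration, not the script itself: the helper names (slurm_ids_for, scancel_cmds, interrupted_message) are hypothetical, it assumes the slurm job name is exactly the job UUID, and it only covers jobs (listing steps would additionally need squeue's step mode). The state change and the logs-table entry would be done through ActiveRecord and are left out here.

```ruby
# Sketch only: pure helpers, so this file loads without ActiveRecord or slurm.

# Lists slurm jobs as "<id> <name>" lines with no header row.
SQUEUE_CMD = ['squeue', '--noheader', '--format=%i %j'].freeze

# Return the slurm job IDs whose job name equals the given job UUID.
# Assumption: jobs are named exactly after the UUID.
def slurm_ids_for(job_uuid, squeue_output)
  squeue_output.each_line.map do |line|
    id, name = line.strip.split(' ', 2)
    id if name == job_uuid
  end.compact
end

# Build one scancel command (as an argv array) per slurm job ID.
def scancel_cmds(ids)
  ids.map { |id| ['scancel', id] }
end

# Text for the logs-table entry (wording is illustrative).
def interrupted_message(job_uuid)
  "Job #{job_uuid} was interrupted and failed due to a server reboot."
end
```

In a real run the squeue output would come from something like IO.popen(SQUEUE_CMD, &:read), and each command returned by scancel_cmds would be passed to system(*cmd).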
The script should accept a --before argument, with a timestamp or the special value "reboot", meaning grep btime /proc/stat | cut -d" " -f2 or an equivalent way of determining the last reboot time. If this argument is used, jobs that were started after the given timestamp will be exempt from cleanup. Examples:
--before reboot
--before 'Tue Nov 24 10:27:17 EST 2015'
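One possible way to interpret the --before value, as a minimal sketch. The function names (boot_time, parse_before, cleanup_candidate?) are hypothetical, and it assumes each job's start time is available as a Ruby Time. The "reboot" case reads the btime line from /proc/stat, which is the equivalent of the grep/cut pipeline above.

```ruby
require 'time'

# Extract the boot time from the text of /proc/stat ("btime <epoch seconds>").
def boot_time(proc_stat_text)
  line = proc_stat_text.each_line.find { |l| l.start_with?('btime ') }
  raise ArgumentError, 'no btime line found' unless line
  Time.at(Integer(line.split[1]))
end

# Turn the --before value into a Time. proc_stat_text can be passed in
# explicitly (e.g. for testing); otherwise /proc/stat is read.
def parse_before(value, proc_stat_text = nil)
  if value == 'reboot'
    boot_time(proc_stat_text || File.read('/proc/stat'))
  else
    Time.parse(value)
  end
end

# Jobs started after the cutoff are exempt; with no cutoff, nothing is exempt.
def cleanup_candidate?(started_at, cutoff)
  cutoff.nil? || started_at <= cutoff
end
```

Passing the /proc/stat text as a parameter keeps the parsing testable on machines without /proc.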
Extra
Revoke the API token used by the job. Currently (IIRC) the job↔jobtoken mapping exists only in crunch-dispatch's memory, so this might require using the "properties" field of API tokens to store the job UUID when initially creating the per-job token, and then searching unexpired tokens (created since the job was created) for the relevant UUID.
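The token search described above might look something like this sketch. Everything here is an assumption for illustration: the Token struct stands in for the real token model, the 'job_uuid' property key is hypothetical, and revoking by setting the expiry time to now is just one possible approach.

```ruby
# Stand-in for the real API token record (hypothetical shape).
Token = Struct.new(:created_at, :expires_at, :properties)

# Find unexpired tokens, created since the job was created, whose
# properties carry the given job UUID (hypothetical 'job_uuid' key).
def tokens_for_job(tokens, job_uuid, job_created_at, now = Time.now)
  tokens.select do |t|
    unexpired = t.expires_at.nil? || t.expires_at > now
    unexpired &&
      t.created_at >= job_created_at &&
      t.properties['job_uuid'] == job_uuid
  end
end

# Revoke a token by expiring it immediately (assumed revocation method).
def revoke!(token, now = Time.now)
  token.expires_at = now
end
```

Restricting the search to tokens created since the job was created keeps the scan small and avoids matching stale tokens.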
The script should accept an optional --retry flag. If this is given, jobs should be moved back to Queued state instead of being marked Failed (this will involve bypassing validation using job.save(validate: false); it's probably best to call save(validate: false) after changing only the state, so validations still get a chance to check any other changes we're making). Scrub fields like progress/log/output, and delete all related JobTasks, before putting the job back in the queue (i.e., avoid races if crunch-dispatch is already running). Note that this depends on the token revocation feature; otherwise a process started by the previous attempt might still be running somewhere, and still be able to update the job record after we scrubbed and restarted it.