Idea #7965
Updated by Tom Clegg almost 9 years ago
h2. Background

If the API server crashes or reboots while jobs are running, then when it starts up again the database still shows those jobs in Running state, but no crunch-dispatch or crunch-job process is paying attention to them.

Depending on how slurm is set up, there might even be slurm jobs/steps still running, but if the relevant crunch-job process does not exist, they can't succeed. (Presumably slurm will notice this eventually and clean it up by itself, but we might as well do it sooner.)

Unlike jobs, pipeline instances don't have in-process state, so we shouldn't need to do anything special for them.

h2. Proposed fix

Add a script in source:services/api/script/fail-jobs.rb that cleans up jobs that have been left in Running state. It should use ActiveRecord etc. directly rather than making API calls: it should be possible to run it before starting the API server.

Cleanup includes:
* If using slurm, run squeue to find any remaining jobs/steps named after the job UUID, and scancel them
* Change state to Failed
* Log a message in the logs table stating that the job was interrupted and failed due to a server reboot.

The script should accept an optional @--created-before@ argument, with a timestamp or the special value "reboot" meaning @grep btime /proc/stat | cut -d" " -f2@ or an equivalent way of determining last reboot time. If this argument is used, jobs that were created _after_ the given timestamp will be _exempt_ from cleanup.
* @--before reboot@
* @--before 'Tue Nov 24 10:27:17 EST 2015'@

h2. Extra

Revoke the API token used by the job. Currently (IIRC) the job↔jobtoken mapping exists only in crunch-dispatch's memory, so this might require using the "properties" field of API tokens to store the job UUID when initially creating the per-job token, and then searching unexpired tokens (created since the job was created) for the relevant UUID.
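As a rough illustration of the cutoff argument from the proposed fix, the sketch below turns either the special value "reboot" or a timestamp string into a @Time@ and applies the "created after the cutoff is exempt" rule. The method names (@boot_time@, @parse_cutoff@, @exempt?@) and the @stat_text@ parameter are made up for this example and are not part of any existing script; the only grounded detail is that the @btime@ line of @/proc/stat@ holds the boot time in seconds since the epoch.

```ruby
require "time"

# Extract the boot time from /proc/stat-style text.
# The "btime" line holds the boot time in seconds since the epoch.
def boot_time(stat_text)
  line = stat_text.each_line.find { |l| l.start_with?("btime ") }
  raise ArgumentError, "no btime line found" if line.nil?
  Time.at(Integer(line.split[1]))
end

# Turn the argument value into a Time cutoff.
# stat_text (hypothetical parameter) can be injected so the "reboot"
# case can be exercised without reading the real /proc/stat.
def parse_cutoff(value, stat_text: nil)
  if value == "reboot"
    boot_time(stat_text || File.read("/proc/stat"))
  else
    Time.parse(value)
  end
end

# A job is exempt from cleanup if it was created after the cutoff.
def exempt?(job_created_at, cutoff)
  job_created_at > cutoff
end
```

In a real script the cutoff would presumably feed an ActiveRecord query on the job's creation time rather than a per-record check; the per-record form is used here only to keep the example self-contained.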
The script should accept an optional @--retry@ flag. If this is given, jobs should be moved back to Queued state instead of being cancelled. Scrub fields like progress/log/output, and delete all related JobTasks before updating the job. Note that this depends on the token revocation feature; otherwise a process started by the previous attempt might still be running somewhere, and still be able to update the job record after we have scrubbed and restarted it.
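The ordering described for @--retry@ (delete related tasks, scrub the job's fields, then move it back to Queued) can be sketched with plain Ruby objects. The @Job@ and @JobTask@ structs, their field names, and the @requeue!@ method are hypothetical stand-ins for the real ActiveRecord models, and the sketch does not cover token revocation, which the approach depends on.

```ruby
# Hypothetical in-memory stand-ins for the real Job and JobTask records.
Job = Struct.new(:uuid, :state, :progress, :log, :output)
JobTask = Struct.new(:job_uuid)

# Reset one interrupted job so it can be dispatched again.
# Mutates both the job and the task list.
def requeue!(job, tasks)
  # Delete related tasks first, then scrub and update the job itself.
  tasks.reject! { |task| task.job_uuid == job.uuid }
  job.progress = nil
  job.log = nil
  job.output = nil
  job.state = "Queued"
  job
end
```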