Idea #7965

Updated by Tom Clegg over 8 years ago

h2. Background 

If the API server crashes or reboots while jobs are running, the database will still show those jobs in Running state when it starts up again, but no crunch-dispatch or crunch-job process is paying attention to them.

Depending on how slurm is set up, there might even be slurm jobs/steps still running -- but if the relevant crunch-job process no longer exists, they can't succeed. (Presumably slurm will notice this eventually and clean it up by itself, but we might as well do it sooner.)

Unlike jobs, pipeline instances have no in-process state, so we shouldn't need to do anything special for them.

h2. Proposed fix

Add a script in source:services/api/script/fail-jobs.rb that cleans up jobs that have been left in Running state. It should use ActiveRecord etc. directly rather than making API calls, so it can be run before starting the API server.

Cleanup includes:
* If using slurm, run squeue to find any remaining jobs/steps named after the job UUID, and scancel them
* Change state to Failed
* Log a message in the logs table stating that the job was interrupted and failed due to a server reboot.
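The slurm half of that cleanup could be sketched roughly as below. This is illustrative only: the method name @cancel_slurm_entries@ and the injectable @runner@ are my own, and it assumes slurm entries are named exactly after the job UUID (the runner is injectable so the matching logic can be tested without slurm installed).

```ruby
# Find remaining slurm jobs/steps named after the job UUID and scancel them.
# runner defaults to running the command in a shell; tests can inject a stub.
def cancel_slurm_entries(job_uuid, runner: ->(cmd) { `#{cmd}` })
  # squeue --noheader --format=%j prints one job name per line.
  names = runner.call("squeue --noheader --format=%j").split("\n").map(&:strip)
  cancelled = []
  names.each do |name|
    next unless name == job_uuid
    runner.call("scancel --name=#{name}")
    cancelled << name
  end
  cancelled
end
```

The real script would run this step before touching the database record, so nothing is still executing when the state flips to Failed.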

The script should accept an optional @--before@ argument, taking a timestamp or the special value "reboot", which means the last boot time as given by @grep btime /proc/stat | cut -d" " -f2@ or an equivalent. If this argument is used, jobs that were created _after_ the given timestamp will be _exempt_ from cleanup.
* @--before reboot@
* @--before 'Tue Nov 24 10:27:17 EST 2015'@
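Resolving the "reboot" value amounts to reading the @btime@ field (boot time, seconds since the epoch) from @/proc/stat@. A minimal sketch, with the file contents injectable for testing; the method name is illustrative:

```ruby
# Return the last reboot time as a Time, from the btime line of /proc/stat.
def last_reboot_time(stat_text = File.read("/proc/stat"))
  line = stat_text.each_line.find { |l| l.start_with?("btime ") }
  raise "no btime line found in /proc/stat" unless line
  Time.at(line.split[1].to_i)
end
```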

h2. Extra

Revoke the API token used by the job. Currently (IIRC) the job↔jobtoken mapping exists only in crunch-dispatch's memory, so this might require using the "properties" field of API tokens to store the job UUID when initially creating the per-job token, and then searching unexpired tokens (created since the job was created) for the relevant UUID.
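The token search described above could be sketched as the filter below. Assumptions are mine: tokens and the job are plain hashes here (in the real script they would be ActiveRecord rows), and the properties key is called @job_uuid@.

```ruby
# Select candidate per-job tokens: created since the job was created,
# not yet expired, and tagged with the job's UUID in their properties.
def tokens_for_job(tokens, job, now: Time.now)
  tokens.select do |t|
    t[:created_at] >= job[:created_at] &&
      (t[:expires_at].nil? || t[:expires_at] > now) &&
      (t[:properties] || {})["job_uuid"] == job[:uuid]
  end
end
```

Each token this returns would then have its expiry set in the past, so any still-running crunch-job process loses API access.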

The script should accept an optional @--retry@ flag. If this is given, jobs should be moved back to Queued state instead of being cancelled. (This will involve bypassing validation using @job.save(validate: false)@; it's probably best to call @save(validate: false)@ after changing _only_ the state, so validations still get a chance to check any other changes we're making.) Scrub fields like progress/log/output, and delete all related JobTasks, _before_ putting the job back in the queue (i.e., avoid races if crunch-dispatch is already running). Note this depends on the token revocation feature; otherwise a process started by the previous attempt might still be running somewhere, and still able to update the job record after we scrub and restart it.
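The scrub-before-requeue step can be shown on a plain hash. A sketch under stated assumptions: the real script would assign these attributes on the Job record and call @save(validate: false)@, and the exact list of fields to scrub (the ticket says "fields like progress/log/output") may be longer.

```ruby
# Attributes to reset when requeueing an interrupted job (--retry).
def requeue_attrs(job)
  job.merge(
    "state"    => "Queued",
    "progress" => nil,
    "log"      => nil,
    "output"   => nil)
end
```

Deleting the related JobTasks would happen in the same transaction, before this update lands, so crunch-dispatch never sees a Queued job with leftover tasks.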