h1. Compute work lifecycle 

 This document gives an overview of the components involved in running compute jobs on Arvados, with a focus on helping you find and diagnose issues. 

 # crunch-dispatch.rb --pipelines 
   crunch-dispatch is a script in the API server package. When running with the --pipelines switch, it monitors pipeline instances that have been submitted to run on Arvados. As individual pipeline components become ready to run, it creates jobs to run them.
   If a pipeline instance is queued, but no jobs are spawned from it, that suggests there is no crunch-dispatch.rb process running with the --pipelines switch. (One way to list such pipeline instances is sketched under "Diagnostic sketches" at the end of this page.)
 # crunch-dispatch.rb --jobs (plus optionally Node Manager) 
   When running with the --jobs switch, crunch-dispatch monitors the job queue and tries to allocate compute nodes for queued jobs in SLURM using the salloc command. Once an allocation is made, it starts a crunch-job process to actually run the job.
   If a job is queued but never starts, it means that crunch-dispatch can't allocate nodes for it.
   * Check the job's runtime constraints. If the job requests hardware that is not available on compute nodes, crunch-dispatch will let it sit in the queue indefinitely until hardware becomes available. If that hardware will not become available anytime soon, you need to cancel the job and adjust the runtime constraints. (See the queued-jobs sketch at the end of this page for one way to inspect a queued job's runtime constraints.)
     If you're running Node Manager, check the logs for the string "<job UUID> not satisfiable". If you see that, it means the job's runtime constraints cannot be satisfied by any node size Node Manager is configured to use.
   * If the runtime constraints are fine, and you're running Node Manager, and the job has been queued for 15+ minutes, Node Manager's internal state is (regrettably) probably out of sync. Restart it, with SIGKILL if needed.
   * If you're not running Node Manager, or a restart doesn't fix it, check the compute node state in SLURM (see the sinfo sketch at the end of this page). Nodes may be in the DOWN state due to SLURM communication problems, etc.
 # crunch-job 
   Fundamentally, crunch-job is started by crunch-dispatch.rb --jobs, with information about the job to run and its node allocation from SLURM. crunch-job uses a series of srun commands to set up the compute nodes to do the work, and to actually run the script.
   Actual compute work (as opposed to prep work) is encapsulated in job tasks. crunch-job automatically creates a task 0 for each job to represent the start of work. Any job task can create more job tasks to do additional work; crunch-job keeps dispatching these in a loop until all of them have finished. Basic information about the status and number of tasks is recorded in the job's tasks_summary field. You can also look up each job task via the API; you'll probably want to filter on the job_uuid field (see the job-task sketch at the end of this page).
   If crunch-job has started running a job, but the job appears to be wedged, that's usually because of a SLURM hiccup. These come in a wide variety of flavors to suit every taste. Check the logs for the crunch-dispatch --jobs process that started this crunch-job: they will include every line of output coming from it. (Log lines not associated with any specific job are also worth checking; the logs give you no way to tell which child they came from.) You might also check the SLURM logs, or pull the job's captured output from the API (see the log-records sketch at the end of this page).
   Note that if crunch-job encounters an error that it considers to be a temporary failure, it will exit with a special exit code. crunch-dispatch --jobs uses this signal to try running the job again, up to a limit of three attempts. So it's possible for a job to go back up a level in this lifecycle and come down again a couple of times.
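
h2. Diagnostic sketches

The snippets below are rough, minimal sketches using the Arvados Python SDK (plus plain sinfo for SLURM). They assume ARVADOS_API_HOST and ARVADOS_API_TOKEN are set in the environment, and any state or field name not mentioned above is an assumption -- double-check them against your API server before relying on them.

If pipeline instances seem stuck without spawning jobs, one way to see what crunch-dispatch --pipelines should be picking up is to list the instances that have been submitted to run on the server. "RunningOnServer" is the state name assumed here for submitted instances:

<pre><code class="python">
import arvados

api = arvados.api('v1')

# Pipeline instances submitted to run on the server; crunch-dispatch
# --pipelines is responsible for turning their components into jobs.
# "RunningOnServer" is an assumed state name -- check it on your cluster.
resp = api.pipeline_instances().list(
    filters=[['state', '=', 'RunningOnServer']]).execute()
for pi in resp['items']:
    print(pi['uuid'], pi['state'], pi['name'])
</code></pre>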
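
The queued-jobs sketch: to see which jobs crunch-dispatch --jobs has not managed to allocate nodes for, and what hardware they are asking for, list queued jobs with their runtime_constraints. "Queued" is the job state name assumed here:

<pre><code class="python">
import arvados

api = arvados.api('v1')

# Queued jobs and their runtime constraints.  If a job requests hardware
# no compute node offers, crunch-dispatch will leave it here indefinitely.
resp = api.jobs().list(filters=[['state', '=', 'Queued']],
                       order='created_at asc').execute()
for job in resp['items']:
    print(job['uuid'], job['created_at'])
    print('  runtime_constraints:', job['runtime_constraints'])
</code></pre>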
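
The sinfo sketch: a quick way to spot SLURM nodes that crunch-dispatch can't allocate is to ask sinfo for nodes in unhealthy states, along with the reason SLURM recorded for each. This just shells out to sinfo; running the same command directly works equally well:

<pre><code class="python">
import subprocess

# Node-oriented listing of SLURM nodes in down/drain/fail states, with
# the reason SLURM recorded for each (%E).  Nodes stuck in DOWN are often
# the result of SLURM communication problems.
result = subprocess.run(
    ['sinfo', '--Node', '--states=down,drain,fail', '--format=%N %T %E'],
    capture_output=True, text=True, check=True)
print(result.stdout)
</code></pre>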
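
The job-task sketch: to follow a running job's progress, compare the job's tasks_summary with the individual job task records, filtering on job_uuid as described above. The job UUID below is a placeholder, and the sequence/success field names are assumptions about the job task schema:

<pre><code class="python">
import arvados

api = arvados.api('v1')
job_uuid = 'zzzzz-8i9sb-0123456789abcde'  # placeholder -- substitute your job's UUID

# The job's own summary of task counts...
job = api.jobs().get(uuid=job_uuid).execute()
print('tasks_summary:', job['tasks_summary'])

# ...and the individual task records behind it.
tasks = api.job_tasks().list(filters=[['job_uuid', '=', job_uuid]],
                             limit=1000).execute()
for task in tasks['items']:
    print(task['uuid'], 'sequence:', task['sequence'], 'success:', task['success'])
</code></pre>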
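
The log-records sketch: log output that crunch-dispatch captures for a job is also stored as log records on the API server, so you can pull a wedged job's output without hunting through the dispatcher's own log files. This assumes the usual object_uuid association for log records and a "text" property on each entry:

<pre><code class="python">
import arvados

api = arvados.api('v1')
job_uuid = 'zzzzz-8i9sb-0123456789abcde'  # placeholder -- substitute your job's UUID

# Log records the API server has stored for this job, oldest first.
logs = api.logs().list(filters=[['object_uuid', '=', job_uuid]],
                       order='created_at asc', limit=1000).execute()
for entry in logs['items']:
    # The captured text, when present, lives in the record's properties.
    print(entry['event_type'], entry['properties'].get('text', ''))
</code></pre>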