Compute work lifecycle » History » Version 1

Brett Smith, 01/15/2016 09:59 PM
first draft

h1. Compute work lifecycle
This document gives an overview of the components involved in running compute jobs on Arvados, with a focus on helping you find and diagnose issues.
# crunch-dispatch.rb --pipelines
  crunch-dispatch is a script in the API server package.  When running with the --pipelines switch, it monitors pipeline instances that have been submitted to run on Arvados.  As each component becomes runnable, it creates a job to run it.
  If a pipeline instance is queued, but no jobs are spawned from it, that suggests there is no crunch-dispatch.rb process running with the --pipelines switch.
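  A quick way to check for that is to look for the dispatcher in the process table.  This is a sketch: the pgrep pattern is an assumption about how the process's command line appears on your system.

```shell
# Check whether a crunch-dispatch.rb process is running with --pipelines.
# pgrep -f matches against the full command line; the pattern below is an
# assumption about how the dispatcher appears in your process table.
if pgrep -f 'crunch-dispatch.*--pipelines' >/dev/null 2>&1; then
    dispatcher_status="running"
else
    dispatcher_status="not found"
fi
echo "pipeline dispatcher: $dispatcher_status"
```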
# crunch-dispatch.rb --jobs (plus optionally Node Manager)
  When running with the --jobs switch, crunch-dispatch monitors the job queue, trying to allocate compute nodes for queued jobs in SLURM using the salloc command.  Once an allocation is made, it starts a crunch-job process to actually run the job.
  If a job is queued, but never starts, it means that crunch-dispatch can't allocate nodes for it.
  * Check the job's runtime constraints.  If the job requests hardware that is not available on compute nodes, crunch-dispatch will leave it sitting in the queue until matching hardware becomes available, possibly forever.  If that hardware will not become available anytime soon, you need to cancel the job and adjust the runtime constraints.
    If you're running Node Manager, check the logs for the string "<job UUID> not satisfiable".  If you see that, it means the job's runtime constraints cannot be satisfied with any node size Node Manager is configured to use.
  * If the runtime constraints are fine, and you're running Node Manager, and the job has been queued 15+ minutes, Node Manager's internal state is (regrettably) probably out of sync.  Restart it, with SIGKILL if needed.
  * If you're not running Node Manager, or a restart doesn't fix it, check the compute node state in SLURM.  Nodes may be in the DOWN state due to SLURM communication problems, etc.
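  The log and node-state checks above can be sketched as shell commands.  sinfo and scontrol are standard SLURM tools; the Node Manager log path and the node name are assumptions, so adjust them for your installation.

```shell
# Inspect SLURM node state.  sinfo -R lists only down/drained/failed nodes,
# along with the reason SLURM recorded for each.
if command -v sinfo >/dev/null 2>&1; then
    sinfo -R || true
    node_report="see sinfo output above"
else
    node_report="sinfo not found; is SLURM installed on this host?"
fi
echo "$node_report"

# Once a node is healthy again but still marked DOWN, an admin can clear it:
#   scontrol update NodeName=compute0 State=RESUME   # "compute0" is an example

# If you run Node Manager, search its log for unsatisfiable jobs.
nm_log=/var/log/arvados/node-manager.log             # assumed path; adjust
if [ -r "$nm_log" ]; then
    grep 'not satisfiable' "$nm_log" || echo "no unsatisfiable jobs logged"
else
    echo "no Node Manager log at $nm_log"
fi
```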
# crunch-job
  Fundamentally, crunch-job is started by crunch-dispatch.rb --jobs, with information about the job to run and its node allocation from SLURM.  crunch-job uses a series of srun commands to set up the compute nodes to do the work, and to actually run the script.
  If crunch-job has started running a job, but the job appears to be wedged, that's usually because of a SLURM hiccup.  These come in a wide variety of flavors to suit every taste.  Check the logs for the crunch-dispatch --jobs process that started this crunch-job: they include every line of output coming from it.  (Log lines not associated with any specific job are also suspect; the logs give you no way to know which child process they came from.)  You might also check the SLURM logs.
  Note that if crunch-job encounters an error that it considers a temporary failure, it exits with a special exit code.  crunch-dispatch --jobs uses this signal to keep trying to run the job, up to a limit of three attempts.  So it's possible to go back up a level in this page, and come down again, a couple of times.
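  The retry loop can be sketched as follows.  The temporary-failure exit code value (75, sysexits EX_TEMPFAIL) and the run_job stand-in are illustrative assumptions; check your crunch-job version for the code it actually uses.

```shell
# Sketch of the retry behavior described above: rerun the job while it
# signals "temporary failure" via a distinguished exit code, giving up
# after three attempts.  TEMPFAIL=75 is an assumption for illustration.
TEMPFAIL=75

run_job() {
    # Stand-in for launching crunch-job; it always reports a temporary
    # failure here, so this sketch exercises the full retry loop.
    echo "attempt $1: temporary failure"
    return "$TEMPFAIL"
}

attempts=0
while [ "$attempts" -lt 3 ]; do
    attempts=$((attempts + 1))
    run_job "$attempts" && status=0 || status=$?
    if [ "$status" -ne "$TEMPFAIL" ]; then
        break    # success or a permanent failure: stop retrying
    fi
done
echo "stopped after $attempts attempts (last exit code $status)"
```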