Revision 1 (Tom Clegg, 06/11/2015 07:36 PM) → Revision 2/3 (Tom Clegg, 07/01/2015 05:47 PM)
{{>TOC}}

h1. Crunch1-in-Crunch2

(DRAFT)

Detail about how Crunch2 runs jobs that were written for Crunch1.

See
* Crunch2 [[Jobs API]]
* [[Crunch2 migration]]

h2. Background

In order for Crunch2 to replace Crunch1, Crunch2 must:
* run jobs that rely on Crunch1's API. Examples:
** run-command
** arv-run (via run-command)
** existing tutorial/example jobs
** user scripts based on existing tutorials
* accept job submissions from clients using the Crunch1 API, like
** arv-run-pipeline-instance
** user scripts
* maintain the ability to view progress of Crunch1 jobs using Crunch1 clients, like
** Workbench
** arv-run-pipeline-instance

h2. Requirements

Crunch1 jobs rely on the following pieces:
* Keep mount available within the container
* Some environment variables (CRUNCH_SRC, ARVADOS_API_*, etc)
* jobs and job_tasks APIs for executing work on multiple nodes

h2. Approach

h3. Submitting a job

Translate the incoming Crunch1 job submission to a Crunch2 job request.
* The container/command given in the job request are determined by the server configuration. The Crunch1 API doesn't specify [which version of] crunch-job is to be used.

Create the job request using the JobRequests controller.

Create a job record just as before, but set a flag so crunch-dispatch doesn't try to run it. (This could be implemented as a "Proxy" state.)

h3. Running a job

Once it has been translated to a job request, a Crunch1 job is merely a Crunch2 job (the "parent") which acts as any "workflow runner" would: it submits additional job requests of its own (the "children").

Its notable difference is that it uses an additional communication channel not normally used by Crunch2 jobs:
* The children perform Arvados API requests (jobs.get, job_tasks.get, job_tasks.update, and job_tasks.create) to get information about themselves and to ask the parent to submit more job requests.
* The parent performs Arvados API requests (presumably job_tasks.list and job_tasks.get) to get the information submitted by the children.

The Crunch1 runner implements the same algorithm as crunch-job, but with a few simplifying restrictions.
* It has only one way to run tasks: submit a jobrequest[1].
* It doesn't construct docker command lines, or run docker itself: instead, it writes Crunch2 job requests.
* It doesn't retry tasks. Crunch2 is responsible for this.
* It doesn't look for node failure. Ditto Crunch2.
* It doesn't copy stderr to Keep. Ditto Crunch2.
* It doesn't know anything about slurm.

With all that stuff removed, the Crunch1 runner algorithm reduces to something like this:
* Submit a job_request for "task 0".
* When the assigned job succeeds, look for new job_tasks that it submitted. Add these to a list of "pending" tasks.
* Take min(sequence) across all pending job_tasks. Take the job_tasks with that sequence out of "pending", translate them, and submit them as job_requests.
* Repeat until all submitted job_requests have been assigned and finished, and "pending" is empty.
* Collate task outputs into a job output.

TBD:
* If a child job (formerly "job_task") sets the parent job's (formerly "job's") output attribute, it cannot be reused to fulfill a future job request. Either this should be handled transparently, or this use case should be prohibited (at the cost of breaking some Crunch1 jobs).
* If a child job reads the parent job record (which is nearly universal among Crunch1 jobs) it cannot be reused to fulfill a future job request _except_ where the future job request would return the same values. This could be ensured by copying the parent job's crunch1 job record into the crunch2 job request's inputs -- however, this would effectively prevent _any_ crunch1 job from reusing tasks across non-identical jobs.

fn1. This means "local dev jobs" will require a dev/transient install of the Crunch infrastructure. This is probably a good thing overall, but does mean we need to do the work of making the transient infrastructure spring up quickly and easily.

h3. Getting job status

Clients must be able to get the current status of a Crunch1 job (i.e., one that was submitted with the Crunch1 API) by using the Crunch1 "list" and "get" APIs. This is necessary for existing clients (including Crunch1 jobs themselves) to continue working without modifications after Crunch2 has replaced Crunch1 as the execution engine.

Clients must be able to get job status for both Crunch1- and Crunch2-submitted jobs using only the Crunch2 "list" and "get" APIs. This makes it possible to migrate Workbench from Crunch1 to Crunch2 without losing the ability to see old jobs.

However, it is _not_ necessary for Crunch1 clients to see Crunch2 jobs.

When a job is/was executed by Crunch2, the Crunch2 API is the source of truth about its state. Therefore:
* Crunch1 APIs that modify a job must also modify the corresponding Crunch2 record(s). This might be the empty set, though: crunch-job's replacement will use Crunch2 directly rather than using Crunch1's jobs.update API to update job output/progress, for example.
* Crunch1 APIs that retrieve a job must read the Crunch2 record.

h3. job_tasks APIs

The job_tasks APIs are used by Crunch1 jobs to communicate between crunch-job and the processes it runs on allocated nodes. The API server doesn't need to touch these.
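
To make the use of job_tasks by the Crunch1 runner (see "Running a job") more concrete, here is a rough Python sketch of the reduced runner loop. It is only an illustration under several assumptions: @JobTask@, @run_crunch1_job@ and @submit_and_wait@ are hypothetical names, not part of any Arvados SDK; the real runner would call the Arvados API rather than an in-memory function; and tasks in one sequence level are run one at a time here, whereas a real runner would presumably submit them together and wait for all of them.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class JobTask:
    """Hypothetical stand-in for a Crunch1 job_task record."""
    name: str
    sequence: int
    output: str = ""


# Hypothetical stand-in for "submit a Crunch2 job request, wait for the
# assigned job to succeed, then list the job_tasks that the child created".
SubmitFn = Callable[[JobTask], List[JobTask]]


def run_crunch1_job(submit_and_wait: SubmitFn) -> List[str]:
    """Run tasks lowest sequence first and collate their outputs."""
    pending: List[JobTask] = []
    finished: List[JobTask] = []
    batch = [JobTask(name="task 0", sequence=0)]  # the job starts with task 0

    while batch:
        for task in batch:
            # No retry or node-failure handling: that is left to Crunch2.
            pending.extend(submit_and_wait(task))
            finished.append(task)
        # Move every pending task with the lowest sequence into the next batch.
        lowest = min((t.sequence for t in pending), default=None)
        batch = [t for t in pending if t.sequence == lowest]
        pending = [t for t in pending if t.sequence != lowest]

    # Collate non-empty task outputs, in completion order, into a job output.
    return [t.output for t in finished if t.output]


def fake_submit(task: JobTask) -> List[JobTask]:
    """Toy child job: records an output and sometimes queues more tasks."""
    task.output = f"out:{task.name}"
    if task.name == "task 0":
        return [JobTask("b", 2), JobTask("a", 1), JobTask("c", 1)]
    if task.name == "a":
        return [JobTask("d", 2)]
    return []


if __name__ == "__main__":
    print(run_crunch1_job(fake_submit))
    # prints ['out:task 0', 'out:a', 'out:c', 'out:b', 'out:d']
```

In this toy run, task 0 queues tasks at sequences 1 and 2; the sequence-1 tasks run first, one of them queues another sequence-2 task, and both sequence-2 tasks run last. How the real runner would collate outputs (for example, into a Keep collection) is not specified above, so the list of strings is purely a placeholder.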