Revision 1 (Tom Clegg, 10/06/2014 03:00 PM) → Revision 2/4 (Tom Clegg, 10/08/2014 03:28 PM)

{{>toc}} 

 h1=. Reusable tasks 

 p>. *"Tom Clegg":mailto:tom@curoverse.com 
 Last Updated: October 6, 2014* 

 h2. Overview 

 h3. Objective 

 Say jobs A and B, although not identical, have some tasks in common. Job A is complete. Job B is starting now. They use the same script, version, docker image, etc. The only difference between A and B is that B's input collection has one more file; the rest of the files are identical. The script processes each input file independently, and it is a pure function (re-computing the same files will produce the same result). This means most of Job B's work has already been done. Task re-use will allow Arvados to recognize this condition and re-use the outputs of Job A's tasks instead of recomputing them. 

 Task re-use will not attempt to detect equivalence conditions like differently-encoded collection manifests with identical data, differing git commits with identical trees, and differing docker images with functionally equivalent content. 

 The intended audience for this document is software engineers. 

 h3. Background 

 The arvados.v1.jobs.create API offers a find_or_create feature, which searches for an existing job that meets criteria specified by the client (e.g., same script, compatible script_version) and additional criteria (e.g., did not fail, is not marked impure/nondeterministic, does not disagree with other jobs passing the same criteria about what the correct output is). 

 * http://doc.arvados.org/api/methods/jobs.html#create 

 h3. Alternatives 

 Always recompute each task (i.e., leave existing behavior). 

 bq. This makes desirable use cases prohibitively expensive. 

 Use smaller jobs, and more jobs per pipeline. 

 bq. We could make the dynamic-structure capabilities of crunch jobs available at the pipeline level, and de-emphasize or stop using the features that encourage long-running jobs. Disadvantages include: 
 * The process of running a pipeline is not done in a controlled environment. This effectively reduces the utility of reproducibility and provenance features. 
 * Pipelines are currently encoded as JSON, which is awkward to use as a DSL. 

 h3. Tradeoffs 

 _TODO_ 

 h3. High Level Design 

 Before executing a job_task that qualifies for re-use, crunch-job uses the API to discover existing job_tasks that are functionally identical, are marked as "pure", and have already finished. 
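The high-level flow (look for a finished, pure, functionally identical task before executing) can be sketched in Python as follows. This is an illustrative in-memory model only, not crunch-job code: the @Task@ class, the @state@ values, and the choice of fields that make up a task's identity are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Task:
    # Hypothetical in-memory stand-in for a job_task record.
    uuid: str
    script: str
    script_version: str
    docker_image: str
    parameters: tuple            # hashable stand-in for the task parameters
    is_pure: bool = False
    state: str = "todo"          # assumed states: "todo", "done", "failed"
    output: Optional[str] = None


def identity(task: Task) -> tuple:
    """Fields that must match exactly for two tasks to count as identical."""
    return (task.script, task.script_version, task.docker_image, task.parameters)


def find_reusable(new_task: Task, existing: Iterable[Task]) -> Optional[Task]:
    """Return a finished pure task identical to new_task, or None."""
    if not new_task.is_pure:
        return None
    for old in existing:
        if old.is_pure and old.state == "done" and identity(old) == identity(new_task):
            return old
    return None


def run_or_reuse(new_task: Task, existing: Iterable[Task],
                 execute: Callable[[Task], str]) -> str:
    """Reuse a matching task's output if possible, otherwise execute the task."""
    match = find_reusable(new_task, existing)
    if match is not None:
        return match.output
    return execute(new_task)
```

Identity here is a literal comparison of field values, which is consistent with the stated non-goal of detecting equivalent but differently encoded inputs.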

 h2. Specifics 

 h3. Detailed Design 

 The JobTask schema has a new boolean flag @is_pure@ (not null, default @false@). 

 Just before starting a task having @is_pure==true@, crunch-job does an API query to look up other tasks with @is_pure==true@ and identical inputs, parameters, script_version, etc. 
 * Some attributes like script and script_version are currently stored in the job record, not the job_task record. This will make the lookup interesting, in the absence of a generic "join" API. 
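One way to work around the missing "join" is a two-step lookup on the client side: first select the jobs whose job-level attributes match, then select the tasks that belong to those jobs. The Python sketch below models records as plain dicts; the function name and the field names @job_uuid@ and @finished@ are assumptions for illustration, and no real API calls are made.

```python
def lookup_pure_tasks(jobs, tasks, script, script_version, parameters):
    """Emulate a job/job_task join with two filtering passes.

    jobs and tasks are lists of dicts standing in for API records.
    """
    # Step 1: job-level attributes (script, script_version) live on the job.
    matching_job_uuids = {
        job["uuid"]
        for job in jobs
        if job["script"] == script and job["script_version"] == script_version
    }
    # Step 2: task-level attributes are checked on tasks of those jobs only.
    return [
        task
        for task in tasks
        if task["job_uuid"] in matching_job_uuids
        and task["is_pure"]
        and task["finished"]
        and task["parameters"] == parameters
    ]
```

With many jobs this costs one extra round trip and a potentially large intermediate list of job UUIDs, which is the kind of awkwardness a server-side join would avoid.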

 Job tasks have one especially noteworthy side effect: queueing additional tasks. In order to reuse tasks safely without races, we need additional constraints: 
 * Tasks with @is_pure==true@ cannot queue additional tasks, *and* @is_pure@ cannot change from @false@ to @true@. 
 * Tasks do not qualify for reuse until they have completed.[1] When reusing a task, copy (and reset to "todo" state) each task whose @created_by_job_task_uuid@ attribute references the task being reused. 
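The two mechanical rules in this list (no @false@ to @true@ transition for @is_pure@, and copying child tasks in "todo" state when a completed task is reused) can be sketched in Python as below. Records are plain dicts, and several details are assumptions rather than part of the design: the @state@ and @job_uuid@ fields, clearing the copy's @uuid@ so the server can assign one, and pointing the copy's @created_by_job_task_uuid@ at the reusing task.

```python
class PurityError(ValueError):
    """Raised when is_pure would change from false to true."""


def set_is_pure(task, value):
    # Allowed: true -> false, or no change. Forbidden: false -> true.
    if value and not task["is_pure"]:
        raise PurityError("is_pure cannot change from false to true")
    task["is_pure"] = value


def reuse_task(new_task, old_task, all_tasks):
    """Copy old_task's output into new_task and return copies of its children."""
    if old_task["state"] != "done":
        raise ValueError("only completed tasks qualify for reuse")
    new_task["output"] = old_task["output"]
    new_task["state"] = "done"
    copies = []
    for child in all_tasks:
        if child["created_by_job_task_uuid"] == old_task["uuid"]:
            clone = dict(child)
            clone["uuid"] = None                      # assumed: assigned on create
            clone["job_uuid"] = new_task["job_uuid"]  # assumed: belongs to the new job
            clone["created_by_job_task_uuid"] = new_task["uuid"]
            clone["state"] = "todo"                   # reset so the copy runs (or is reused)
            clone["output"] = None
            copies.append(clone)
    return copies
```

The copy step only has an effect when the reused task has child tasks; otherwise @reuse_task@ returns an empty list.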

 fn1. At least in the short term, this constraint is a good way to limit the complexity of implementation without sacrificing too much of the user benefit. 

 h3. Code Location 

 @sdk/cli/bin/crunch-job@ will have new task reuse logic. 

 @services/api/db/migrate@ will have a new migration, which will be reflected in @services/api/db/structure.sql@. 

 @services/api/app/models/job_task.rb@ will add :is_pure to the API response and prohibit @is_pure@ from changing from @false@ to @true@. 

 @doc/api/schema/JobTask.html.textile.liquid@ will document the :is_pure flag. 

 h3. Testing Plan 

 _TODO_ 

 h3. Logging 

 @crunch-job@ will log the fact that it has copied its output attribute (and, if applicable, queued additional tasks) from an existing completed task. 
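As a rough illustration of what such a log entry could contain, the Python sketch below builds and emits one message per reused task. The message wording, the logger name, and the function names are hypothetical; the design does not specify a log format.

```python
import logging

logger = logging.getLogger("crunch-job")  # hypothetical logger name


def reuse_log_message(task_uuid, source_task_uuid, queued_count=0):
    """Build the log line for a task whose output was copied from another task."""
    msg = f"task {task_uuid}: copied output from completed task {source_task_uuid}"
    if queued_count:
        msg += f"; queued {queued_count} additional task(s)"
    return msg


def log_reuse(task_uuid, source_task_uuid, queued_count=0):
    """Emit the reuse message at INFO level and return it for inspection."""
    msg = reuse_log_message(task_uuid, source_task_uuid, queued_count)
    logger.info(msg)
    return msg
```

Including both task UUIDs in the message makes it possible to trace a reused output back to the task that actually computed it.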

 h3. Debugging 

 _TODO_ 

 h3. Caveats 

 To be determined. 

 h3. Security Concerns 

 _TODO_ 

 h3. Open Questions and Risks 

 _TODO_ 

 h3. Work Estimates 

 _TODO_ 

 h3. Future Work 

 _TODO_ 

 h3. Revision History 

 |_.Date              |_.Revisions Made |_.Author              |_.Reviewed By       | 
 | October 6, 2014 | Initial Draft           | Tom Clegg |=. ----                |