Feature #6858
closed
[Documentation] Document the necessary steps to re-run jobs without computing again
Description
Write that you need to explicitly state the git commit hash (script_version), the arvados_sdk_version, and the docker_image hash in your pipeline template in order for jobs not to re-compute (along with the obvious requirements, such as not changing your input).
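As a rough illustration of the pinning described above, the sketch below scans a pipeline template (as a Python dict) and reports fields that are not set to a fixed hash. The template layout (`components`, `runtime_constraints`), the hash formats, the function name `unpinned_fields`, and every value in the example are assumptions made for illustration; none of them are confirmed by this ticket.

```python
import re

# Assumed formats (not confirmed by this ticket): a full git commit hash is
# 40 hex characters; a Docker image content address is 32 hex chars + "+size".
GIT_HASH = re.compile(r"^[0-9a-f]{40}$")
CONTENT_ADDRESS = re.compile(r"^[0-9a-f]{32}\+\d+$")


def unpinned_fields(template):
    """Return sorted (component, field) pairs whose value is not a fixed hash."""
    problems = []
    for name, component in template.get("components", {}).items():
        constraints = component.get("runtime_constraints", {})
        checks = [
            ("script_version", component.get("script_version"), GIT_HASH),
            ("arvados_sdk_version", constraints.get("arvados_sdk_version"), GIT_HASH),
            ("docker_image", constraints.get("docker_image"), CONTENT_ADDRESS),
        ]
        for field, value, pattern in checks:
            # A missing value or a movable name (branch, image tag) is unpinned.
            if not isinstance(value, str) or not pattern.match(value):
                problems.append((name, field))
    return sorted(problems)


# Hypothetical template: every name and hash below is made up for illustration.
example_template = {
    "components": {
        "job1": {
            "script": "hash.py",
            "script_version": "master",  # a branch name can move, so it is not pinned
            "runtime_constraints": {
                "arvados_sdk_version": "0123456789abcdef0123456789abcdef01234567",
                "docker_image": "arvados/jobs",  # an image name, not a hash
            },
        }
    }
}

print(unpinned_fields(example_template))
```

In this made-up example, `script_version` and `docker_image` would be reported because a branch name and an image name can both point at different content over time, while the SDK version is already a full commit hash.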
Updated by Brett Smith over 9 years ago
- Category set to Documentation
- Target version set to 2015-08-19 sprint
Suggest adding this to "Writing a pipeline template."
Updated by Radhika Chippada over 9 years ago
- Assigned To changed from Bryan Cosca to Radhika Chippada
Updated by Bryan Cosca over 9 years ago
- Assigned To changed from Radhika Chippada to Bryan Cosca
Updated by Bryan Cosca over 9 years ago
Radhika,
I added a note on the bottom of doc/user/tutorials/running-external-program.html about job reproducibility in pipeline templates.
Updated by Radhika Chippada over 9 years ago
- Is there an extra “and” in the description of the ticket (... arvados_sdk_version, docker_image hash and in your pipeline template in order for jobs not to re-compute ... )?
- Can you please make it a separate section similar to “Running your pipeline”? I think making it a separate section and adding some description of what this section is aiming for would be helpful.
- Something like, “you can reuse jobs and thus save on computing time and resources …”, “if you want to rerun the job with the same script and input etc …”, “you can reuse only portions of the pipeline, for example, reuse job1 but rerun job2 since something about job2 changed …” etc.
- Also, your current description is good as well and please include this also along with what I listed.
- Also, please specify that the reuse can only be done when inputs did not change
- “"arvados_sdk_version" : The latest version can be found on the Arvados Python sdk repository under Latest revisions.”
- Does it make sense to add this or a variation of this in the previous section, in the runtime_constraints area where arvados_sdk_version is introduced (“"arvados_sdk_version" specifies a version of the Arvados SDK to load alongside the job’s script”)? I am thinking of expanding this pointer to say, “the example uses ‘master’. If you would like to use a specific version of the sdk, you can find it in the Arvados Python sdk repository …”.
- Also, please expand the explanation of this in this new section as well
- script_version: can you please explain this a bit more? In the previous section, we say “These parameters are described in more detail in Writing a script”, but when I go there, I did not really find much explanation about what it is. Something along the lines of “this is the version of your script …” and “the version info can be found in your git repository …”. You already talked about finding it in the user's git repository, but I think a bit more verbosity would help.
- docker_image: can you please copy and paste the explanation about this from the previous section into this? You can reword or expand if you would like to.
Updated by Bryan Cosca over 9 years ago
Radhika Chippada wrote:
- Is there an extra “and” in the description of the ticket (... arvados_sdk_version, docker_image hash and in your pipeline template in order for jobs not to re-compute ... )?
yes
- Can you please make it a separate section similar to “Running your pipeline”? I think making it a separate section and adding some description of what this section is aiming for would be helpful.
done
- Something like, “you can reuse jobs and thus save on computing time and resources …”, “if you want to rerun the job with the same script and input etc …”, “you can reuse only portions of the pipeline, for example, reuse job1 but rerun job2 since something about job2 changed …” etc.
- Also, your current description is good as well and please include this also along with what I listed.
- Also, please specify that the reuse can only be done when inputs did not change
I changed the wording and tried to keep it concise.
- “"arvados_sdk_version" : The latest version can be found on the Arvados Python sdk repository under Latest revisions.”
- Does it make sense to add this or a variation of this in the previous section, in the runtime_constraints area where arvados_sdk_version is introduced (“"arvados_sdk_version" specifies a version of the Arvados SDK to load alongside the job’s script”)? I am thinking of expanding this pointer to say, “the example uses ‘master’. If you would like to use a specific version of the sdk, you can find it in the Arvados Python sdk repository …”.
I added the change in the top example as well as the new section.
- Also, please expand the explanation of this in this new section as well
- script_version: can you please explain this a bit more? In the previous section, we say “These parameters are described in more detail in Writing a script”, but when I go there, I did not really find much explanation about what it is. Something along the lines of “this is the version of your script …” and “the version info can be found in your git repository …”. You already talked about finding it in the user's git repository, but I think a bit more verbosity would help.
I added another sentence about this.
- docker_image: can you please copy and paste the explanation about this from the previous section into this? You can reword or expand if you would like to.
done
Updated by Radhika Chippada over 9 years ago
- Can you please remove the extra “and” from description then?
- “This section shows what parameters you need to version control in order” => how about something like “this section shows which version control parameters should be tuned to make sure …”
- arvados_sdk_version: can you also please add something like “make sure you set this to the same version as the previous run that you are trying to reuse …”
- “the crunch script resides” => “the crunch script resides in” ?
- “where job’s run their scripts” => should it be jobs, not job’s?
- “Docker version control is similar to git, you can commit and push changes to your images” => “Docker version control is similar to git, and you can commit and push changes to your images”
- “In order to version control your docker image on arvados, you must use the docker image hash which is found on the Collection page as the Content address”. This is still confusing. Can we break it into two sentences, something like “… you must reuse the docker image hash from the previous run. It can be found on the Collection page as the Content address …” ?
Updated by Bryan Cosca over 9 years ago
Radhika Chippada wrote:
- Can you please remove the extra “and” from description then?
- “This section shows what parameters you need to version control in order” => how about something like “this section shows which version control parameters should be tuned to make sure …”
- arvados_sdk_version: can you also please add something like “make sure you set this to the same version as the previous run that you are trying to reuse …”
- “the crunch script resides” => “the crunch script resides in” ?
- “where job’s run their scripts” => should it be jobs, not job’s?
- “Docker version control is similar to git, you can commit and push changes to your images” => “Docker version control is similar to git, and you can commit and push changes to your images”
- “In order to version control your docker image on arvados, you must use the docker image hash which is found on the Collection page as the Content address”. This is still confusing. Can we break it into two sentences, something like “… you must reuse the docker image hash from the previous run. It can be found on the Collection page as the Content address …” ?
all completed in 4f4d8ae4e69ed5e374990ec56090e7d4b8926b6b
Updated by Radhika Chippada over 9 years ago
There is an extra space character after "This section shows which version control parameters should be tuned to make sure Arvados will not re-compute your jobs". Please remove this.
LGTM. Thanks.
Updated by Bryan Cosca over 9 years ago
- Status changed from New to Resolved
Applied in changeset arvados|commit:8089b2f5c97b1db9bd826a1b6488f1b060830def.