Bug #5287

Port job submission to use Arvados

Added by Abram Connelly over 4 years ago. Updated about 4 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
-
Target version:
Start date:
02/20/2015
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Resolution:
Story points:
-

Description

GET-Evidence currently has vestigial code pointing to Free Factories along with code that relies on the 'warehouse' datastore. Port these to Arvados (Crunch and Keep).


Subtasks

Task #5293: Code review 'arvados-migration' (Resolved, Ward Vandewege)

Associated revisions

Revision 9b277b44
Added by Abram Connelly about 4 years ago

Merge branch 'arvados-migration'

closes #5287

Revision 223e8d06
Added by Abram Connelly about 4 years ago

bug fixes for production deployment, refs #5287

History

#1 Updated by Abram Connelly over 4 years ago

Some comments on the updated GET-Evidence code:

- The '$HOME/.config/arvados/settings.conf' file needs to exist with appropriate keys. We need to discuss what kind of account this should be. For now this is my (abram) account.

- Until 'arv-run-pipeline-instance' is updated to use 'settings.conf', a '$HOME/.config/arvados/settings.sh' file needs to exist that exports the appropriate Arvados API keys so that they get picked up by 'arv-run-pipeline-instance'.

- The pipeline consists of two 'legs': the first generates the GFF with the initial annotations and the initial report, and the second 'refreshes' the report. As of 2015-02-20 this takes roughly 30 minutes on Google Compute Platform from start to finish. Since 30 minutes is relatively quick, partial progress reporting is effectively disabled, and Tapestry will show 'unknown' until the job has finished (either successfully or unsuccessfully).

- Pipeline submission requires the filename appended to the portable data hash. For example 'cafecafecafecafecafecafecafecafe+255/filetoprocess.tsv.bz2', as opposed to how it was previously (just requiring the portable data hash and nothing else).

- The source file download functionality now makes two calls to the Arvados API: the first gets the manifest to determine the file length, and the second redirects the 'arv-get' output for the download. If the 'input.locator' symlink file does not have a file name appended to the portable data hash, the code is smart enough to find the appropriate file anyway, so it supports the new 'input.locator' symlinks as well as the old ones.

- The '/home/trait/upload/<PDH>-out' directory gets populated on successful pipeline completion and after a 'status' call to GET-Evidence has been issued. This means the first 'status' call after pipeline completion might take a while, because it has to download the data and populate the directory.
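
To illustrate the locator format described above, here is a minimal shell sketch that splits a locator into its portable data hash and file name, and that also accepts old-style locators with no file name. The function names 'locator_pdh' and 'locator_file' are hypothetical and are not taken from the GET-Evidence code; how GET-Evidence actually resolves a missing file name is not shown here.

```shell
# Hypothetical helpers; names are illustrative, not from GET-Evidence.

# Print the portable data hash: everything before the first '/'.
locator_pdh() {
  printf '%s\n' "${1%%/*}"
}

# Print the file name: everything after the first '/'.
# Old-style locators (portable data hash only) yield an empty line,
# which a caller could treat as "look the file up in the manifest".
locator_file() {
  case "$1" in
    */*) printf '%s\n' "${1#*/}" ;;
    *)   printf '\n' ;;
  esac
}
```

For example, 'locator_file cafecafecafecafecafecafecafecafe+255/filetoprocess.tsv.bz2' prints 'filetoprocess.tsv.bz2', while the same call on a bare portable data hash prints an empty line.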

#2 Updated by Ward Vandewege about 4 years ago

  • Target version set to Upgrade work

#3 Updated by Ward Vandewege about 4 years ago

Reviewing 8636fddafed51ba10720d9e40a29401a4eb8ca33

A few small comments:

- an empty line is being introduced at the end of public_html/lib/whpipeline.php, please remove it
- in public_html/submit_GE_pipeline.php, this

opt="" 
opt=" $opt GenomeAnalyzer::INPUT_SAMPLE=$input_sample" 

export HOME="/home/trait" 
z=`. $HOME/.config/arvados/settings.sh && arv-run-pipeline-instance --submit --template $template_uuid $opt`
echo $z

should probably just become

opt="GenomeAnalyzer::INPUT_SAMPLE=$input_sample" 

export HOME="/home/trait" 
z=`. $HOME/.config/arvados/settings.sh && arv-run-pipeline-instance --submit --template $template_uuid $opt`
echo $z

Also; shouldn't that be

$HOME/.config/arvados/settings.conf

?

- In public_html/lib/genome_display.php I see these:

        $cmd = 'export HOME=/home/trait && arv pipeline_instance get --uuid '.escapeshellarg($uuid);
        $pipeline = json_decode(shell_exec($cmd), true);
              $cmd = 'export HOME=/home/trait && . $HOME/.config/arvados/settings.conf && flock --wait 1 --exclusive --nonblock '

Using putenv once would allow removing of both export statements.

In the second line, I'm pretty sure the explicit loading of settings.conf is unnecessary since all that seems to follow are a few calls to arv-get, which will discover that file automatically.

- the hardcoding of the pipeline template uuid in public_html/submit_GE_pipeline seems suboptimal. Maybe that should go in a configuration parameter?

- finally; maybe you should add a small README that explains what the dependencies are for this functionality. It should mention that a .config/arvados/settings.conf file is needed, the arv tools need to be installed.
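
The simplified submission step suggested above for public_html/submit_GE_pipeline.php could look roughly like the following shell sketch. It only rearranges the commands already quoted in this review (sourcing 'settings.sh', then calling 'arv-run-pipeline-instance --submit --template'), with quoting added and '$(...)'-friendly structure. The function name 'submit_pipeline' and the 'ARV_SUBMIT' override variable are assumptions introduced here so the function can be exercised without Arvados installed; they are not part of GET-Evidence or the Arvados tools.

```shell
# Sketch only: 'submit_pipeline' and ARV_SUBMIT are hypothetical names.
# The caller is expected to have exported HOME (e.g. HOME=/home/trait).
# The body runs in a subshell, so sourcing settings.sh does not leak
# API keys into the calling shell.
submit_pipeline() (
  template_uuid=$1
  input_sample=$2
  settings="$HOME/.config/arvados/settings.sh"

  # Stop if the API keys cannot be loaded, as the original '&&' chain did.
  . "$settings" || exit 1

  "${ARV_SUBMIT:-arv-run-pipeline-instance}" --submit \
    --template "$template_uuid" \
    "GenomeAnalyzer::INPUT_SAMPLE=$input_sample"
)
```

A caller could capture the output with 'z=$(submit_pipeline "$template_uuid" "$input_sample")' and echo it, as the original script does with backticks.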

#4 Updated by Abram Connelly about 4 years ago

- an empty line is being introduced at the end of public_html/lib/whpipeline.php, please remove it

Fixed

 

- in public_html/submit_GE_pipeline.php, this

opt="" 
opt=" $opt GenomeAnalyzer::INPUT_SAMPLE=$input_sample" 

export HOME="/home/trait" 
z=`. $HOME/.config/arvados/settings.sh && arv-run-pipeline-instance --submit --template $template_uuid $opt`
echo $z

should probably just become

opt="GenomeAnalyzer::INPUT_SAMPLE=$input_sample" 

export HOME="/home/trait" 
z=`. $HOME/.config/arvados/settings.sh && arv-run-pipeline-instance --submit --template $template_uuid $opt`
echo $z

Fixed

 

Also; shouldn't that be

$HOME/.config/arvados/settings.conf

?

See issue #5385 ("[SDKs] arv-run-pipeline-instance does not use 'settings.conf' file like other 'arv' tools").
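
As a rough illustration of what the 'settings.sh' workaround has to achieve, the sketch below exports the entries of a 'settings.conf' file into the environment. It assumes 'settings.conf' holds plain KEY=VALUE lines (such as ARVADOS_API_HOST and ARVADOS_API_TOKEN); the function name 'load_arvados_settings' is hypothetical, and this is not how GET-Evidence or the Arvados tools implement it.

```shell
# Sketch only: 'load_arvados_settings' is a hypothetical name.
# Assumes settings.conf consists of KEY=VALUE lines; blank lines,
# '#' comments, and keys that are not plain identifiers are skipped.
load_arvados_settings() {
  local conf="${1:-$HOME/.config/arvados/settings.conf}"
  local key value
  [ -r "$conf" ] || return 1
  # '|| [ -n "$key" ]' also handles a last line without a trailing newline.
  while IFS='=' read -r key value || [ -n "$key" ]; do
    case "$key" in
      ''|'#'*|[0-9]*|*[!A-Za-z0-9_]*) continue ;;
    esac
    export "$key=$value"
  done < "$conf"
}
```

Calling 'load_arvados_settings' before 'arv-run-pipeline-instance' would then make the keys visible to it through the environment, which is the same effect the hand-written 'settings.sh' provides.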

 

- In public_html/lib/genome_display.php I see these:

$cmd = 'export HOME=/home/trait && arv pipeline_instance get --uuid '.escapeshellarg($uuid);
$pipeline = json_decode(shell_exec($cmd), true);
$cmd = 'export HOME=/home/trait && . $HOME/.config/arvados/settings.conf && flock --wait 1 --exclusive --nonblock '

Using putenv once would allow removing of both export statements.

In the second line, I'm pretty sure the explicit loading of settings.conf is unnecessary since all that seems to follow are a few calls to arv-get, which will discover that file automatically.

Fixed

 

- the hardcoding of the pipeline template uuid in public_html/submit_GE_pipeline seems suboptimal. Maybe that should go in a configuration parameter?

So in addition to settings.conf and settings.sh we have yet another config file? It presumably also lives in $HOME/.config/arvados? Is it a shell script? A JSON file that gets parsed? Should we make it more general for when we scrap the current PHP GE for something newer or should it be a one-off?

My opinion is that since the main motivation is to get some small group of genomes through Tapestry/GET-Evidence to give back to participants for approval, making the nice, more general solution can be delayed until we have a clearer vision of how to re-organize GET-Evidence. Until then, keeping a hard-coded pipeline template is not ideal but better than having a config file stuffed in at the last minute.

- finally; maybe you should add a small README that explains what the dependencies are for this functionality. It should mention that a .config/arvados/settings.conf file is needed, the arv tools need to be installed.

Will do. Where should this README be located?

#5 Updated by Abram Connelly about 4 years ago

submit_GE_pipeline now points to /home/trait/.config/arvados/config.json and gets the "get-evidence-pipeline" value to use as its project UUID.

INSTALL has been updated with instructions on how to install the Arvados command line tools and the required 'jq' tool, and on the config files and their locations.
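
A lookup of this kind can be sketched with 'jq' as follows. The sketch assumes config.json is a JSON object with "get-evidence-pipeline" as a top-level key; the actual file layout is not shown in this issue, and the function name 'pipeline_uuid_from_config' is hypothetical.

```shell
# Sketch only: 'pipeline_uuid_from_config' is a hypothetical name, and the
# top-level placement of "get-evidence-pipeline" is an assumption.
# Prints the configured value; returns non-zero if the key is missing.
pipeline_uuid_from_config() {
  local config="${1:-/home/trait/.config/arvados/config.json}"
  # -r prints the raw string; -e makes jq fail when nothing is produced.
  jq -r -e '."get-evidence-pipeline" // empty' "$config"
}
```

The quoted form '."get-evidence-pipeline"' is needed because the key contains hyphens, which jq would otherwise parse as subtraction.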

#6 Updated by Abram Connelly about 4 years ago

  • Status changed from New to Resolved
