Feature #426

Use compute cloud for back-end processing

Added by Ward Vandewege about 11 years ago. Updated almost 11 years ago.

Status: In Progress

We need to modify the background processing code so it can run on a "fresh" node:
  • Pre-process reference data (refFlat, hg18.2bit, hg19.2bit) and put it in warehouse storage
  • Make mr-get-evidence wrapper:
    • in step 0, scan the input, queue 1 jobstep per chromosome, and output the comments/metadata
    • fetch/extract the reference data (if not already extracted by previous jobstep)
    • grep for the desired chromosome, sort, do the rest of the processing
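The wrapper's two phases could look roughly like this in Python (the chromosome list, function names, and file handling are illustrative, not the actual mr-get-evidence code):

```python
# Illustrative sketch of the mr-get-evidence wrapper phases; the real
# mr-function's names and I/O differ.

CHROMOSOMES = ["chr%s" % c for c in list(range(1, 23)) + ["X", "Y", "M"]]

def step0(input_path):
    """Step 0: scan the input, collect the comment/metadata lines, and
    return one jobstep per chromosome to queue."""
    with open(input_path) as f:
        comments = [line for line in f if line.startswith("#")]
    return comments, [{"chromosome": c} for c in CHROMOSOMES]

def per_chromosome_step(input_path, chromosome):
    """Grep for the desired chromosome, then sort by start position
    (GFF column 4) before doing the rest of the processing."""
    rows = []
    with open(input_path) as f:
        for line in f:
            cols = line.rstrip("\n").split("\t")
            if cols[0] == chromosome:
                rows.append(cols)
    rows.sort(key=lambda r: int(r[3]))
    return rows
```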

We should still support single-node installations. For this case we need a mechanism to prevent the server from overtaxing itself if many jobs are submitted at once (e.g., by default, max # concurrent jobs = # cpus).

  • Possible solution: try to flock() one of N lockfiles in /home/trait/lock/slot.X. If all are already locked, wait a random number of seconds and try again. When a flock() succeeds, start the job (pass the lock file descriptor to the job process, so the lock is released when the process exits).
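A rough Python sketch of that slot-locking scheme (the lock directory and timeout here are placeholders, not the proposed /home/trait/lock values):

```python
import fcntl, os, random, time

LOCK_DIR = "/tmp/trait-lock"      # stand-in for /home/trait/lock
N_SLOTS = os.cpu_count() or 1     # default: one slot per CPU

def acquire_slot(max_wait=60):
    """Try to flock() one of N lockfiles; return the open file handle
    (keeping it open holds the lock), or None if we time out."""
    os.makedirs(LOCK_DIR, exist_ok=True)
    deadline = time.time() + max_wait
    while time.time() < deadline:
        for i in range(N_SLOTS):
            fh = open(os.path.join(LOCK_DIR, "slot.%d" % i), "w")
            try:
                fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
                # Pass fh to the job process; the lock is released
                # automatically when that process exits.
                return fh
            except BlockingIOError:
                fh.close()        # slot busy, try the next one
        time.sleep(random.uniform(0.5, 3))  # random backoff, then retry
    return None
```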

The xmlrpc server should be replaced with a job queue. The web GUI should submit a job by inserting a row into a MySQL table.
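For illustration, a possible queue schema and the insert the web GUI would perform (sqlite3 stands in for MySQL here, and the table and column names are hypothetical):

```python
import sqlite3, time

# In-memory sqlite3 as a stand-in for the MySQL queue table.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE job_queue (
    id INTEGER PRIMARY KEY,
    dataset_hash TEXT NOT NULL,
    submitted_at REAL NOT NULL,
    batch_job_id TEXT,            -- cloud job number J, once submitted
    batch_queued_at REAL,         -- queue time of the batch job
    state TEXT NOT NULL DEFAULT 'queued')""")

def submit_job(dataset_hash):
    """What the web GUI does instead of an xmlrpc call: insert a row."""
    conn.execute(
        "INSERT INTO job_queue (dataset_hash, submitted_at) VALUES (?, ?)",
        (dataset_hash, time.time()))
    conn.commit()
```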

The background service (probably running on the same machine as the web GUI) will check the queue every few seconds (and when triggered by the web GUI via a named socket or similar). For each job in the queue:
  • Just delete it if we've already started/queued a process for this dataset.
  • If cloud processing is available, submit a batch job and note the job number J and the queue time
  • Start a local job if local processing slots are available and...
    • cloud processing is not available, or
    • a batch job was submitted for this data set but failed, or
    • a batch job was submitted for this data set >30 seconds ago and that job hasn't started yet (cloud is busy)
  • If the batch job J for this data set has succeeded:
    • Make a symlink or something in {hash}-out/ so the web gui knows the results are available.
    • Delete the queue entry.
    • If there are some results in {hash}-out/ns.gff.gz etc. from previous analyses, delete them.
    • Get a local copy of the get-evidence.json file from the warehouse, but wait to get the other stuff from the warehouse until someone downloads them.
  • Copy the uploaded data to the cloud in the background service, while checking for new items in the queue. Make a symlink genotype.gff.archive -> warehouse:///{hash}/input.gff.gz
  • If user provides a warehouse:/// url instead of file:///, just make the genotype.gff.archive symlink instead of copying the file to local storage.
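The decision rules above could be captured in a function like this (the queue-row fields and action names are invented for illustration; the real service would act on the returned action):

```python
import time

CLOUD_TIMEOUT = 30  # seconds before falling back to local processing

def next_action(job, cloud_available, local_slot_free, now=None):
    """Decide what the background service should do with one queue entry.
    `job` is a dict mirroring a (hypothetical) queue row."""
    now = now if now is not None else time.time()
    if job.get("already_started"):
        return "delete"            # duplicate: just drop the entry
    if job.get("batch_state") == "succeeded":
        return "publish"           # symlink results, delete queue entry
    if cloud_available and job.get("batch_job_id") is None:
        return "submit_batch"      # note job number J and queue time
    if local_slot_free and (
        not cloud_available
        or job.get("batch_state") == "failed"
        or (job.get("batch_queued_at") is not None
            and job.get("batch_state") == "queued"
            and now - job["batch_queued_at"] > CLOUD_TIMEOUT)
    ):
        return "start_local"       # no cloud, batch failed, or cloud busy
    return "wait"
```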

Related issues

Blocked by GET-Evidence - Feature #521: Convert gff_getevidence_map.py to use flatfile instead of MySQL (Resolved, 02/16/2011)

Associated revisions

Revision 940417cc (diff)
Added by Madeleine Ball almost 11 years ago

Convert gff_getevidence_map.py to use a flatfile

'gff_getevidence_map.py': This previously used MySQL queries to look
for GET-Evidence hits. This was the only remaining module used by
the genome analysis server that used MySQL - this was blocking us from
sending genome processing jobs to the cloud. It now uses a
JSON-formatted flat file containing GET-Evidence data. In addition
it can now be called as a GFF-string generator, allowing it to be
added to our generator chaining.

'download.php': A function is added to generate the JSON-formatted
file that gff_getevidence_map.py needs.

'Makefile' was updated to generate the file using 'download.php' when
'make daily' is run.

'INSTALL': Instructions for setting up the 'make daily' cron job and
an initial run of it are added.

'UPGRADE': Instruction to run 'make daily' is added.

'.gitignore': A line to ignore the file made is added.

'config.default.py': A name referring to the file is added.

'trait-o-matic-server.py': Updated to correctly call the new

This commit fixes #521 and references #426
Because latest-flat.tsv wasn't used, it also closes #520

Revision 1b7a5103 (diff)
Added by Tom Clegg over 9 years ago

use whpipeline to process data submitted via API. closes #947 refs #426


#1 Updated by Tom Clegg almost 11 years ago

  • Subject changed from we need a job queuing system in t-o-m to Use compute cloud for back-end processing
  • Status changed from New to In Progress

#2 Updated by Tom Clegg almost 11 years ago

  • Assigned To set to Tom Clegg

#3 Updated by Tom Clegg almost 11 years ago

Current state of affairs:

There is a mr-function called "get-evidence" (should be renamed to "genome-analyzer"?) which performs the slow genome_analyzer step.

Example: process the 40 public CGI genomes (in "var" format, *.tsv.bz2), using 11 nodes, and (the default of) 4 concurrent jobs per node:

wh job new \
  mrfunction=get-evidence \
  inputkey=b08ef11569d00ba65f3449e667624565+11052+K@templeton \
  DATA_TARBALL=a8948d1a428977c9dce50415b2e5938b+1476+K@templeton/analysis_data.tar.gz \
  GETEV_JSON=b3bcf3eb95f7cc890fba996056cc8ee3+86+K@templeton/getev-latest.json.gz \
  GET_VERSION=c24081445fbb41b0a8d50d3abf57efb48116eb1a \
  GIT_REPO=git://git.clinicalfuture.com/get-evidence.git \
  nodes=11

You'll get a job number back.


Wait for it to finish, and print the output locator:

wh job wait id=43207

(This took ~5 hours but probably should have taken much less; one node kept segfaulting and held up the show.)

Get a list of directories in the output:

whget 958564ef2f7de93025327a5efce3a7bc+12524+K@templeton | cut -f1 -d" "

There is one directory per input file, named {sha256(inputfile)}-out, sort of like you'd get in your "upload" directory (should change sha256 to sha1?). Each contains get-evidence.json, metadata.json, missing_codon.json, and ns.gff.gz.
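For reference, the {sha256(inputfile)}-out directory name can be computed like this (function name is illustrative):

```python
import hashlib

def out_dir_name(path):
    """Return {sha256(inputfile)}-out for a given input file, reading
    the file in chunks so large genomes don't load into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest() + "-out"
```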

View a get-evidence.json file:

whless $out/$dir/get-evidence.json

Retrieve the whole result set:

mkdir /tmp/out
whget -rv 958564ef2f7de93025327a5efce3a7bc+12524+K@templeton/ /tmp/out/
