Feature #426


Use compute cloud for back-end processing

Added by Ward Vandewege over 11 years ago. Updated about 11 years ago.

In Progress


We need to modify the background processing code so it can run on a "fresh" node:
  • Pre-process reference data (refFlat, hg18.2bit, hg19.2bit) and put it in warehouse storage
  • Make mr-get-evidence wrapper:
    • in step 0, scan the input, queue 1 jobstep per chromosome, and output the comments/metadata
    • fetch/extract the reference data (if not already extracted by previous jobstep)
    • grep for the desired chromosome, sort, do the rest of the processing
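
A minimal sketch of the two wrapper phases described above, assuming the input is tab-separated GFF with the chromosome in column 1 and the start position in column 4. The function names `scan_input` and `chromosome_records` are hypothetical, and the actual jobstep queuing call is left out because it depends on the warehouse API.

```python
def scan_input(lines):
    """Step 0: collect comment/metadata lines and the chromosomes present.

    Returns (comments, chromosomes); one jobstep would be queued per
    chromosome in the returned list.
    """
    comments = []
    chromosomes = {}  # dict used as an ordered set (insertion order kept)
    for line in lines:
        if line.startswith("#"):
            comments.append(line.rstrip("\n"))
        elif line.strip():
            chromosomes[line.split("\t", 1)[0]] = True
    return comments, list(chromosomes)


def chromosome_records(lines, chromosome):
    """Per-chromosome jobstep: keep matching records, sorted by start."""
    rows = [
        line.rstrip("\n").split("\t")
        for line in lines
        if not line.startswith("#") and line.split("\t", 1)[0] == chromosome
    ]
    rows.sort(key=lambda fields: int(fields[3]))  # column 4 = start position
    return ["\t".join(fields) for fields in rows]
```

Step 0 would call `scan_input` once on the whole input; each queued jobstep would then call `chromosome_records` for its own chromosome before the rest of the processing.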

We should still support single-node installations. For this case we need a mechanism to prevent the server from overtaxing itself if many jobs are submitted at once (e.g., by default, the maximum number of concurrent jobs equals the number of CPUs).

  • Possible solution: Try to flock() one of N lockfiles, /home/trait/lock/slot.X. If all are already locked, wait a random number of seconds and try again. When a flock succeeds, start the job and pass the lock to the job process, so the lock is released when the process quits.
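
The slot idea can be sketched roughly as follows, assuming a POSIX system and Python's standard `fcntl` module. The function name `acquire_slot` and its `max_wait` and `retries` parameters are hypothetical; the lock directory would be /home/trait/lock in the layout described above.

```python
import fcntl
import os
import random
import time


def acquire_slot(lock_dir, n_slots, max_wait=5.0, retries=None):
    """Return a file descriptor holding an exclusive flock on a free slot.

    Tries slot.0 .. slot.(n_slots-1) without blocking. If every slot is
    locked, sleeps a random time up to max_wait seconds and tries again.
    Returns None once `retries` rounds have failed (None = retry forever).
    """
    attempt = 0
    while retries is None or attempt < retries:
        for i in range(n_slots):
            path = os.path.join(lock_dir, "slot.%d" % i)
            fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
            try:
                fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
                return fd  # lock is held for as long as this fd stays open
            except BlockingIOError:
                os.close(fd)  # slot is busy, try the next one
        attempt += 1
        time.sleep(random.uniform(0, max_wait))
    return None
```

To hand the lock to the job, the descriptor could be passed with `subprocess.Popen(cmd, pass_fds=(fd,))` and then closed in the parent. Because a flock belongs to the open file description, it is released only when the job process exits and its inherited copy is closed.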

The XML-RPC server should be replaced with a job queue. The web GUI should submit a job by inserting a row into a MySQL table.
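
A rough sketch of the submit-by-insert idea. It uses Python's built-in `sqlite3` as a stand-in for MySQL so that it runs without a server. The table name `job_queue`, its columns, and the function names are all hypothetical.

```python
import sqlite3
import time


def open_queue(path=":memory:"):
    """Open the queue database and create the (hypothetical) table."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS job_queue ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " dataset_hash TEXT NOT NULL,"
        " submitted_at REAL NOT NULL)"
    )
    return db


def submit_job(db, dataset_hash):
    """What the web GUI would do: insert one row and return its id."""
    with db:  # commits on success, rolls back on error
        cur = db.execute(
            "INSERT INTO job_queue (dataset_hash, submitted_at) VALUES (?, ?)",
            (dataset_hash, time.time()),
        )
    return cur.lastrowid


def pending_jobs(db):
    """What the background service would poll: rows in submission order."""
    return db.execute(
        "SELECT id, dataset_hash FROM job_queue ORDER BY id"
    ).fetchall()
```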

The background service (probably running on the same machine as the web GUI) will check the queue every few seconds, and also whenever the web GUI triggers it (via a named socket or something). For each job in the queue:
  • Just delete it if we've already started/queued a process for this dataset.
  • If cloud processing is available, submit a batch job and note job# J and queuetime
  • Start a local job if local processing slots are available and...
    • cloud processing is not available, or
    • a batch job was submitted for this data set but failed, or
    • a batch job was submitted for this data set >30 seconds ago and that job hasn't started yet (cloud is busy)
  • If the batch job J for this data set has succeeded:
    • Make a symlink or something in {hash}-out/ so the web gui knows the results are available.
    • Delete the queue entry.
    • If there are some results in {hash}-out/ns.gff.gz etc. from previous analyses, delete them.
    • Get a local copy of the get-evidence.json file from the warehouse, but wait to get the other stuff from the warehouse until someone downloads them.
  • Copy the uploaded data to the cloud in the background service, while checking for new items in the queue. Make a symlink genotype.gff.archive -> warehouse:///{hash}/input.gff.gz
  • If user provides a warehouse:/// url instead of file:///, just make the genotype.gff.archive symlink instead of copying the file to local storage.
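
One possible reading of the per-entry rules above, written as a pure decision function. The function name, the status strings, and the action names are hypothetical. The order in which the rules are checked is an assumption, since the list does not say which rule wins when several apply.

```python
BATCH_START_TIMEOUT = 30  # seconds; the ">30 seconds ago" rule above


def next_action(now, cloud_available, local_slot_free, duplicate=False,
                batch_status=None, batch_queued_at=None, local_started=False):
    """Decide what the background service should do with one queue entry.

    batch_status is None (no batch job submitted yet), "queued", "running",
    "failed" or "succeeded". `duplicate` means a process was already
    started or queued for this dataset.
    """
    if duplicate:
        return "delete"
    if batch_status == "succeeded":
        return "publish"  # symlink results, delete entry, clean old output
    if cloud_available and batch_status is None:
        return "submit_batch"  # caller records job number J and queue time
    if local_slot_free and not local_started:
        if not cloud_available:
            return "start_local"
        if batch_status == "failed":
            return "start_local"
        if (batch_status == "queued" and batch_queued_at is not None
                and now - batch_queued_at > BATCH_START_TIMEOUT):
            return "start_local"  # cloud is busy
    return "wait"
```

Keeping the decision free of side effects would let the polling loop be tested without a cloud or a database: for example, `next_action(131, True, True, batch_status="queued", batch_queued_at=100)` falls through to the "cloud is busy" rule.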

Related issues

Blocked by GET-Evidence - Feature #521: Convert to use flatfile instead of MySQL (Resolved, Madeleine Ball, 02/16/2011)

Actions #1

Updated by Tom Clegg over 11 years ago

  • Subject changed from we need a job queuing system in t-o-m to Use compute cloud for back-end processing
  • Status changed from New to In Progress
Actions #2

Updated by Tom Clegg over 11 years ago

  • Assigned To set to Tom Clegg
Actions #3

Updated by Tom Clegg about 11 years ago

Current state of affairs:

There is an mr-function called "get-evidence" (should be renamed to "genome-analyzer"?) which performs the slow genome_analyzer step.

Example: process the 40 public CGI genomes (in "var" format, *.tsv.bz2), using 11 nodes, and (the default of) 4 concurrent jobs per node:

wh job new \
mrfunction=get-evidence \
inputkey=b08ef11569d00ba65f3449e667624565+11052+K@templeton \
DATA_TARBALL=a8948d1a428977c9dce50415b2e5938b+1476+K@templeton/analysis_data.tar.gz \
GETEV_JSON=b3bcf3eb95f7cc890fba996056cc8ee3+86+K@templeton/getev-latest.json.gz \
GET_VERSION=c24081445fbb41b0a8d50d3abf57efb48116eb1a \
GIT_REPO=git:// \
nodes=11 \

You'll get a job number back.


Wait for it to finish, and print the output locator:

wh job wait id=43207

(This took ~5 hours but probably should have taken much less: one node kept segfaulting and holding up the show.)

Get a list of directories in the output:

whget 958564ef2f7de93025327a5efce3a7bc+12524+K@templeton | cut -f1 -d" "

There is one directory per input file, named {sha256(inputfile)}-out, sort of like you'd get in your "upload" directory (should change sha256 to sha1?). Each contains get-evidence.json, metadata.json, missing_codon.json, and ns.gff.gz.
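
For reference, a directory name of this form could be computed as in the sketch below. It assumes the hash is taken over the input file's contents, which the text above does not state explicitly. The function name `output_dir_name` is hypothetical.

```python
import hashlib


def output_dir_name(path, chunk_size=1 << 20):
    """Return "{sha256 of file contents}-out" for the given input file."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large genome files are not loaded into memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() + "-out"
```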

View a get-evidence.json file (with $out set to the output locator and $dir to one of the directories listed above):

whless $out/$dir/get-evidence.json

Retrieve the whole result set:

mkdir /tmp/out
whget -rv 958564ef2f7de93025327a5efce3a7bc+12524+K@templeton/ /tmp/out/
