Idea #3550
Closed
[SDKs] arv-run-pipeline-instance supports running jobs locally using arv-crunch-job
Updated by Ward Vandewege over 9 years ago
- Target version set to Arvados Future Sprints
Updated by Ward Vandewege over 9 years ago
- Target version changed from Arvados Future Sprints to 2014-09-17 sprint
Updated by Tom Clegg over 9 years ago
- Category set to SDKs
- Assigned To set to Tom Clegg
Updated by Tom Clegg over 9 years ago
Still some bugs to work out (perhaps arv-crunch-job has regressed). On lightning-dev2:
pipeline template: qr1hi-p5p6p-ya3t583ormtx53j
/tmp/arv-run-pipeline-instance --run-jobs-here --template qr1hi-p5p6p-ya3t583ormtx53j ChopGFF::CYTOBAND=crunch_scripts/data/ucsc.cytoband.hg19.txt ChopGFF::GFF_COLLECTION_LIST=e9e6a6eaa6dca4d5014f04ec8152a02f+66/test_huTileSets.list CreateTileSetFromChoppedGFF::CYTOBAND=crunch_scripts/data/ucsc.cytoband.hg19.txt CreateTileSetFromChoppedGFF::SEED=12345678 CreateTileSetFromChoppedGFF::REFFJ=bcc2937114336754892572e5751974d0+76578 CreateTileSetFromChoppedGFF::HG19FA=fee29077095fed2e695100c299f11dc5+2727
2014-08-29 20:18:15 +0000 -- pipeline_instance qr1hi-d1hrv-41ajivmwe9q1tnj
ChopGFF qr1hi-8i9sb-gs3stt9v1c9grx2 starting
CreateChoppedGFFDirList - -
CreateTileSetFromChoppedGFF - -
arv-run-pipeline-instance 19655: arv-crunch-job pid 19765 started
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 check slurm allocation
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 node localhost - 1 slots
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 start
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 Install revision 055c0532030c4d95c2767a6eb3438018a212ef35
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 Clean-work-dir exited 0
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 Install exited 0
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 script run-command
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 script_version 055c0532030c4d95c2767a6eb3438018a212ef35
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 script_parameters {"GFF_COLLECTION_LIST":"e9e6a6eaa6dca4d5014f04ec8152a02f+66/test_huTileSets.list","gffFile":"$(file $(GFF_COLLECTION_LIST))","task.foreach":"gffFile","CYTOBAND":"crunch_scripts/data/ucsc.cytoband.hg19.txt","command":["$(job.srcdir)/crunch_scripts/chopGffShim","$(file $(gffFile))","$(job.srcdir)/$(CYTOBAND)"]}
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 runtime_constraints {"max_tasks_per_node":0}
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 start level 0
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 status: 0 done, 0 running, 1 todo
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 0 job_task qr1hi-ot0gb-6ykttjrbmj3crcg
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 0 child 19975 started on localhost.1
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 status: 0 done, 1 running, 0 todo
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 0 stderr crunchstat: Running [stdbuf --output=0 --error=0 perl - /tmp/crunch-job-4010/src/crunch_scripts/run-command]
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 0 stderr crunchstat: reading stats from /sys/fs/cgroup/cpuacct///cpuacct.stat
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 0 stderr crunchstat: reading stats from /sys/fs/cgroup/blkio///blkio.io_service_bytes
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 0 stderr crunchstat: reading stats from /sys/fs/cgroup/cpuset///cpuset.cpus
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 0 stderr crunchstat: cpuset.cpus 1
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 0 child 19975 on localhost.1 exit 0 signal 0 success=
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 0 failure (#1, permanent) after 2 seconds
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 0 output
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 Every node has failed -- giving up on this round
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 wait for last 0 children to finish
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 status: 0 done, 0 running, 0 todo
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 Freeze not implemented
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 collate
usage: arv-put [-h] [--max-manifest-depth N] [--project-uuid UUID] [--name NAME] [--as-stream | --stream | --as-manifest | --in-manifest | --manifest | --as-raw | --raw] [--use-filename FILENAME] [--filename FILENAME] [--progress | --no-progress | --batch-progress] [--resume | --no-resume] [path [path ...]]
arv-put: error: unrecognized arguments: --portable-data-hash
system arv-put --portable-data-hash --filename ''qr1hi\-8i9sb\-gs3stt9v1c9grx2\.log\.txt \/tmp\/vLvzu6Tm3s failed: 512 at /usr/local/rvm/gems/ruby-2.1.1/gems/arvados-cli-0.1.20140829123712/bin/crunch-job line 1342, <DATA> line 1.
arv-run-pipeline-instance 19655: arv-crunch-job pid 19765 exit 512
Updated by Peter Amstutz over 9 years ago
This help text is confusing:
opt(:run_jobs_here, "Manage the pipeline instance in-process. Find/run/watch jobs until the pipeline finishes (or fails). Implies --run-pipeline-here.",
Should just say something like "Manage the pipeline instance in-process. Run jobs on the local system using arv-crunch-job."
Updated by Tom Clegg over 9 years ago
Peter Amstutz wrote:
This help text is confusing:
Indeed. Changed to: "Run jobs in the local terminal session instead of submitting them to Crunch. Implies --run-pipeline-here."
WRT the fact that running jobs locally often doesn't work even when this particular piece does what it's supposed to, I also added to this help text: "Note: this results in a significantly different job execution environment, and some Crunch features are not supported. It can be necessary to modify a pipeline in order to make it run this way."
(Hoping not to let this branch age more than necessary while waiting for crunch-job's part to be addressed.)
Updated by Brett Smith over 9 years ago
Reviewing 672df7e. All the UUIDs in this comment are dedicated test UUIDs and they're OK to be public.
Trying to run with a local pipeline template on shell.qr1hi, I got this crash:
~$ ruby arv-run-pipeline-instance --no-reuse --run-pipeline-here --template <filename> <parameters>
2014-09-16 19:08:30 +0000 -- pipeline_instance qr1hi-d1hrv-1fcln57dej64gjn
c1_grep qr1hi-8i9sb-asjt4xqllw4dw62 {:done=>0, :running=>0, :failed=>0, :todo=>1}
c2_hash - -
arv-run-pipeline-instance 3104: names: Test two components [Brett]
arv-run-pipeline-instance:368:in `fetch_template': undefined method `match' for nil:NilClass (NoMethodError)
from arv-run-pipeline-instance:604:in `block in run'
from arv-run-pipeline-instance:495:in `each'
from arv-run-pipeline-instance:495:in `run'
from arv-run-pipeline-instance:808:in `<main>'
I'm guessing that, in general, the "are we running locally?" checks are going to need some sprucing up for this to work.
The help text for --submit includes this bit: "Let the Crunch dispatch service to satisfy…" I think the "to" is extraneous.
Thanks.
Updated by Tom Clegg over 9 years ago
Brett Smith wrote:
Trying to run with a local pipeline template on shell.qr1hi, I got this crash:
Hm, seems like there's a spurious call to fetch_template there, which crashes if you use "--template" with a filename argument and don't have a [bogus?] UUID in your JSON file.
Presumably the name should come from the template actually being used, which is already in @template at this point, so I've removed the fetch_template call. In the case where the template doesn't have a (non-empty) name, I added a plan C: use the pipeline instance UUID instead of a pipeline template/instance name as a default component[:output_name].
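For readers following along, the fallback described above can be sketched roughly as below. This is only an illustration under assumptions: the helper name `default_output_name`, the order in which the instance and template names are tried, and the "Output of ..." wording are all hypothetical and are not taken from arv-run-pipeline-instance itself.

```ruby
# Hypothetical helper, not code from arv-run-pipeline-instance.
# `template` and `instance` are plain hashes standing in for the API records.
def default_output_name(component_name, template, instance)
  # Assumed order: try the instance name, then the template name,
  # accepting only non-empty strings.
  label = [instance[:name], template[:name]].find do |n|
    n.is_a?(String) && !n.empty?
  end
  # "Plan C": with no usable name, fall back to the pipeline instance UUID.
  label ||= instance[:uuid]
  "Output of #{component_name} of #{label}"
end
```

With a template loaded from a local file that has no name, a call such as `default_output_name("c1_grep", {}, { uuid: "qr1hi-d1hrv-1fcln57dej64gjn" })` would label the output with the instance UUID instead of crashing on a missing name.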
I'm guessing that, in general, the "are we running locally?" checks are going to need some sprucing up for this to work.
Do you mean in crunch-job, or arv-run-pipeline-instance? (afaict the above bug is unrelated to this change -- except that people are more likely to hit it if "run locally" is actually useful -- but this comment makes me wonder whether you see/suspect related bugs that I'm still not noticing...? crunch-job has definitely fallen behind in its ability to run jobs locally, if that's what you mean.)
The help text for --submit includes this bit: "Let the Crunch dispatch service to satisfy…" I think the "to" is extraneous.
Indeed, fixed.
Thanks.
Updated by Brett Smith over 9 years ago
Reviewing d35d434. Earlier disclaimer about UUIDs still applies.
Tom Clegg wrote:
Presumably the name should come from the template actually being used, which is already in @template at this point, so I've removed the fetch_template call. In the case where the template doesn't have a (non-empty) name, I added a plan C: use the pipeline instance UUID instead of a pipeline template/instance name as a default component[:output_name].
I think your new conditions need another andand after length. Otherwise, they check nil > 0 and crash with:
arv-run-pipeline-instance:600:in `block in run': undefined method `>' for nil:NilClass (NoMethodError)
from arv-run-pipeline-instance:495:in `each'
from arv-run-pipeline-instance:495:in `run'
from arv-run-pipeline-instance:808:in `<main>'
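The failure mode can be reproduced in isolation. The sketch below is not the code under review: it uses Ruby's built-in safe-navigation operator `&.` (Ruby 2.3 and later) as a stand-in for the andand gem, and the method names `guarded_once` and `guarded_twice` are made up for illustration.

```ruby
# Stand-in for a condition with a single nil guard: the guard protects
# the call to `length`, but the nil it yields is then compared with 0.
def guarded_once(name)
  name&.length > 0        # raises NoMethodError when name is nil
end

# Stand-in for the same condition with a second guard after `length`,
# so the comparison is skipped when the chain has already produced nil.
def guarded_twice(name)
  name&.length&.>(0)      # nil when name is nil, otherwise true or false
end
```

Since nil is falsy, `guarded_twice(nil)` behaves like "no usable name" in an `if` condition, which is the outcome the extra guard is meant to provide.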
(crunch-job has definitely fallen behind in its ability to run jobs locally, if that's what you mean.)
Yeah, that's all I meant. I realize it's not about your branch per se, just about dusting off cobwebs…
With andand added, I can now run locally with a filename --template, huzzah! But running with a UUID --template fails—there seems to be some problem propagating output from one component to the next. Here's all the output after the first component finishes:
2014-09-16 20:23:40 +0000 -- pipeline_instance qr1hi-d1hrv-09fiy7bcpue60fh
c1_grep qr1hi-8i9sb-wrbvm5gxjkx817v {:done=>2, :running=>2, :failed=>0, :todo=>0}
c2_hash - -
arv-run-pipeline-instance 2674: Creating collection {:owner_uuid=>"qr1hi-tpzed-5jakibnrp1qpty1", :name=>"Output d2b1b0a4 of c1_grep of Test two components [Brett] 2", :portable_data_hash=>"d2b1b0a48fce8ea595b1a99a9872709a+155", :manifest_text=>". 0f1d6bcf55c34bed7f92a805d2d89bbf+12+A... 0:12:alice.txt\n. d41d8cd98f00b204e9800998ecf8427e+0+A... 0:0:bob.txt\n. 8f3b36aff310e06f3c5b9e95678ff77a+12+A... 0:12:carol.txt\n"}
2014-09-16 20:23:51 +0000 -- pipeline_instance qr1hi-d1hrv-09fiy7bcpue60fh
c1_grep qr1hi-8i9sb-wrbvm5gxjkx817v d2b1b0a48fce8ea595b1a99a9872709a+155
c2_hash qr1hi-8i9sb-inbkem7rs9z47yw queued 2014-09-16T20:23:51Z
2014-09-16 20:24:02 +0000 -- pipeline_instance qr1hi-d1hrv-09fiy7bcpue60fh
c1_grep qr1hi-8i9sb-wrbvm5gxjkx817v d2b1b0a48fce8ea595b1a99a9872709a+155
c2_hash qr1hi-8i9sb-inbkem7rs9z47yw {:done=>0, :running=>0, :failed=>0, :todo=>1}
arv-run-pipeline-instance 2674: Could not find a collection with portable data hash
2014-09-16 20:24:12 +0000 -- pipeline_instance qr1hi-d1hrv-09fiy7bcpue60fh
c1_grep qr1hi-8i9sb-wrbvm5gxjkx817v d2b1b0a48fce8ea595b1a99a9872709a+155
c2_hash qr1hi-8i9sb-inbkem7rs9z47yw
Updated by Tom Clegg over 9 years ago
Brett Smith wrote:
Reviewing d35d434. Earlier disclaimer about UUIDs still applies.
Tom Clegg wrote:
Presumably the name should come from the template actually being used, which is already in @template at this point, so I've removed the fetch_template call. In the case where the template doesn't have a (non-empty) name, I added a plan C: use the pipeline instance UUID instead of a pipeline template/instance name as a default component[:output_name].
I think your new conditions need another andand after length. Otherwise, they check nil > 0 and crash with:
Indeed, sorry. Fixed both.
With andand added, I can now run locally with a filename --template, huzzah! But running with a UUID --template fails—there seems to be some problem propagating output from one component to the next. Here's all the output after the first component finishes:
I'm not sure exactly why your a-r-p-i decided to use a real job there (e.g., qualifying job was already running on server) but I suspect the "portable data hash" complaint comes from a race condition caused by crunch-job: at the end of the job, it sets success/running/finished_at attributes, then does some other work, then sets the output and log attributes. While it's doing the "other work", a-r-p-i notices that the job has finished and tries to do stuff with its output hash. I've addressed this by moving the "say you're finished" stuff in crunch-job down so success=true and finished_at=something actually indicates the job is finished (including saving output and log), not just nearly-finished.
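To make the ordering issue concrete, here is a toy model of the two write orders. It is a sketch under assumptions, not crunch-job code: `ToyJob`, `finish_before_fix`, `finish_after_fix` and the "(log locator)" placeholder are all hypothetical, and each `update` call merely stands for one attribute write that a polling client could observe.

```ruby
# Toy model: records the job state a polling client could see after each write.
class ToyJob
  attr_reader :snapshots

  def initialize
    @attrs = { success: nil, finished_at: nil, output: nil, log: nil }
    @snapshots = []
  end

  def update(changes)
    @attrs.merge!(changes)
    @snapshots << @attrs.dup
  end
end

# Order described as the cause of the race: "finished" is announced first,
# and the output and log are saved only afterwards.
def finish_before_fix(job)
  job.update(success: true, finished_at: Time.now)
  job.update(output: "d2b1b0a48fce8ea595b1a99a9872709a+155", log: "(log locator)")
end

# Order after the change: output and log are saved before "finished" is set.
def finish_after_fix(job)
  job.update(output: "d2b1b0a48fce8ea595b1a99a9872709a+155", log: "(log locator)")
  job.update(success: true, finished_at: Time.now)
end

# True if some observable state says "succeeded" while output is still missing.
def finished_without_output?(job)
  job.snapshots.any? { |s| s[:success] && s[:output].nil? }
end
```

In this model `finished_without_output?` holds for a job finished with the first order and not for the second, which is the window in which a client polling for completion could pick up a job with no output hash.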
Also fixed a bug that would cause local jobs to run again, needlessly, if a-r-p-i ended up going through its update loop again (e.g., there are cases when "moretodo" is true even though all jobs have finished -- a 10-second-wasting but otherwise harmless bug which I decided not to try to fix right now).
Now at 2da969c
Updated by Tom Clegg over 9 years ago
- Target version changed from 2014-09-17 sprint to Arvados Future Sprints
Updated by Tom Clegg over 9 years ago
- Target version changed from Arvados Future Sprints to 2014-10-08 sprint
Updated by Brett Smith over 9 years ago
- Target version changed from 2014-10-08 sprint to 2014-09-17 sprint
Updated by Anonymous over 9 years ago
- Status changed from In Progress to Resolved
- % Done changed from 50 to 100
Applied in changeset arvados|commit:dce0ccabe3d9fab6943e89dc84050793cca5b553.