Idea #3550
closed [SDKs] arv-run-pipeline-instance supports running jobs locally using arv-crunch-job
Updated by Ward Vandewege over 10 years ago
- Target version set to Arvados Future Sprints
Updated by Ward Vandewege over 10 years ago
- Target version changed from Arvados Future Sprints to 2014-09-17 sprint
Updated by Tom Clegg over 10 years ago
- Category set to SDKs
- Assigned To set to Tom Clegg
Updated by Tom Clegg over 10 years ago
Still some bugs to work out (perhaps arv-crunch-job has regressed). On lightning-dev2:
pipeline template: qr1hi-p5p6p-ya3t583ormtx53j
/tmp/arv-run-pipeline-instance --run-jobs-here --template qr1hi-p5p6p-ya3t583ormtx53j ChopGFF::CYTOBAND=crunch_scripts/data/ucsc.cytoband.hg19.txt ChopGFF::GFF_COLLECTION_LIST=e9e6a6eaa6dca4d5014f04ec8152a02f+66/test_huTileSets.list CreateTileSetFromChoppedGFF::CYTOBAND=crunch_scripts/data/ucsc.cytoband.hg19.txt CreateTileSetFromChoppedGFF::SEED=12345678 CreateTileSetFromChoppedGFF::REFFJ=bcc2937114336754892572e5751974d0+76578 CreateTileSetFromChoppedGFF::HG19FA=fee29077095fed2e695100c299f11dc5+2727
2014-08-29 20:18:15 +0000 -- pipeline_instance qr1hi-d1hrv-41ajivmwe9q1tnj
ChopGFF qr1hi-8i9sb-gs3stt9v1c9grx2 starting
CreateChoppedGFFDirList - -
CreateTileSetFromChoppedGFF - -
arv-run-pipeline-instance 19655: arv-crunch-job pid 19765 started
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 check slurm allocation
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 node localhost - 1 slots
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 start
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 Install revision 055c0532030c4d95c2767a6eb3438018a212ef35
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 Clean-work-dir exited 0
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 Install exited 0
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 script run-command
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 script_version 055c0532030c4d95c2767a6eb3438018a212ef35
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 script_parameters {"GFF_COLLECTION_LIST":"e9e6a6eaa6dca4d5014f04ec8152a02f+66/test_huTileSets.list","gffFile":"$(file $(GFF_COLLECTION_LIST))","task.foreach":"gffFile","CYTOBAND":"crunch_scripts/data/ucsc.cytoband.hg19.txt","command":["$(job.srcdir)/crunch_scripts/chopGffShim","$(file $(gffFile))","$(job.srcdir)/$(CYTOBAND)"]}
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 runtime_constraints {"max_tasks_per_node":0}
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 start level 0
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 status: 0 done, 0 running, 1 todo
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 0 job_task qr1hi-ot0gb-6ykttjrbmj3crcg
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 0 child 19975 started on localhost.1
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 status: 0 done, 1 running, 0 todo
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 0 stderr crunchstat: Running [stdbuf --output=0 --error=0 perl - /tmp/crunch-job-4010/src/crunch_scripts/run-command]
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 0 stderr crunchstat: reading stats from /sys/fs/cgroup/cpuacct///cpuacct.stat
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 0 stderr crunchstat: reading stats from /sys/fs/cgroup/blkio///blkio.io_service_bytes
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 0 stderr crunchstat: reading stats from /sys/fs/cgroup/cpuset///cpuset.cpus
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 0 stderr crunchstat: cpuset.cpus 1
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 0 child 19975 on localhost.1 exit 0 signal 0 success=
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 0 failure (#1, permanent) after 2 seconds
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 0 output
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 Every node has failed -- giving up on this round
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 wait for last 0 children to finish
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 status: 0 done, 0 running, 0 todo
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 Freeze not implemented
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 collate
usage: arv-put [-h] [--max-manifest-depth N] [--project-uuid UUID] [--name NAME] [--as-stream | --stream | --as-manifest | --in-manifest | --manifest | --as-raw | --raw] [--use-filename FILENAME] [--filename FILENAME] [--progress | --no-progress | --batch-progress] [--resume | --no-resume] [path [path ...]]
arv-put: error: unrecognized arguments: --portable-data-hash
system arv-put --portable-data-hash --filename ''qr1hi\-8i9sb\-gs3stt9v1c9grx2\.log\.txt \/tmp\/vLvzu6Tm3s failed: 512 at /usr/local/rvm/gems/ruby-2.1.1/gems/arvados-cli-0.1.20140829123712/bin/crunch-job line 1342, <DATA> line 1.
arv-run-pipeline-instance 19655: arv-crunch-job pid 19765 exit 512
Updated by Peter Amstutz over 10 years ago
This help text is confusing:
opt(:run_jobs_here, "Manage the pipeline instance in-process. Find/run/watch jobs until the pipeline finishes (or fails). Implies --run-pipeline-here.",
Should just say something like "Manage the pipeline instance in-process. Run jobs on the local system using arv-crunch-job."
Updated by Tom Clegg over 10 years ago
Peter Amstutz wrote:
This help text is confusing:
Indeed. Changed to: "Run jobs in the local terminal session instead of submitting them to Crunch. Implies --run-pipeline-here."
WRT the fact that running jobs locally often doesn't work even when this particular piece does what it's supposed to, I also added to this help text: "Note: this results in a significantly different job execution environment, and some Crunch features are not supported. It can be necessary to modify a pipeline in order to make it run this way."
(Hoping not to let this branch age more than necessary while waiting for crunch-job's part to be addressed.)
Updated by Brett Smith over 10 years ago
Reviewing 672df7e. All the UUIDs in this comment are dedicated test UUIDs and they're OK to be public.
Trying to run with a local pipeline template on shell.qr1hi, I got this crash:
~$ ruby arv-run-pipeline-instance --no-reuse --run-pipeline-here --template <filename> <parameters>
2014-09-16 19:08:30 +0000 -- pipeline_instance qr1hi-d1hrv-1fcln57dej64gjn
c1_grep qr1hi-8i9sb-asjt4xqllw4dw62 {:done=>0, :running=>0, :failed=>0, :todo=>1}
c2_hash - -
arv-run-pipeline-instance 3104: names: Test two components [Brett]
arv-run-pipeline-instance:368:in `fetch_template': undefined method `match' for nil:NilClass (NoMethodError)
from arv-run-pipeline-instance:604:in `block in run'
from arv-run-pipeline-instance:495:in `each'
from arv-run-pipeline-instance:495:in `run'
from arv-run-pipeline-instance:808:in `<main>'
I'm guessing that, in general, the "are we running locally?" checks are going to need some sprucing up for this to work.
The help text for --submit includes this bit: "Let the Crunch dispatch service to satisfy…" I think the "to" is extraneous.
Thanks.
Updated by Tom Clegg over 10 years ago
Brett Smith wrote:
Trying to run with a local pipeline template on shell.qr1hi, I got this crash:
Hm, seems like there's a spurious call to fetch_template there, which crashes if you use "--template" with a filename argument and don't have a [bogus?] UUID in your JSON file.
Presumably the name should come from the template actually being used, which is already in @template at this point, so I've removed the fetch_template call. In the case where the template doesn't have a (non-empty) name, I added a plan C: use the pipeline instance UUID instead of a pipeline template/instance name as a default component[:output_name].
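The fallback described above can be sketched roughly as follows. This is an illustration only, not the code from the branch: the helper names present_name and default_output_base are hypothetical, and the order in which the template name and instance name are tried is an assumption. Only the last resort (the pipeline instance UUID) comes from the comment.

```ruby
# Hypothetical helpers, not the actual arv-run-pipeline-instance code.

# Returns the name as a string, or nil when it is missing or blank.
def present_name(value)
  s = value.to_s.strip
  s.empty? ? nil : s
end

# Picks a base for the default output name:
#   plan A: template name (assumed order)
#   plan B: pipeline instance name (assumed order)
#   plan C: pipeline instance UUID (from the comment above)
def default_output_base(template, instance)
  present_name(template[:name]) ||
    present_name(instance[:name]) ||
    instance[:uuid]
end

# A template loaded from a local file may have no name at all.
puts default_output_base({ name: "Test two components [Brett]" },
                         { uuid: "qr1hi-d1hrv-1fcln57dej64gjn" })
puts default_output_base({}, { name: "", uuid: "qr1hi-d1hrv-1fcln57dej64gjn" })
```

Normalizing with to_s before testing for emptiness means a nil name and an empty name take the same path, so no comparison is ever made against nil.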
I'm guessing that, in general, the "are we running locally?" checks are going to need some sprucing up for this to work.
Do you mean in crunch-job, or arv-run-pipeline-instance? (afaict the above bug is unrelated to this change -- except that people are more likely to hit it if "run locally" is actually useful -- but this comment makes me wonder whether you see/suspect related bugs that I'm still not noticing...? crunch-job has definitely fallen behind in its ability to run jobs locally, if that's what you mean.)
The help text for --submit includes this bit: "Let the Crunch dispatch service to satisfy…" I think the "to" is extraneous.
Indeed, fixed.
Thanks.
Updated by Brett Smith over 10 years ago
Reviewing d35d434. Earlier disclaimer about UUIDs still applies.
Tom Clegg wrote:
Presumably the name should come from the template actually being used, which is already in @template at this point, so I've removed the fetch_template call. In the case where the template doesn't have a (non-empty) name, I added a plan C: use the pipeline instance UUID instead of a pipeline template/instance name as a default component[:output_name].
I think your new conditions need another andand after length. Otherwise, they check nil > 0 and crash with:
arv-run-pipeline-instance:600:in `block in run': undefined method `>' for nil:NilClass (NoMethodError)
from arv-run-pipeline-instance:495:in `each'
from arv-run-pipeline-instance:495:in `run'
from arv-run-pipeline-instance:808:in `<main>'
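For illustration, here is a small sketch of why the unguarded comparison raises and what a nil-safe version can look like. It does not use the andand gem from the branch. It swaps in Ruby's built-in safe navigation operator (&.), which needs Ruby 2.3 or later and so would not have been available on the Ruby 2.1.1 install shown in the earlier log. The method name named? is hypothetical.

```ruby
# Hypothetical helper: true only for a non-empty name, and safe for nil.
def named?(name)
  length = name&.length          # nil when name is nil, otherwise an Integer
  !length.nil? && length > 0     # never compares nil with 0
end

# The unguarded form: a nil length is compared with 0 and raises.
begin
  length = nil
  result = length > 0
  puts "unexpectedly got #{result.inspect}"
rescue NoMethodError => e
  puts "unguarded check raised #{e.class}"
end

puts named?(nil)
puts named?("")
puts named?("Test two components [Brett]")
```

The point is the same as in the review comment: every step in the chain that can return nil needs its own guard, not just the first one.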
(crunch-job has definitely fallen behind in its ability to run jobs locally, if that's what you mean.)
Yeah, that's all I meant. I realize it's not about your branch per se, just about dusting off cobwebs…
With andand added, I can now run locally with a filename --template, huzzah! But running with a UUID --template fails: there seems to be some problem propagating output from one component to the next. Here's all the output after the first component finishes:
2014-09-16 20:23:40 +0000 -- pipeline_instance qr1hi-d1hrv-09fiy7bcpue60fh
c1_grep qr1hi-8i9sb-wrbvm5gxjkx817v {:done=>2, :running=>2, :failed=>0, :todo=>0}
c2_hash - -
arv-run-pipeline-instance 2674: Creating collection {:owner_uuid=>"qr1hi-tpzed-5jakibnrp1qpty1", :name=>"Output d2b1b0a4 of c1_grep of Test two components [Brett] 2", :portable_data_hash=>"d2b1b0a48fce8ea595b1a99a9872709a+155", :manifest_text=>". 0f1d6bcf55c34bed7f92a805d2d89bbf+12+A... 0:12:alice.txt\n. d41d8cd98f00b204e9800998ecf8427e+0+A... 0:0:bob.txt\n. 8f3b36aff310e06f3c5b9e95678ff77a+12+A... 0:12:carol.txt\n"}
2014-09-16 20:23:51 +0000 -- pipeline_instance qr1hi-d1hrv-09fiy7bcpue60fh
c1_grep qr1hi-8i9sb-wrbvm5gxjkx817v d2b1b0a48fce8ea595b1a99a9872709a+155
c2_hash qr1hi-8i9sb-inbkem7rs9z47yw queued 2014-09-16T20:23:51Z
2014-09-16 20:24:02 +0000 -- pipeline_instance qr1hi-d1hrv-09fiy7bcpue60fh
c1_grep qr1hi-8i9sb-wrbvm5gxjkx817v d2b1b0a48fce8ea595b1a99a9872709a+155
c2_hash qr1hi-8i9sb-inbkem7rs9z47yw {:done=>0, :running=>0, :failed=>0, :todo=>1}
arv-run-pipeline-instance 2674: Could not find a collection with portable data hash
2014-09-16 20:24:12 +0000 -- pipeline_instance qr1hi-d1hrv-09fiy7bcpue60fh
c1_grep qr1hi-8i9sb-wrbvm5gxjkx817v d2b1b0a48fce8ea595b1a99a9872709a+155
c2_hash qr1hi-8i9sb-inbkem7rs9z47yw
Updated by Tom Clegg over 10 years ago
Brett Smith wrote:
Reviewing d35d434. Earlier disclaimer about UUIDs still applies.
Tom Clegg wrote:
Presumably the name should come from the template actually being used, which is already in @template at this point, so I've removed the fetch_template call. In the case where the template doesn't have a (non-empty) name, I added a plan C: use the pipeline instance UUID instead of a pipeline template/instance name as a default component[:output_name].
I think your new conditions need another andand after length. Otherwise, they check nil > 0 and crash with:
Indeed, sorry. Fixed both.
With andand added, I can now run locally with a filename --template, huzzah! But running with a UUID --template fails: there seems to be some problem propagating output from one component to the next. Here's all the output after the first component finishes:
I'm not sure exactly why your a-r-p-i decided to use a real job there (e.g., qualifying job was already running on server) but I suspect the "portable data hash" complaint comes from a race condition caused by crunch-job: at the end of the job, it sets success/running/finished_at attributes, then does some other work, then sets the output and log attributes. While it's doing the "other work", a-r-p-i notices that the job has finished and tries to do stuff with its output hash. I've addressed this by moving the "say you're finished" stuff in crunch-job down so success=true and finished_at=something actually indicates the job is finished (including saving output and log), not just nearly-finished.
Also fixed a bug that would cause local jobs to run again, needlessly, if a-r-p-i ended up going through its update loop again (e.g., there are cases when "moretodo" is true even though all jobs have finished -- a 10-second-wasting but otherwise harmless bug which I decided not to try to fix right now).
Now at 2da969c
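The ordering fix for the race described above can be modelled with a small in-memory sketch. Everything here is hypothetical: FakeJob stands in for a job record and finish_job for the end-of-job bookkeeping, and none of it is the real crunch-job code (which, per the log above, lives in bin/crunch-job). The idea it shows is the one from the comment: write output and log first, and only then set success and finished_at, so a watcher that sees finished_at can rely on output being present.

```ruby
# Hypothetical stand-in for a job record; records every state a poller
# could have observed between updates.
class FakeJob
  attr_reader :attrs, :history

  def initialize
    @attrs = { running: true, success: nil, finished_at: nil, output: nil, log: nil }
    @history = [@attrs.dup]
  end

  def update(changes)
    @attrs = @attrs.merge(changes)
    @history << @attrs.dup
  end
end

# Save output and log first; announce completion last.
def finish_job(job, output:, log:, now: Time.now)
  job.update(output: output, log: log)
  job.update(running: false, success: true, finished_at: now)
end

job = FakeJob.new
finish_job(job,
           output: "d2b1b0a48fce8ea595b1a99a9872709a+155",
           log: "hypothetical-log-locator")

# With this ordering, no observable state has finished_at set but output missing.
inconsistent = job.history.select { |s| s[:finished_at] && s[:output].nil? }
puts "states with finished_at but no output: #{inconsistent.length}"
```

Swapping the two update calls in finish_job reproduces the window in which a poller sees a finished job with no output hash.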
Updated by Tom Clegg over 10 years ago
- Target version changed from 2014-09-17 sprint to Arvados Future Sprints
Updated by Tom Clegg over 10 years ago
- Target version changed from Arvados Future Sprints to 2014-10-08 sprint
Updated by Brett Smith over 10 years ago
- Target version changed from 2014-10-08 sprint to 2014-09-17 sprint
Updated by Anonymous over 10 years ago
- Status changed from In Progress to Resolved
- % Done changed from 50 to 100
Applied in changeset arvados|commit:dce0ccabe3d9fab6943e89dc84050793cca5b553.