Story #3550

[SDKs] arv-run-pipeline-instance supports running jobs locally using arv-crunch-job

Added by Tom Clegg almost 7 years ago. Updated over 6 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: SDKs
Target version:
Start date: 08/29/2014
Due date:
% Done: 100%
Estimated time: (Total: 2.00 h)
Story points: 1.0

Subtasks

Task #3772: Review 3550-local-pipeline (Resolved, Tom Clegg)

Task #3771: Add --run-jobs-here flag (Resolved, Tom Clegg)

Associated revisions

Revision dce0ccab
Added by Tom Clegg over 6 years ago

Merge branch '3550-local-pipeline' closes #3550

Revision a9f3e9ce (diff)
Added by Tom Clegg over 6 years ago

Update Gemfiles to use latest arvados gem. refs #3550

Revision f5fd953b (diff)
Added by Tom Clegg over 6 years ago

Use new --run-pipeline-here instead of --run-here flag, which no longer does what crunch-dispatch wants. refs #3550

Revision fe59fe52 (diff)
Added by Tom Clegg over 6 years ago

Fix Gemfile.lock to use a real gem, not a dev build. refs #3550

History

#1 Updated by Ward Vandewege over 6 years ago

  • Target version set to Arvados Future Sprints

#2 Updated by Ward Vandewege over 6 years ago

  • Target version changed from Arvados Future Sprints to 2014-09-17 sprint

#3 Updated by Tom Clegg over 6 years ago

  • Category set to SDKs
  • Assigned To set to Tom Clegg

#4 Updated by Tom Clegg over 6 years ago

  • Status changed from New to In Progress

#5 Updated by Tom Clegg over 6 years ago

Still some bugs to work out (perhaps arv-crunch-job has regressed). On lightning-dev2:

pipeline template: qr1hi-p5p6p-ya3t583ormtx53j
/tmp/arv-run-pipeline-instance --run-jobs-here --template qr1hi-p5p6p-ya3t583ormtx53j ChopGFF::CYTOBAND=crunch_scripts/data/ucsc.cytoband.hg19.txt ChopGFF::GFF_COLLECTION_LIST=e9e6a6eaa6dca4d5014f04ec8152a02f+66/test_huTileSets.list CreateTileSetFromChoppedGFF::CYTOBAND=crunch_scripts/data/ucsc.cytoband.hg19.txt CreateTileSetFromChoppedGFF::SEED=12345678 CreateTileSetFromChoppedGFF::REFFJ=bcc2937114336754892572e5751974d0+76578 CreateTileSetFromChoppedGFF::HG19FA=fee29077095fed2e695100c299f11dc5+2727

2014-08-29 20:18:15 +0000 -- pipeline_instance qr1hi-d1hrv-41ajivmwe9q1tnj
ChopGFF                     qr1hi-8i9sb-gs3stt9v1c9grx2 starting
CreateChoppedGFFDirList     -                           -
CreateTileSetFromChoppedGFF -                           -
arv-run-pipeline-instance 19655: arv-crunch-job pid 19765 started
qr1hi-8i9sb-gs3stt9v1c9grx2 19765  check slurm allocation
qr1hi-8i9sb-gs3stt9v1c9grx2 19765  node localhost - 1 slots
qr1hi-8i9sb-gs3stt9v1c9grx2 19765  start
qr1hi-8i9sb-gs3stt9v1c9grx2 19765  Install revision 055c0532030c4d95c2767a6eb3438018a212ef35
qr1hi-8i9sb-gs3stt9v1c9grx2 19765  Clean-work-dir exited 0
qr1hi-8i9sb-gs3stt9v1c9grx2 19765  Install exited 0
qr1hi-8i9sb-gs3stt9v1c9grx2 19765  script run-command
qr1hi-8i9sb-gs3stt9v1c9grx2 19765  script_version 055c0532030c4d95c2767a6eb3438018a212ef35
qr1hi-8i9sb-gs3stt9v1c9grx2 19765  script_parameters {"GFF_COLLECTION_LIST":"e9e6a6eaa6dca4d5014f04ec8152a02f+66/test_huTileSets.list","gffFile":"$(file $(GFF_COLLECTION_LIST))","task.foreach":"gffFile","CYTOBAND":"crunch_scripts/data/ucsc.cytoband.hg19.txt","command":["$(job.srcdir)/crunch_scripts/chopGffShim","$(file $(gffFile))","$(job.srcdir)/$(CYTOBAND)"]}
qr1hi-8i9sb-gs3stt9v1c9grx2 19765  runtime_constraints {"max_tasks_per_node":0}
qr1hi-8i9sb-gs3stt9v1c9grx2 19765  start level 0
qr1hi-8i9sb-gs3stt9v1c9grx2 19765  status: 0 done, 0 running, 1 todo
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 0 job_task qr1hi-ot0gb-6ykttjrbmj3crcg
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 0 child 19975 started on localhost.1
qr1hi-8i9sb-gs3stt9v1c9grx2 19765  status: 0 done, 1 running, 0 todo
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 0 stderr crunchstat: Running [stdbuf --output=0 --error=0 perl - /tmp/crunch-job-4010/src/crunch_scripts/run-command]
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 0 stderr crunchstat: reading stats from /sys/fs/cgroup/cpuacct///cpuacct.stat
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 0 stderr crunchstat: reading stats from /sys/fs/cgroup/blkio///blkio.io_service_bytes
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 0 stderr crunchstat: reading stats from /sys/fs/cgroup/cpuset///cpuset.cpus
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 0 stderr crunchstat: cpuset.cpus 1
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 0 child 19975 on localhost.1 exit 0 signal 0 success=
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 0 failure (#1, permanent) after 2 seconds
qr1hi-8i9sb-gs3stt9v1c9grx2 19765 0 output
qr1hi-8i9sb-gs3stt9v1c9grx2 19765  Every node has failed -- giving up on this round
qr1hi-8i9sb-gs3stt9v1c9grx2 19765  wait for last 0 children to finish
qr1hi-8i9sb-gs3stt9v1c9grx2 19765  status: 0 done, 0 running, 0 todo
qr1hi-8i9sb-gs3stt9v1c9grx2 19765  Freeze not implemented
qr1hi-8i9sb-gs3stt9v1c9grx2 19765  collate
usage: arv-put [-h] [--max-manifest-depth N] [--project-uuid UUID]
               [--name NAME]
               [--as-stream | --stream | --as-manifest | --in-manifest | --manifest | --as-raw | --raw]
               [--use-filename FILENAME] [--filename FILENAME]
               [--progress | --no-progress | --batch-progress]
               [--resume | --no-resume]
               [path [path ...]]
arv-put: error: unrecognized arguments: --portable-data-hash
system arv-put --portable-data-hash --filename ''qr1hi\-8i9sb\-gs3stt9v1c9grx2\.log\.txt \/tmp\/vLvzu6Tm3s failed: 512 at /usr/local/rvm/gems/ruby-2.1.1/gems/arvados-cli-0.1.20140829123712/bin/crunch-job line 1342, <DATA> line 1.
arv-run-pipeline-instance 19655: arv-crunch-job pid 19765 exit 512

#6 Updated by Peter Amstutz over 6 years ago

This help text is confusing:

  opt(:run_jobs_here,
      "Manage the pipeline instance in-process. Find/run/watch jobs until the pipeline finishes (or fails). Implies --run-pipeline-here.",

Should just say something like "Manage the pipeline instance in-process. Run jobs on the local system using arv-crunch-job."

#7 Updated by Tom Clegg over 6 years ago

Peter Amstutz wrote:

This help text is confusing:

Indeed. Changed to: "Run jobs in the local terminal session instead of submitting them to Crunch. Implies --run-pipeline-here."

WRT the fact that running jobs locally often doesn't work even when this particular piece does what it's supposed to, I also added to this help text: "Note: this results in a significantly different job execution environment, and some Crunch features are not supported. It can be necessary to modify a pipeline in order to make it run this way."

(Hoping not to let this branch age more than necessary while waiting for crunch-job's part to be addressed.)

#8 Updated by Brett Smith over 6 years ago

Reviewing 672df7e. All the UUIDs in this comment are dedicated test UUIDs and they're OK to be public.

Trying to run with a local pipeline template on shell.qr1hi, I got this crash:

~$ ruby arv-run-pipeline-instance --no-reuse --run-pipeline-here --template <filename> <parameters>

2014-09-16 19:08:30 +0000 -- pipeline_instance qr1hi-d1hrv-1fcln57dej64gjn
c1_grep qr1hi-8i9sb-asjt4xqllw4dw62 {:done=>0, :running=>0, :failed=>0, :todo=>1}
c2_hash -                           -
arv-run-pipeline-instance 3104: names:  Test two components [Brett]
arv-run-pipeline-instance:368:in `fetch_template': undefined method `match' for nil:NilClass (NoMethodError)
        from arv-run-pipeline-instance:604:in `block in run'
        from arv-run-pipeline-instance:495:in `each'
        from arv-run-pipeline-instance:495:in `run'
        from arv-run-pipeline-instance:808:in `<main>'

I'm guessing that, in general, the "are we running locally?" checks are going to need some sprucing up for this to work.

The help text for --submit includes this bit: "Let the Crunch dispatch service to satisfy…" I think the "to" is extraneous.

Thanks.

#9 Updated by Tom Clegg over 6 years ago

Brett Smith wrote:

Trying to run with a local pipeline template on shell.qr1hi, I got this crash:

Hm, seems like there's a spurious call to fetch_template there, which crashes if you use "--template" with a filename argument and don't have a [bogus?] UUID in your JSON file.

Presumably the name should come from the template actually being used, which is already in @template at this point, so I've removed the fetch_template call. In the case where the template doesn't have a (non-empty) name, I added a plan C: use the pipeline instance UUID instead of a pipeline template/instance name as a default component[:output_name].
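The fallback order described here can be sketched as follows. This is a hypothetical illustration, not the actual arv-run-pipeline-instance code; the helper name and hash shapes are invented for the example.

```ruby
# Sketch of the output-name fallback: prefer the template's own name
# (plan B, after dropping the fetch_template call), and fall back to the
# pipeline instance UUID (plan C) when the template has no non-empty name.
def default_output_name(template, instance_uuid, component_name)
  base = if template && template[:name] && !template[:name].empty?
           template[:name]
         else
           instance_uuid
         end
  "Output of #{component_name} of #{base}"
end
```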

I'm guessing that, in general, the "are we running locally?" checks are going to need some sprucing up for this to work.

Do you mean in crunch-job, or arv-run-pipeline-instance? (afaict the above bug is unrelated to this change -- except that people are more likely to hit it if "run locally" is actually useful -- but this comment makes me wonder whether you see/suspect related bugs that I'm still not noticing...? crunch-job has definitely fallen behind in its ability to run jobs locally, if that's what you mean.)

The help text for --submit includes this bit: "Let the Crunch dispatch service to satisfy…" I think the "to" is extraneous.

Indeed, fixed.

Thanks.

#10 Updated by Brett Smith over 6 years ago

Reviewing d35d434. Earlier disclaimer about UUIDs still applies.

Tom Clegg wrote:

Presumably the name should come from the template actually being used, which is already in @template at this point, so I've removed the fetch_template call. In the case where the template doesn't have a (non-empty) name, I added a plan C: use the pipeline instance UUID instead of a pipeline template/instance name as a default component[:output_name].

I think your new conditions need another andand after length. Otherwise, they check nil > 0 and crash with:

arv-run-pipeline-instance:600:in `block in run': undefined method `>' for nil:NilClass (NoMethodError)
        from arv-run-pipeline-instance:495:in `each'
        from arv-run-pipeline-instance:495:in `run'
        from arv-run-pipeline-instance:808:in `<main>'
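The nil-safety problem Brett describes can be illustrated with Ruby's built-in safe navigation operator (`&.`), which behaves like the andand gem's `.andand` for this case. The helper below is a stand-in, not the branch's actual code: one guard makes `s&.length` return nil for a nil string, but `nil > 0` still raises NoMethodError, so the comparison needs its own guard too.

```ruby
# One guard: s&.length is nil-safe, but the subsequent `> 0` is not.
# Two guards: the comparison itself is skipped when length is nil,
# yielding nil (falsy) instead of crashing.
def positive_length?(s)
  !!(s&.length&.>(0))
end
```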

(crunch-job has definitely fallen behind in its ability to run jobs locally, if that's what you mean.)

Yeah, that's all I meant. I realize it's not about your branch per se, just about dusting off cobwebs…

With andand added, I can now run locally with a filename --template, huzzah! But running with a UUID --template fails—there seems to be some problem propagating output from one component to the next. Here's all the output after the first component finishes:

2014-09-16 20:23:40 +0000 -- pipeline_instance qr1hi-d1hrv-09fiy7bcpue60fh
c1_grep qr1hi-8i9sb-wrbvm5gxjkx817v {:done=>2, :running=>2, :failed=>0, :todo=>0}
c2_hash -                           -
arv-run-pipeline-instance 2674: Creating collection {:owner_uuid=>"qr1hi-tpzed-5jakibnrp1qpty1", :name=>"Output d2b1b0a4 of c1_grep of Test two components [Brett] 2", :portable_data_hash=>"d2b1b0a48fce8ea595b1a99a9872709a+155", :manifest_text=>". 0f1d6bcf55c34bed7f92a805d2d89bbf+12+A... 0:12:alice.txt\n. d41d8cd98f00b204e9800998ecf8427e+0+A... 0:0:bob.txt\n. 8f3b36aff310e06f3c5b9e95678ff77a+12+A... 0:12:carol.txt\n"}

2014-09-16 20:23:51 +0000 -- pipeline_instance qr1hi-d1hrv-09fiy7bcpue60fh
c1_grep qr1hi-8i9sb-wrbvm5gxjkx817v d2b1b0a48fce8ea595b1a99a9872709a+155
c2_hash qr1hi-8i9sb-inbkem7rs9z47yw queued 2014-09-16T20:23:51Z

2014-09-16 20:24:02 +0000 -- pipeline_instance qr1hi-d1hrv-09fiy7bcpue60fh
c1_grep qr1hi-8i9sb-wrbvm5gxjkx817v d2b1b0a48fce8ea595b1a99a9872709a+155
c2_hash qr1hi-8i9sb-inbkem7rs9z47yw {:done=>0, :running=>0, :failed=>0, :todo=>1}
arv-run-pipeline-instance 2674: Could not find a collection with portable data hash

2014-09-16 20:24:12 +0000 -- pipeline_instance qr1hi-d1hrv-09fiy7bcpue60fh
c1_grep qr1hi-8i9sb-wrbvm5gxjkx817v d2b1b0a48fce8ea595b1a99a9872709a+155
c2_hash qr1hi-8i9sb-inbkem7rs9z47yw

#11 Updated by Tom Clegg over 6 years ago

Brett Smith wrote:

Reviewing d35d434. Earlier disclaimer about UUIDs still applies.

Tom Clegg wrote:

Presumably the name should come from the template actually being used, which is already in @template at this point, so I've removed the fetch_template call. In the case where the template doesn't have a (non-empty) name, I added a plan C: use the pipeline instance UUID instead of a pipeline template/instance name as a default component[:output_name].

I think your new conditions need another andand after length. Otherwise, they check nil > 0 and crash with:

Indeed, sorry. Fixed both.

With andand added, I can now run locally with a filename --template, huzzah! But running with a UUID --template fails—there seems to be some problem propagating output from one component to the next. Here's all the output after the first component finishes:

I'm not sure exactly why your a-r-p-i decided to use a real job there (e.g., qualifying job was already running on server) but I suspect the "portable data hash" complaint comes from a race condition caused by crunch-job: at the end of the job, it sets success/running/finished_at attributes, then does some other work, then sets the output and log attributes. While it's doing the "other work", a-r-p-i notices that the job has finished and tries to do stuff with its output hash. I've addressed this by moving the "say you're finished" stuff in crunch-job down so success=true and finished_at=something actually indicates the job is finished (including saving output and log), not just nearly-finished.
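The ordering fix described here can be sketched in miniature. This is a hypothetical stand-in, not real crunch-job code: the point is that a poller treating success/finished_at as "job done" must only ever observe those fields after output and log have already been saved.

```ruby
# Write output and log attributes first, then flip the "finished" flags
# last, so a concurrent watcher that sees success=true and finished_at set
# is guaranteed to also see a populated output attribute.
def finalize_job(job, output_hash, log_hash)
  job[:output] = output_hash
  job[:log] = log_hash
  # Only now does the job advertise itself as finished.
  job[:success] = true
  job[:finished_at] = Time.now.utc
  job
end
```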

Also fixed a bug that would cause local jobs to run again, needlessly, if a-r-p-i ended up going through its update loop again (e.g., there are cases when "moretodo" is true even though all jobs have finished -- a 10-second-wasting but otherwise harmless bug which I decided not to try to fix right now).

Now at 2da969c

#12 Updated by Tom Clegg over 6 years ago

  • Target version changed from 2014-09-17 sprint to Arvados Future Sprints

#13 Updated by Tom Clegg over 6 years ago

  • Target version changed from Arvados Future Sprints to 2014-10-08 sprint

#14 Updated by Brett Smith over 6 years ago

  • Target version changed from 2014-10-08 sprint to 2014-09-17 sprint

Tom Clegg wrote:

Now at 2da969c

I think this is good to merge. Thanks.

#15 Updated by Anonymous over 6 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 50 to 100

Applied in changeset arvados|commit:dce0ccabe3d9fab6943e89dc84050793cca5b553.
