Feature #10081

[CWL] Run several steps in single job

Added by Peter Amstutz over 4 years ago. Updated over 4 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Start date: 09/16/2016
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -

Description

Add workflow hint "arv:RunInSingleContainer" which uses cwltool to run a subworkflow as a single job in order to amortize the overhead of spinning up new jobs.
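
As a rough illustration of what a subworkflow carrying this hint might look like, the sketch below builds one as a Python dictionary and prints it as JSON, which CWL tooling accepts as well as YAML. The `arv` namespace URI, the `cwlVersion`, and the step and tool file names are assumptions made for illustration; they are not taken from this ticket's implementation.

```python
import json

# Hypothetical subworkflow. The namespace URI, cwlVersion, and the
# step/tool names are illustrative assumptions, not the ticket's code.
subworkflow = {
    "cwlVersion": "v1.0",
    "class": "Workflow",
    "$namespaces": {"arv": "http://arvados.org/cwl#"},
    # The hint asks the runner to execute every step below in one job.
    "hints": [{"class": "arv:RunInSingleContainer"}],
    "inputs": {"sample": "File"},
    "outputs": {
        "result": {"type": "File", "outputSource": "step2/out"},
    },
    "steps": {
        "step1": {"run": "tool1.cwl", "in": {"inp": "sample"}, "out": ["out"]},
        "step2": {"run": "tool2.cwl", "in": {"inp": "step1/out"}, "out": ["out"]},
    },
}


def has_single_container_hint(doc):
    """Return True if the document lists the RunInSingleContainer hint."""
    return any(h.get("class") == "arv:RunInSingleContainer"
               for h in doc.get("hints", []))


print(json.dumps(subworkflow, indent=2))
```

The helper `has_single_container_hint` is likewise hypothetical; it only shows how a runner could detect the hint before deciding how to dispatch the steps.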


Subtasks

Task #10086: Support RunInSingleContainer hint (Resolved, Peter Amstutz)

Task #10087: Review 10081-cwl-run-same-job (Resolved, Radhika Chippada)

Associated revisions

Revision 523dadeb
Added by Peter Amstutz over 4 years ago

Merge branch '10081-cwl-run-same-job' closes #10081

Revision 69972d44
Added by Peter Amstutz over 4 years ago

Merge branch '10081-update-cwl-runner' refs #10081

History

#1 Updated by Peter Amstutz over 4 years ago

  • Description updated (diff)

#2 Updated by Peter Amstutz over 4 years ago

  • Description updated (diff)

#3 Updated by Peter Amstutz over 4 years ago

  • Assigned To set to Peter Amstutz

#4 Updated by Radhika Chippada over 4 years ago

  • TestWorkflow fails under run-tests because scatter2.cwl is not in the tests/wf directory (it is in the tests directory)
  • I copied it into the tests/wf directory, but the test still fails (I also did a reinstall)
======================================================================
ERROR: test_run (tests.test_job.TestWorkflow)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/radhika/arvados/sdk/cwl/tests/test_job.py", line 220, in test_run
    it.next().run()
  File "/home/radhika/arvados/sdk/cwl/arvados_cwl/arvjob.py", line 48, in run
    n.write(p.resolved.encode("utf-8"))
  File "/tmp/tmp.VV3pxv7gTR/VENVDIR/local/lib/python2.7/site-packages/arvados/arvfile.py", line 59, in __exit__
    self.close()
  File "/tmp/tmp.VV3pxv7gTR/VENVDIR/local/lib/python2.7/site-packages/arvados/arvfile.py", line 1101, in close
    self.flush()
  File "/tmp/tmp.VV3pxv7gTR/VENVDIR/local/lib/python2.7/site-packages/arvados/arvfile.py", line 51, in before_close_wrapper
    return orig_func(self, *args, **kwargs)
  File "/tmp/tmp.VV3pxv7gTR/VENVDIR/local/lib/python2.7/site-packages/arvados/arvfile.py", line 1097, in flush
    self.arvadosfile.flush()
  File "/tmp/tmp.VV3pxv7gTR/VENVDIR/local/lib/python2.7/site-packages/arvados/arvfile.py", line 238, in synchronized_wrapper
    return orig_func(self, *args, **kwargs)
  File "/tmp/tmp.VV3pxv7gTR/VENVDIR/local/lib/python2.7/site-packages/arvados/arvfile.py", line 936, in flush
    self.parent._my_block_manager().commit_bufferblock(self._current_bblock, sync=sync)
  File "/tmp/tmp.VV3pxv7gTR/VENVDIR/local/lib/python2.7/site-packages/arvados/arvfile.py", line 587, in commit_bufferblock
    loc = self._keep.put(block.buffer_view[0:block.write_pointer].tobytes(), copies=self.copies)
  File "/tmp/tmp.VV3pxv7gTR/VENVDIR/local/lib/python2.7/site-packages/arvados/retry.py", line 158, in num_retries_setter
    return orig_func(self, *args, **kwargs)
  File "/tmp/tmp.VV3pxv7gTR/VENVDIR/local/lib/python2.7/site-packages/arvados/keep.py", line 1096, in put
    data_hash, copies, writer_pool.done()), service_errors, label="service")
KeepWriteError: failed to write 0c17b076db9ae2ee0b7250d3db394952 (wanted 2 copies but wrote 0): service http://keep1.zzzzz.arvadosapi.com:25107/ responded with 0 (7, 'Failed to connect to keep1.zzzzz.arvadosapi.com port 25107: Connection refused'); service http://keep0.zzzzz.arvadosapi.com:25107/ responded with 0 (28, 'Connection timed out after 2002 milliseconds')

#5 Updated by Peter Amstutz over 4 years ago

The tests are fixed, thanks for catching that. Please take another look.

#6 Updated by Radhika Chippada over 4 years ago

  • “raise Exception("Uh oh %s" % obj["location"])” -- maybe you can clarify that the location should be a Keep locator in such-and-such format?
  • Does this update result in any unwanted “sequential” ordering of running jobs (instead of parallelization) resulting in longer test run times?

#7 Updated by Peter Amstutz over 4 years ago

Radhika Chippada wrote:

No, it is just pre-populating a cache, so it won't ever try to download from that URL. However, I realize I should probably change the URI to http://arvados.org/cwl to be consistent with the namespacing of the Arvados hints.

  • “raise Exception("Uh oh %s" % obj["location"])” -- maybe you can clarify that the location should be a Keep locator in such-and-such format?

Oops, that was a debugging check that should be removed.

  • Does this update result in any unwanted “sequential” ordering of running jobs (instead of parallelization) resulting in longer test run times?

This feature intentionally runs a series of steps in a single job using cwltool. Currently cwltool doesn't parallelize, so it will run those steps sequentially. However, when each step only runs for a few minutes, far more time is saved by avoiding the overhead of spinning up additional crunch jobs than is lost by giving up parallelism.
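
The trade-off can be sketched with some simple arithmetic. The numbers below (step count, step runtime, per-job startup overhead, and number of job slots) are hypothetical values chosen for illustration, not measurements from Arvados.

```python
import math


def separate_jobs_minutes(n_steps, step_min, overhead_min, slots):
    """Each step is its own job; up to `slots` jobs run at once, in rounds."""
    rounds = math.ceil(n_steps / slots)
    return rounds * (overhead_min + step_min)


def single_job_minutes(n_steps, step_min, overhead_min):
    """All steps run one after another in one job; overhead is paid once."""
    return overhead_min + n_steps * step_min


# Hypothetical workload: 20 independent steps of 2 minutes each,
# 5 minutes of startup overhead per job, 2 jobs able to run at once.
print(separate_jobs_minutes(20, 2, 5, slots=2))  # 70
print(single_job_minutes(20, 2, 5))              # 45
```

Under these assumed numbers the single job finishes sooner even though it gives up all parallelism. With many free slots or long-running steps the comparison can reverse, which is why the saving applies when each step is short.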

This has no effect on test times.

I'll update the ticket when I've addressed the first two items.

#8 Updated by Peter Amstutz over 4 years ago

Actually, while the check was for debugging, it should stay. Improved the exception text.
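
For illustration only, a check with a more descriptive message could look something like the sketch below. The helper name, the choice of `ValueError`, and the exact pattern (a `keep:` prefix followed by a 32-digit hex hash, `+`, a size, and an optional path) are assumptions, not the code in the commit referenced here.

```python
import re

# Assumed format: "keep:" + 32 hex digits + "+" + size, with an optional path.
KEEP_LOCATION = re.compile(r"^keep:[0-9a-f]{32}\+\d+(/.*)?$")


def check_keep_location(obj):
    """Raise a descriptive error if obj["location"] is not a keep reference."""
    location = obj.get("location", "")
    if not KEEP_LOCATION.match(location):
        raise ValueError(
            "Expected location of the form 'keep:<md5>+<size>/<path>', got %r"
            % location)
    return location
```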

Now at 8b7d63024652c112973d4dd82f9a5d89cc624fc7

#9 Updated by Radhika Chippada over 4 years ago

LGTM

#10 Updated by Peter Amstutz over 4 years ago

  • Status changed from New to Resolved
  • % Done changed from 50 to 100

Applied in changeset arvados|commit:523dadebfbee9a73a21c3f78c7b4af329930d393.
