Bug #9552

[CWL] arvados-cwl-runner runs too slowly when reusing jobs

Added by Sarah Guthrie over 4 years ago. Updated over 4 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Radhika Chippada
Category:
Performance
Target version:
Start date:
07/06/2016
Due date:
% Done:

75%

Estimated time:
(Total: 0.00 h)
Story points:
0.5

Description

Running a completed UNC pipeline via arvados-cwl-runner --local --enable-reuse takes 1 hour 24 minutes: https://workbench.qr1hi.arvadosapi.com/pipeline_instances/qr1hi-d1hrv-dwpq8wwvf89s6cu
  • First step (merge-fastqs): 17m8s (output size: 2 files totalling 15.1 GiB)
  • Second step (star): 1h7m6s (output size: 6 files totalling 58.6 GiB)
  • Third step (salmon): 2s (output size: 3 files totalling 4.7 MiB)

Subtasks

Task #9553: Review branch 9552-compute-checksum-flag (Resolved, Radhika Chippada)

Task #9735: Review 9552-compute-checksum-final-output (Resolved)

Task #9734: Compute checksum on final output (Resolved, Peter Amstutz)

Task #9610: Add compute_checksum flag to cwltool and submit a PR (Resolved, Radhika Chippada)

Associated revisions

Revision 4c98448b
Added by Radhika Chippada over 4 years ago

closes #9552
Merge branch '9552-compute-checksum-flag'

Revision 8d80d7aa
Added by Radhika Chippada over 4 years ago

refs #9552
Merge branch '9552-compute-checksum-flag'

Revision 7849273c (diff)
Added by Tom Clegg over 4 years ago

Use wrapper to enable checksums in cwl conformance tests.

refs #9552

Revision c3cbac37
Added by Peter Amstutz over 4 years ago

Merge branch '9552-compute-checksum-final-output' refs #9552

Revision 6b2ee0dd (diff)
Added by Peter Amstutz over 4 years ago

Add --compute-checksum to sdk/cwl/test_with_arvbox.sh refs #9552

History

#1 Updated by Brett Smith over 4 years ago

  • Target version set to 2016-07-20 sprint

#2 Updated by Brett Smith over 4 years ago

  • Story points set to 1.0

This story is timeboxed for this sprint at 2 days. Spend up to that much time diagnosing and fixing the bug. At that time, let's reassess.

#3 Updated by Tom Clegg over 4 years ago

  • Category set to Performance
  • Assigned To set to Tom Clegg

#4 Updated by Radhika Chippada over 4 years ago

  • Assigned To changed from Tom Clegg to Radhika Chippada

#5 Updated by Sarah Guthrie over 4 years ago

The CWL which exhibited this behavior can be found in the git repository on qr1hi: sguthrie/unc under the "cwl" directory.

The command, run in unc/cwl in an environment with ARVADOS_API_TOKEN and ARVADOS_API_HOST set and arvados-cwl-runner installed, is below:

arvados-cwl-runner --local --enable-reuse merge-star-salmon.cwl basic_test.json

Let me know if you need anything else to reproduce the bug.

#6 Updated by Radhika Chippada over 4 years ago

  • Status changed from New to In Progress

All of the time is being spent on our end.

The ArvadosJob.done method in sdk/cwl/arvados_cwl/arvjob.py parses every line of the log with tmpdirre, outdirre, and keepre to determine the tmpdir, outdir, and keepdir respectively; that parsing accounts for all the extra time after “finding” the job to reuse. The result is then passed to done.py -> done, which uses it to set builder.outdir and keepdir:

    self.builder.outdir = outdir
    self.builder.pathmapper.keepdir = keepdir
    return self.collect_outputs("keep:" + record["output"])
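The per-line scan described above can be sketched as follows (the patterns and plumbing here are hypothetical; the real regexps in arvjob.py differ):

```python
import re

# Hypothetical sketch of the per-line log scan; illustrative only.
tmpdirre = re.compile(r"\$\(task\.tmpdir\)=(\S+)")
outdirre = re.compile(r"\$\(task\.outdir\)=(\S+)")
keepre = re.compile(r"\$\(task\.keep\)=(\S+)")

def scan_log(lines):
    """Scan every log line for the tmpdir/outdir/keepdir markers."""
    tmpdir = outdir = keepdir = None
    for line in lines:
        m = tmpdirre.search(line)
        if m:
            tmpdir = m.group(1)
        m = outdirre.search(line)
        if m:
            outdir = m.group(1)
        m = keepre.search(line)
        if m:
            keepdir = m.group(1)
    return tmpdir, outdir, keepdir
```

Note that every line of the log must be read and matched even though only three values are wanted, which is why the cost scales with log size.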

For the merge-fastqs job, whose log is 263 KB, it took 11 minutes to process the log lines:

2016-07-11 13:16:52 arvados.cwl-runner[33584] INFO: START
2016-07-11 13:16:55 arvados.cwl-runner[33584] INFO: GOT LOG = [u'qr1hi-8i9sb-a4p07cjcfg3906t.log.txt']
2016-07-11 13:27:11 arvados.cwl-runner[33584] INFO: DONE GETTING OUTPUTS

This seems like the most inefficient way of handling the need to set outdir and keepdir. Should this information be stored as metadata on the job record (using properties) by crunch itself? And on older jobs where this information may not be available, set this on first “reuse” to save time in all future reuses?

#7 Updated by Tom Clegg over 4 years ago

Parsing 269KB of log entries is inefficient, but not inefficient enough to explain 11 minutes. Is there something wrong with the regexps? With the Collection class?

#8 Updated by Radhika Chippada over 4 years ago

I reran the test with the regexp-parsing lines commented out completely, leaving only the reading of log lines from the Collection in place. It again took 10 minutes to go through the log lines (while doing nothing with each line). So the issue appears to be in reading the lines from the log Collection.

2016-07-11 15:58:01 arvados.cwl-runner[58009] INFO: GOT LOG = [u'qr1hi-8i9sb-a4p07cjcfg3906t.log.txt']
2016-07-11 16:08:40 arvados.cwl-runner[58009] INFO: DONE GETTING OUTPUTS

#9 Updated by Tom Clegg over 4 years ago

Radhika Chippada wrote:

I reran the test with the regexp-parsing lines commented out completely, leaving only the reading of log lines from the Collection in place. It again took 10 minutes to go through the log lines (while doing nothing with each line). So the issue appears to be in reading the lines from the log Collection.

That narrows it down. Have you been able to reproduce it outside CWL?

This test script runs in less than one second on shell.qr1hi, using python-arvados-python-client 0.1.20160517202250-2:

import arvados
c=arvados.collection.Collection("2a3d13d2aac3242ac0b92a91b6875c58+87")
f=c.open("qr1hi-8i9sb-a4p07cjcfg3906t.log.txt")
for x in f:
    print x,

Perhaps something breaks the Python SDK's read cache in this context, and every iteration of "for l in log" re-fetches the entire file from the network? That would increase network latency by 1640×, which is the order of magnitude of the slowdown we're seeing.

FWIW I also downloaded the log from afar (browser via keep-web) and although I didn't time it, it was somewhere around 1s, certainly nowhere near 10m.
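The re-fetch theory can be illustrated with a toy model (the line count is an assumption chosen to match the 1640× figure above):

```python
# Toy model of the suspected failure mode: a reader with a working cache
# fetches the file once; a reader whose cache is broken re-fetches the
# entire file for every line it yields. All numbers are hypothetical.

def bytes_fetched(file_size, line_count, cache_works):
    if cache_works:
        return file_size            # one download, then served from cache
    return line_count * file_size   # every line triggers a full re-fetch

size = 269 * 1024   # ~269 KB log
lines = 1640        # assumed line count (~168 bytes per line)

slowdown = bytes_fetched(size, lines, False) / bytes_fetched(size, lines, True)
# slowdown == 1640, the order of magnitude of the observed ~1s -> ~10m gap
```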

#10 Updated by Radhika Chippada over 4 years ago

Sorry for the misinformation, but the issue is neither with the regexp parsing nor with the Collection class.

Those 10+ minutes are actually being spent in the collect_outputs call in done.py -> done:

    return self.collect_outputs("keep:" + record["output"])

It appears this comes from the cwltool library?

#11 Updated by Peter Amstutz over 4 years ago

cwltool/draft2tool.py L412:

with builder.fs_access.open(files["location"], "rb") as f:
    contents = f.read(CONTENT_LIMIT)
    if binding.get("loadContents"):
        files["contents"] = contents
    filesize = 0
    while contents != "":
        checksum.update(contents)
        filesize += len(contents)
        contents = f.read(1024*1024)

Add a flag to skip the while contents != "": loop, then make a pull request against the cwltool repository.

#12 Updated by Radhika Chippada over 4 years ago

#13 Updated by Radhika Chippada over 4 years ago

PR 127 has been merged into cwltool master.

#14 Updated by Radhika Chippada over 4 years ago

  • Target version changed from 2016-07-20 sprint to 2016-08-03 sprint
  • Story points changed from 1.0 to 0.5

#15 Updated by Radhika Chippada over 4 years ago

ad546911

  • Updated the cwltool version we are using; the compute_checksum flag is now available

#16 Updated by Tom Clegg over 4 years ago

ad54691 LGTM

#17 Updated by Radhika Chippada over 4 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 50 to 100

Applied in changeset arvados|commit:4c98448b0966aefbf23b92f3da3a9278d2b73cab.

#18 Updated by Radhika Chippada over 4 years ago

  • Status changed from Resolved to In Progress

Reopening this because our Jenkins environment runs the conformance tests, and they fail without checksum computation.

#19 Updated by Radhika Chippada over 4 years ago

Added the compute-checksum flag to arvados-cwl-runner at cfa7968

#20 Updated by Tom Clegg over 4 years ago

(More background)

I understand the checksum is supposed to be optional, but the conformance test suite doesn't know this yet.

Meanwhile, a command line flag plus a wrapper script should make it possible to pass conformance tests.

#!/bin/sh
exec arvados-cwl-runner --compute-checksum "$@" 

Does cfa7968 work? It looks like this command line flag won't make a difference, because arvExecutor() says kwargs["compute_checksum"]=False. And arvExecutor already gets all the command line flags. So, instead of adding a new argument to __init__() and a new instance variable, can we just change that line in arvExecutor to say kwargs["compute_checksum"]=kwargs.get("compute_checksum") ...?
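The suggested one-liner can be illustrated with a stripped-down model (the argument plumbing here is simplified; only the hard-coded-False versus pass-through distinction is the point):

```python
import argparse

# Simplified model of the fix suggested above: the executor already receives
# the parsed command-line flags in kwargs, so it should propagate the flag
# instead of hard-coding it to False. Names are illustrative.
def run(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--compute-checksum", action="store_true", default=False)
    kwargs = vars(parser.parse_args(argv))
    # Before the fix: kwargs["compute_checksum"] = False  (flag silently ignored)
    kwargs["compute_checksum"] = kwargs.get("compute_checksum")
    return kwargs["compute_checksum"]
```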

#21 Updated by Radhika Chippada over 4 years ago

Removed the instance variable.

#22 Updated by Tom Clegg over 4 years ago

LGTM @ a115812

#23 Updated by Tom Clegg over 4 years ago

The command line flag doesn't seem to be effective...

Test [6/73] Test failed: /usr/bin/cwl-runner --outdir=/tmp/tmpJro1gw --quiet v1.0/template-tool.cwl v1.0/cat-job.json
Test CreateFileRequirement ExpressionEngineRequirement.engineConfig feature
expected output object
{
  "foo": {
    "checksum": "sha1$63da67422622fbf9251a046d7a34b7ea0fd4fead",
    "class": "File",
    "location": "foo.txt",
    "size": 22
  }
}
got
{
  "foo": {
    "basename": "foo.txt",
    "class": "File",
    "location": "keep:b3e162ad900be8b7c15ab5552e7c1c1c+112/foo.txt",
    "size": 22
  }
}
Compare failure
expected
{
  "checksum": "sha1$63da67422622fbf9251a046d7a34b7ea0fd4fead",
  "class": "File",
  "size": 22
}
got
{
  "class": "File",
  "size": 22
}

#24 Updated by Tom Clegg over 4 years ago

I think what we're missing is a mechanism to pass the compute_checksum flag to the Arvados job that runs the tool. The only case where we propagate kwargs["compute_checksum"] here is the one where we execute a CommandLineTool locally. The conformance test is using the Crunch1 work API, though, so it hits the RunnerJob line:

            if self.work_api == "containers":
                if tool.tool["class"] == "CommandLineTool":
                    runnerjob = tool.job(job_order,
                                         self.output_callback,
                                         **kwargs).next()
                else:
                    runnerjob = RunnerContainer(self, tool, job_order, kwargs.get("enable_reuse"))
            else:
                runnerjob = RunnerJob(self, tool, job_order, kwargs.get("enable_reuse"))

https://github.com/curoverse/arvados/blob/master/sdk/cwl/arvados_cwl/__init__.py#L151

#25 Updated by Radhika Chippada over 4 years ago

Tom, thanks for the hint.

I used crunch2 to run the tests, and most tests passed.

 exec arvados-cwl-runner --compute-checksum --api containers "$@" 

The following tests failed and the rest (including, test 70) passed:

Test [36/73]   v1.0/env-job.json
Test [37/73]   v1.0/env-job.json
Test [39/73]   v1.0/revsort-job.json
Test [45/73]   v1.0/schemadef-job.json
Test [64/73]   v1.0/conflict-job.json

Would it be possible to run the tests using crunch2 in c97qk?

If this env is not ready for crunch2, would it be possible to (1) disable this Jenkins job until then, or (2) run just the first 5 or so of the conformance tests (which would only verify that the environment is set up properly)?

If none of these options are acceptable, please let me know what updates need to be made to RunnerJob (preferably in a separate ticket ...). We probably also need another ticket to investigate why those 5 tests fail in crunch2 and fix any issues.

#26 Updated by Radhika Chippada over 4 years ago

  • Target version changed from 2016-08-03 sprint to 2016-08-17 sprint

#27 Updated by Peter Amstutz over 4 years ago

Here's a thought: instead of passing the --compute-checksum flag through to the runner job, there's no reason the client can't check whether the checksums are present and, if not, compute them itself.

#28 Updated by Peter Amstutz over 4 years ago

Implemented branch 9552-compute-checksum-final-output, which fills in the checksums before returning the final output object when --compute-checksum is provided.
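A minimal sketch of that client-side approach, following the CWL convention of File entries carrying a sha1$-prefixed checksum (the traversal and the read_bytes helper are assumptions, not the branch's actual code):

```python
import hashlib

def fill_checksums(obj, read_bytes):
    """Recursively add a missing "checksum" to every File in a CWL output
    object. read_bytes is a hypothetical helper returning the contents of
    a file for a given location."""
    if isinstance(obj, dict):
        if obj.get("class") == "File" and "checksum" not in obj:
            digest = hashlib.sha1(read_bytes(obj["location"])).hexdigest()
            obj["checksum"] = "sha1$" + digest
        for v in obj.values():
            fill_checksums(v, read_bytes)
    elif isinstance(obj, list):
        for v in obj:
            fill_checksums(v, read_bytes)
    return obj
```

This keeps the runner job unchanged: only files that the job left without checksums are read, and only when the user asked for checksums.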

#29 Updated by Tom Clegg over 4 years ago

9552-compute-checksum-final-output LGTM, thanks!

#30 Updated by Peter Amstutz over 4 years ago

  • Status changed from In Progress to Resolved
