Bug #9773

[CWL] It takes 1.25 hours to re-use 300 jobs

Added by Bryan Cosca almost 4 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
08/11/2016
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
-

Description

For example: https://workbench.qr1hi.arvadosapi.com/pipeline_instances/qr1hi-d1hrv-nkdgmvy6d0tzgbi

The pipeline instance was created at 3:52pm 8/10/16

The variant_effect_predictor job started at 5:06pm 8/10/16

I would like it a lot faster :)


Subtasks

Task #9829: Review 9773-faster-find-or-create (Resolved, Tom Clegg)


Related issues

Related to Arvados - Story #9888: [API] Refactor job reuse code to be in the Job model instead of buried in JobsController (Resolved, 09/01/2016)

Associated revisions

Revision 1c258a38
Added by Tom Clegg almost 4 years ago

Merge branch '9773-faster-find-or-create'

refs #9773

History

#1 Updated by Bryan Cosca almost 4 years ago

It seems to take 3-4 seconds to reuse a job for this pipeline though...

arvados-cwl-runner --local --verbose bclfastq-wf.cwl bclfastq.yml 
/home/bcosc/venv-test3/bin/arvados-cwl-runner 1.0.20160811221813, arvados-python-client 0.1.20160727172639, cwltool 1.0.20160811184335
2016-08-18 13:56:54 arvados.cwl-runner[63497] INFO: Pipeline instance qr1hi-d1hrv-tf8jy3x3hbbwepx
2016-08-18 13:56:55 arvados.arv-run[63497] INFO: Upload local files: "bclfastq2.sh" 
2016-08-18 13:56:56 arvados.arv-run[63497] INFO: Using collection qr1hi-4zz18-20vkqqo8c5r02oz
2016-08-18 13:56:58 arvados.cwl-runner[63497] INFO: Job bcl2fastq (qr1hi-8i9sb-cqxav9rgw0jnvtz) is Complete
[step bcl2fastq] completion status is success
2016-08-18 13:57:00 arvados.arv-run[63497] INFO: Upload local files: "move-fastqs-to-dir.py" 
2016-08-18 13:57:00 arvados.arv-run[63497] INFO: Uploaded to qr1hi-4zz18-4ohl74z33bppze4
2016-08-18 13:57:02 arvados.cwl-runner[63497] INFO: Job movetodir (qr1hi-8i9sb-jmscypd7qfrbgiv) is Complete
[step movetodir] completion status is success
2016-08-18 13:57:05 arvados.cwl-runner[63497] INFO: Job archerdemux (qr1hi-8i9sb-qux4i851f0l3ybf) is Complete
2016-08-18 13:57:09 arvados.cwl-runner[63497] INFO: Job archerdemux_2 (qr1hi-8i9sb-m2ig7cr4ibuu92j) is Complete
2016-08-18 13:57:12 arvados.cwl-runner[63497] INFO: Job archerdemux_3 (qr1hi-8i9sb-97cghd8jw3o57he) is Complete
2016-08-18 13:57:16 arvados.cwl-runner[63497] INFO: Job archerdemux_4 (qr1hi-8i9sb-ja4llmwtd4rk8ot) is Complete
2016-08-18 13:57:20 arvados.cwl-runner[63497] INFO: Job archerdemux_5 (qr1hi-8i9sb-kynvdn4zf68arpw) is Complete
2016-08-18 13:57:23 arvados.cwl-runner[63497] INFO: Job archerdemux_6 (qr1hi-8i9sb-sng1ilw8ylvp3cz) is Complete
2016-08-18 13:57:27 arvados.cwl-runner[63497] INFO: Job archerdemux_7 (qr1hi-8i9sb-on80m1ollu8w5m0) is Complete

This one has 150 jobs, so I'm still waiting for 10 minutes for each re-run...

#2 Updated by Bryan Cosca almost 4 years ago

Maybe the job reuse lookups could happen in parallel? Is that how Arvados pipeline templates work, as opposed to CWL?

#3 Updated by Bryan Cosca almost 4 years ago

I think it's increasing over time...

( from the same instance as note #1 )

unner[63497] INFO: Job archerdemux_120 (qr1hi-8i9sb-58y47df4c3lvh6x) is Complete
2016-08-18 14:07:16 arvados.cwl-runner[63497] INFO: Job archerdemux_121 (qr1hi-8i9sb-6xxojux716h7r4v) is Complete
2016-08-18 14:07:23 arvados.cwl-runner[63497] INFO: Job archerdemux_122 (qr1hi-8i9sb-6gth0en6kxgbx13) is Complete
2016-08-18 14:07:29 arvados.cwl-runner[63497] INFO: Job archerdemux_123 (qr1hi-8i9sb-y8hnjx5npiap30q) is Complete
2016-08-18 14:07:36 arvados.cwl-runner[63497] INFO: Job archerdemux_124 (qr1hi-8i9sb-5nmvg9cs0ujtxdk) is Complete
2016-08-18 14:07:43 arvados.cwl-runner[63497] INFO: Job archerdemux_125 (qr1hi-8i9sb-x6plniqy53tpnpo) is Complete
2016-08-18 14:07:50 arvados.cwl-runner[63497] INFO: Job archerdemux_126 (qr1hi-8i9sb-ona1snr3ypgsg1y) is Complete
2016-08-18 14:07:56 arvados.cwl-runner[63497] INFO: Job archerdemux_127 (qr1hi-8i9sb-4sp08zcrc7hwmtm) is Complete
2016-08-18 14:08:03 arvados.cwl-runner[63497] INFO: Job archerdemux_128 (qr1hi-8i9sb-n5xrbjgknz8wyzc) is Complete
2016-08-18 14:08:10 arvados.cwl-runner[63497] INFO: Job archerdemux_129 (qr1hi-8i9sb-2a6x8l07dy6vold) is Complete
2016-08-18 14:08:16 arvados.cwl-runner[63497] INFO: Job archerdemux_130 (qr1hi-8i9sb-8gq459ailvr46ns) is Complete
2016-08-18 14:08:23 arvados.cwl-runner[63497] INFO: Job archerdemux_131 (qr1hi-8i9sb-vv99lpfwx62yfzo) is Complete
2016-08-18 14:08:30 arvados.cwl-runner[63497] INFO: Job archerdemux_132 (qr1hi-8i9sb-ojcy0fw50bel8pa) is Complete
2016-08-18 14:08:37 arvados.cwl-runner[63497] INFO: Job archerdemux_133 (qr1hi-8i9sb-883ike8c5b6ig3p) is Complete
2016-08-18 14:08:44 arvados.cwl-runner[63497] INFO: Job archerdemux_134 (qr1hi-8i9sb-wh9wzq55xa805hu) is Complete
2016-08-18 14:08:51 arvados.cwl-runner[63497] INFO: Job archerdemux_135 (qr1hi-8i9sb-iz67yk93gti8rk5) is Complete
2016-08-18 14:08:58 arvados.cwl-runner[63497] INFO: Job archerdemux_136 (qr1hi-8i9sb-zuoibeypq840kb1) is Complete
2016-08-18 14:09:05 arvados.cwl-runner[63497] INFO: Job archerdemux_137 (qr1hi-8i9sb-7778iw7qhb1uneo) is Complete
2016-08-18 14:09:12 arvados.cwl-runner[63497] INFO: Job archerdemux_138 (qr1hi-8i9sb-jom8ps6ikt6pjgm) is Complete
2016-08-18 14:09:19 arvados.cwl-runner[63497] INFO: Job archerdemux_139 (qr1hi-8i9sb-w3280yxbppu8xri) is Complete
2016-08-18 14:09:26 arvados.cwl-runner[63497] INFO: Job archerdemux_140 (qr1hi-8i9sb-wuq8m5km0otuo9l) is Complete
2016-08-18 14:09:34 arvados.cwl-runner[63497] INFO: Job archerdemux_141 (qr1hi-8i9sb-xgof4czghf4w1fk) is Complete
2016-08-18 14:09:41 arvados.cwl-runner[63497] INFO: Job archerdemux_142 (qr1hi-8i9sb-39a5xlllwya0a3v) is Complete
2016-08-18 14:09:49 arvados.cwl-runner[63497] INFO: Job archerdemux_143 (qr1hi-8i9sb-6dfdsjatasevyrt) is Complete
2016-08-18 14:09:56 arvados.cwl-runner[63497] INFO: Job archerdemux_144 (qr1hi-8i9sb-xtrzupjeb66s6nz) is Complete
2016-08-18 14:10:04 arvados.cwl-runner[63497] INFO: Job archerdemux_145 (qr1hi-8i9sb-g3uxz5ut215deio) is Complete
2016-08-18 14:10:11 arvados.cwl-runner[63497] INFO: Job archerdemux_146 (qr1hi-8i9sb-cvne85iaf7f5hod) is Complete
2016-08-18 14:10:19 arvados.cwl-runner[63497] INFO: Job archerdemux_147 (qr1hi-8i9sb-c07fz5qm3el13nq) is Complete
2016-08-18 14:10:26 arvados.cwl-runner[63497] INFO: Job archerdemux_148 (qr1hi-8i9sb-xmw0aa9w597v18q) is Complete
2016-08-18 14:10:34 arvados.cwl-runner[63497] INFO: Job archerdemux_149 (qr1hi-8i9sb-w0d2jqq9eneo5sf) is Complete
2016-08-18 14:10:41 arvados.cwl-runner[63497] INFO: Job archerdemux_150 (qr1hi-8i9sb-rkhpyn53uxeykgb) is Complete
2016-08-18 14:10:49 arvados.cwl-runner[63497] INFO: Job archerdemux_151 (qr1hi-8i9sb-b9red2lfdcqdvm3) is Complete
2016-08-18 14:10:57 arvados.cwl-runner[63497] INFO: Job archerdemux_152 (qr1hi-8i9sb-jhr7yw47ds5het1) is Complete
2016-08-18 14:11:04 arvados.cwl-runner[63497] INFO: Job archerdemux_153 (qr1hi-8i9sb-ri2w2l5543heyay) is Complete
2016-08-18 14:11:12 arvados.cwl-runner[63497] INFO: Job archerdemux_154 (qr1hi-8i9sb-afkfh92ubf19ull) is Complete
2016-08-18 14:11:19 arvados.cwl-runner[63497] INFO: Job archerdemux_155 (qr1hi-8i9sb-adf2npm03a02a96) is Complete
2016-08-18 14:11:28 arvados.cwl-runner[63497] INFO: Job archerdemux_156 (qr1hi-8i9sb-8dg8m9b3murldfu) is Complete
2016-08-18 14:11:36 arvados.cwl-runner[63497] INFO: Job archerdemux_157 (qr1hi-8i9sb-92tllhasky87tk3) is Complete
2016-08-18 14:11:44 arvados.cwl-runner[63497] INFO: Job archerdemux_158 (qr1hi-8i9sb-g9s91sn3may7n4l) is Complete
2016-08-18 14:11:51 arvados.cwl-runner[63497] INFO: Job archerdemux_159 (qr1hi-8i9sb-fblmoo9tst40ldq) is Complete

#4 Updated by Tom Morris almost 4 years ago

  • Assigned To set to Tom Clegg
  • Target version set to 2016-08-31 sprint

#5 Updated by Tom Clegg almost 4 years ago

Rails logs are painful but I think I'm seeing, for each pipeline component/job:
  • POST /arvados/v1/jobs?...&find_or_create=true&... taking ~1.5 seconds
  • update pipeline instance with found job
  • get collection (by PDH), select manifest_text
  • create new collection
  • get same collection again (by PDH)

This seems a little wasteful, but I don't think it comes close to explaining 4 seconds per job.

The 1.5-second "find or create" query is something we might be able to speed up on the API side. Improving the cwl runner to do multiple concurrent queries would probably help too, but that might be harder, and won't always help (e.g., in a serial pipeline, the N+1st query can't be built until the response to the Nth query is received).
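The serial-pipeline caveat can be sketched with a toy simulation (illustrative Ruby only, not the actual runner, which is Python; sleep 0.05 stands in for a scaled-down ~1.5 s find_or_create call):

```ruby
# Stand-in for one find_or_create API round trip.
def simulate_request
  sleep 0.05
  :done
end

# Independent components: all requests overlap, total ~ one request.
t0 = Time.now
threads = 4.times.map { Thread.new { simulate_request } }
threads.each(&:join)
parallel_elapsed = Time.now - t0

# Serial chain: each request needs the previous response, total ~ four requests.
t0 = Time.now
4.times { simulate_request }
serial_elapsed = Time.now - t0

# parallel_elapsed is well under serial_elapsed for independent work;
# a strictly serial pipeline gets no such benefit.
```

So concurrency on the runner side helps scatter-style pipelines with many independent components, but not chains where each step's inputs come from the previous step's output.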

#6 Updated by Tom Clegg almost 4 years ago

"Getting worse over time" suggests the "get all maybe-matching jobs from Postgres, and filter them further in Ruby" approach is making a significant contribution.

This pipeline has lots of jobs with identical docker image, command, version, etc., but different script_parameters. Currently we filter on docker/command/version in Postgres, and filter script_parameters in Ruby, so this is just the sort of pattern we'd expect to be slow (and slower with every run).
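The digest approach can be sketched as follows (illustrative Ruby, not the actual Arvados code; script_parameters_digest here stands in for whatever Job.sorted_hash_digest does internally): canonicalize the parameters by deep-sorting hash keys, then store an MD5 digest so Postgres can match script_parameters with a plain indexed equality filter instead of fetching every candidate row and comparing in Ruby.

```ruby
require 'digest/md5'
require 'json'

# Recursively sort hash keys so logically identical script_parameters
# always serialize to the same string.
def deep_sort(obj)
  case obj
  when Hash
    obj.keys.sort.each_with_object({}) { |k, h| h[k] = deep_sort(obj[k]) }
  when Array
    obj.map { |v| deep_sort(v) }
  else
    obj
  end
end

# Digest of the canonical serialization; an indexed column of these lets
# the database do "where script_parameters_digest = ?" directly.
def script_parameters_digest(params)
  Digest::MD5.hexdigest(deep_sort(params).to_json)
end

a = script_parameters_digest({'input' => 'abc', 'reads' => 2})
b = script_parameters_digest({'reads' => 2, 'input' => 'abc'})
# a == b: key order no longer matters
```

With the digest column, lookup cost stays roughly constant instead of growing with the number of jobs sharing the same docker/command/version.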

#7 Updated by Bryan Cosca almost 4 years ago

FWIW, this kind of job pattern is getting very common when using CWL. The ScatterMethod feature creates a new job for each sample file. I'm making pipelines that have 300+ jobs, almost all of which have the pattern you described because of this scattering.

#8 Updated by Tom Clegg almost 4 years ago

9773-faster-find-or-create @ 051a325 queued for jenkins testing.

#9 Updated by Ward Vandewege almost 4 years ago

Tom Clegg wrote:

9773-faster-find-or-create @ 051a325 queued for jenkins testing.

Build 147 failed (services/api and services/nodemanager). The nodemanager test failure looks like it may be a race. The API server test, I'm not sure about:

  1) Failure:
ArvadosModelTest#test_search_index_exists_on_models_that_go_into_projects [/data/1/jenkins/workspace/developer-test-job/services/api/test/unit/arvados_model_test.rb:144]:
jobs has no search index with columns ["uuid", "owner_uuid", "modified_by_client_uuid", "modified_by_user_uuid", "submit_id", "script", "script_version", "cancelled_by_client_uuid", "cancelled_by_user_uuid", "output", "is_locked_by_uuid", "log", "repository", "supplied_script_version", "docker_image_locator", "state", "arvados_sdk_version", "script_parameters_digest"]. Instead found search index with columns ["uuid", "owner_uuid", "modified_by_client_uuid", "modified_by_user_uuid", "submit_id", "script", "script_version", "cancelled_by_client_uuid", "cancelled_by_user_uuid", "output", "is_locked_by_uuid", "log", "repository", "supplied_script_version", "docker_image_locator", "state", "arvados_sdk_version"]

I kicked it off again just in case, as build 149.

#10 Updated by Ward Vandewege almost 4 years ago

Ward Vandewege wrote:

I kicked it off again just in case, as build 149.

149 failed too, with the same services/api error, so I think that one is real. The nodemanager error is gone, so that was a race.

#11 Updated by Tom Clegg almost 4 years ago

Fixed the index test (by dropping script_parameters_digest from searchable_columns) in a3e6fef and kicked off

https://ci.curoverse.com/job/developer-test-job/150/
https://ci.curoverse.com/job/developer-test-job/151/

#12 Updated by Ward Vandewege almost 4 years ago

Tom Clegg wrote:

Fixed the index test (by dropping script_parameters_digest from searchable_columns) in a3e6fef and kicked off

https://ci.curoverse.com/job/developer-test-job/150/
https://ci.curoverse.com/job/developer-test-job/151/

Hrm, both failed with the same set of selenium errors in the workbench integration tests. I kicked off another build just in case.

#14 Updated by Tom Clegg almost 4 years ago

Note: you can expect the migration 20160819195725_populate_script_parameters_digest to take a while on a production site with lots of job records.

#15 Updated by Radhika Chippada almost 4 years ago

- In the PopulateScriptParametersDigest migration script

  • Every job in the selected batch is updated here (due to the id=? clause), even though the update_all already prevents jobs with the same script_parameters from turning up in the next batch. So I am wondering whether fetching a smaller number of objects (limit(200)?) would yield better overall performance?
        Job.
          where('id=? or script_parameters=?', j.id, j.script_parameters.to_yaml).
          update_all(script_parameters_digest: j.update_script_parameters_digest)
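For illustration, the batching pattern under discussion can be simulated in plain Ruby over an in-memory array (not the actual ActiveRecord migration): fetch a batch of rows lacking a digest, compute the digest once per distinct script_parameters value, and stamp it onto every matching row, so rows covered by the bulk update never reappear in a later batch.

```ruby
require 'digest/md5'
require 'json'

records = [
  {id: 1, params: {'a' => 1}, digest: nil},
  {id: 2, params: {'a' => 1}, digest: nil},  # same params as id 1
  {id: 3, params: {'b' => 2}, digest: nil},
]

loop do
  # Fetch up to 200 rows that still lack a digest (the batch size at issue).
  batch = records.select { |r| r[:digest].nil? }.first(200)
  break if batch.empty?
  batch.each do |r|
    next unless r[:digest].nil?  # may already be stamped within this batch
    d = Digest::MD5.hexdigest(r[:params].to_json)
    # "update_all": stamp every row sharing these params in one pass.
    records.each { |q| q[:digest] = d if q[:params] == r[:params] }
  end
end
# records 1 and 2 share a digest; record 3 differs
```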

- In JobsController: In the following code, would there be a scenario with both a Completed job and Running/Queued jobs? In that scenario, I'd assume that "if @object.output != j.output" would result in ignoring both and starting a third job?

+      where('state = ? or (owner_uuid = ? and state in (?))',
+            Job::Complete, current_user.uuid, [Job::Queued, Job::Running]).
+      where('script_parameters_digest = ?', Job.sorted_hash_digest(resource_attrs[:script_parameters])).
+      where('nondeterministic is distinct from ?', true).
+      order('state desc, created_at') # prefer Running jobs over Queued

LGTM with those questions clarified.

#16 Updated by Tom Clegg almost 4 years ago

Radhika Chippada wrote:

I am wondering if fetching a smaller number of objects (limit(200)?) would yield better overall performance?

Yes, I think you're right. No particular advantage to having a larger batch size. Changed to 200.

- In JobsController: In the following code, would there be a scenario where there is Completed job and Running / Queued jobs? In this scenario, I’d assume that “if @object.output != j.output” would result in ignoring both and starting a third job?

Ah, yes. If we have

job  state     output
A    Complete  123
B    Complete  456
C    Running   -

...then a3e6fef would start a new job, whereas the old code would have used job C.

It seems better to preserve the old behavior in this case.

897f8c0
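The decision being preserved can be sketched like this (illustrative plain Ruby, not the actual JobsController code; Job here is a toy struct):

```ruby
Job = Struct.new(:uuid, :state, :output)

# Among candidates matching the digest: reuse a Complete job only when all
# Complete candidates agree on their output; otherwise fall back to a
# Running (preferred) or Queued job rather than starting a new one.
def pick_reusable(candidates)
  complete = candidates.select { |j| j.state == 'Complete' }
  if !complete.empty? && complete.map(&:output).uniq.size == 1
    return complete.first
  end
  candidates.find { |j| j.state == 'Running' } ||
    candidates.find { |j| j.state == 'Queued' }
end

jobs = [Job.new('A', 'Complete', '123'),
        Job.new('B', 'Complete', '456'),
        Job.new('C', 'Running', nil)]
# Complete outputs disagree, so the old behavior reuses Running job C
# instead of starting a fourth job.
pick_reusable(jobs)  # => job C
```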

#17 Updated by Tom Morris almost 4 years ago

  • Status changed from New to In Progress

#18 Updated by Tom Clegg almost 4 years ago

  • Target version changed from 2016-08-31 sprint to 2016-09-14 sprint

#19 Updated by Tom Clegg over 3 years ago

  • Status changed from In Progress to Feedback

#20 Updated by Tom Clegg over 3 years ago

  • Status changed from Feedback to Resolved
