Bug #11071

[Workbench] qr1hi performance tests are broken

Added by Ward Vandewege 6 months ago. Updated 4 months ago.

Status:ResolvedStart date:03/14/2017
Priority:NormalDue date:
Assignee:Radhika Chippada% Done:

100%

Category:-
Target version:2017-03-15 sprint
Story points-Remaining (hours)0.00 hour
Velocity based estimate-

Description

Oldest 'OK' run: https://ci.curoverse.com/job/qr1hi-automated-performance-suite/3162/

Since then (ie Dec. 26, 2016) they have been broken. For example, from https://ci.curoverse.com/job/qr1hi-automated-performance-suite/3427/console:

                       ********** Starting performance test **********

Previous HEAD position was aaa45b0... Merge branch '10968-arvput-resume-progress' Closes #10968
HEAD is now at 85f8ab2... Bump up rails package iteration.
Warning: the running version of Bundler is older than the version that created the lockfile. We suggest you upgrade to the latest version of Bundler by running `gem install bundler`.
Using rake 10.4.2
Using RedCloth 4.2.9
Using i18n 0.7.0
Using json 1.8.3
Using minitest 5.8.4
Using thread_safe 0.3.5
Using builder 3.2.2
Using erubis 2.7.0
Using rack 1.5.5
Using mime-types 2.99
Using arel 5.0.1.20140414130214
Using addressable 2.4.0
Using andand 1.3.3
Using angularjs-rails 1.3.8
Using extlib 0.9.16
Using multi_json 1.12.1
Using multipart-post 2.0.0
Using jwt 1.5.4
Using little-plugger 1.1.4
Using memoist 0.14.0
Using os 0.9.6
Using retriable 1.4.1
Using sass 3.4.9
Using thor 0.19.1
Using columnize 0.9.0
Using debugger-linecache 1.2.0
Using slop 3.6.0
Using highline 1.6.21
Using net-ssh 2.9.2
Using mini_portile 0.6.2
Using ffi 1.9.10
Using cliver 0.3.2
Using coffee-script-source 1.8.0
Using execjs 2.7.0
Using commonjs 0.2.7
Using daemon_controller 1.2.0
Using deep_merge 1.0.1
Using docile 1.1.5
Using headless 1.0.2
Using httpclient 2.8.2.4
Using libv8 3.16.14.7
Using logstash-event 1.2.02
Using metaclass 0.0.4
Using oj 2.11.2
Using bundler 1.11.2
Using websocket-extensions 0.1.1
Using rails-perftest 0.0.5
Using raphael-rails 2.1.2
Using ref 1.0.5
Using ruby-debug-passenger 0.2.0
Using ruby-prof 0.15.2
Using rubyzip 1.1.7
Using safe_yaml 1.0.4
Using tilt 1.4.1
Using websocket 1.2.2
Using simplecov-html 0.8.0
Using sshkey 1.6.1
Using wiselinks 1.2.1
Using rake-compiler 0.9.5
Using tzinfo 1.2.2
Using rack-test 0.6.3
Using sprockets 3.2.0
Using rack-mini-profiler 0.9.2
Using mail 2.6.3
Using launchy 2.4.3
Using autoparse 0.3.3
Using faraday 0.9.2
Using logging 2.1.0
Using bootstrap-sass 3.1.1.1
Using byebug 3.5.1
Using net-scp 1.2.1
Using net-sftp 2.1.2
Using net-ssh-gateway 1.2.0
Using nokogiri 1.6.6.4
Using childprocess 0.5.6
Using coffee-script 2.3.0
Using uglifier 2.7.2
Using less 2.6.0
Using passenger 4.0.57
Using mocha 1.1.0
Using websocket-driver 0.5.1
Using therubyracer 0.12.1
Using simplecov 0.9.1
Using fast_stack 0.1.0
Using activesupport 4.1.12
Using signet 0.7.2
Using capistrano 2.15.5
Using xpath 2.0.0
Using selenium-webdriver 2.53.1
Using simplecov-rcov 0.2.3
Using flamegraph 0.1.0
Using actionview 4.1.12
Using activemodel 4.1.12
Using googleauth 0.5.1
Using rvm-capistrano 1.5.5
Using capybara 2.5.0
Using actionpack 4.1.12
Using activerecord 4.1.12
Using google-api-client 0.8.6
Using poltergeist 1.5.1
Using actionmailer 4.1.12
Using railties 4.1.12
Using less-rails 2.6.0
Using sprockets-rails 2.3.2
Using activerecord-nulldb-adapter 0.3.1
Using arvados 0.1.20160420143004
Using bootstrap-tab-history-rails 0.1.0
Using bootstrap-x-editable-rails 1.5.1.1
Using coffee-rails 4.1.0
Using jquery-rails 3.1.2
Using lograge 0.3.6
Using morrisjs-rails 0.5.1
Using rails 4.1.12
Using sass-rails 5.0.1
Using piwik_analytics 1.0.2
Using themes_for_rails 0.5.1 from https://github.com/curoverse/themes_for_rails (at master@6115487)
Bundle complete! 46 Gemfile dependencies, 116 gems now installed.
Use `bundle show [gemname]` to see where a bundled gem is installed.
Called 'load' without the :safe option -- defaulting to safe mode.
You can avoid this warning in the future by setting the SafeYAML::OPTIONS[:default_mode] option (to :safe or :unsafe).
Called 'load' without the :safe option -- defaulting to safe mode.
You can avoid this warning in the future by setting the SafeYAML::OPTIONS[:default_mode] option (to :safe or :unsafe).
Using display :182 for headless tests
Using port 32800 for poltergeist
Run options: --seed 41450

# Running:

BrowsingTest#test_home_page (5.93 sec warmup)
           wall_time: 5.06 sec
.Error during WebSocket handshake: location mismatch: wss://ws.qr1hi.arvadosapi.com/websocket?api_token=5rrtrgnlb2n20qtp2mft5cqpi7psqr6l29mbnmsmtwzra2jt0g != ws://ws.qr1hi.arvadosapi.com/websocket?api_token=5rrtrgnlb2n20qtp2mft5cqpi7psqr6l29mbnmsmtwzra2jt0g
Saved ./tmp/workbench-fail-1.png
BrowsingTest#test_search_for_hash (0 ms warmup)
Error during WebSocket handshake: location mismatch: wss://ws.qr1hi.arvadosapi.com/websocket?api_token=5rrtrgnlb2n20qtp2mft5cqpi7psqr6l29mbnmsmtwzra2jt0g != ws://ws.qr1hi.arvadosapi.com/websocket?api_token=5rrtrgnlb2n20qtp2mft5cqpi7psqr6l29mbnmsmtwzra2jt0g
Saved ./tmp/workbench-fail-2.png
Error during WebSocket handshake: location mismatch: wss://ws.qr1hi.arvadosapi.com/websocket?api_token=5rrtrgnlb2n20qtp2mft5cqpi7psqr6l29mbnmsmtwzra2jt0g != ws://ws.qr1hi.arvadosapi.com/websocket?api_token=5rrtrgnlb2n20qtp2mft5cqpi7psqr6l29mbnmsmtwzra2jt0g
Saved ./tmp/workbench-fail-3.png
Error during WebSocket handshake: location mismatch: wss://ws.qr1hi.arvadosapi.com/websocket?api_token=5rrtrgnlb2n20qtp2mft5cqpi7psqr6l29mbnmsmtwzra2jt0g != ws://ws.qr1hi.arvadosapi.com/websocket?api_token=5rrtrgnlb2n20qtp2mft5cqpi7psqr6l29mbnmsmtwzra2jt0g
Saved ./tmp/workbench-fail-4.png
Error during WebSocket handshake: location mismatch: wss://ws.qr1hi.arvadosapi.com/websocket?api_token=5rrtrgnlb2n20qtp2mft5cqpi7psqr6l29mbnmsmtwzra2jt0g != ws://ws.qr1hi.arvadosapi.com/websocket?api_token=5rrtrgnlb2n20qtp2mft5cqpi7psqr6l29mbnmsmtwzra2jt0g
Saved ./tmp/workbench-fail-5.png
Error during WebSocket handshake: location mismatch: wss://ws.qr1hi.arvadosapi.com/websocket?api_token=5rrtrgnlb2n20qtp2mft5cqpi7psqr6l29mbnmsmtwzra2jt0g != ws://ws.qr1hi.arvadosapi.com/websocket?api_token=5rrtrgnlb2n20qtp2mft5cqpi7psqr6l29mbnmsmtwzra2jt0g
Saved ./tmp/workbench-fail-6.png
           wall_time: 0 ms
E

Finished in 445.442299s, 0.0045 runs/s, 0.0000 assertions/s.

  1) Error:
BrowsingTest#test_search_for_hash:
Capybara::ExpectationNotMet: expected to find css ".selectable[data-object-uuid]" but there were no matches
    test/performance/browsing_test.rb:43:in `block (2 levels) in <class:BrowsingTest>'
    test/performance/browsing_test.rb:40:in `block in <class:BrowsingTest>'
    test/test_helper.rb:299:in `run'

Error:
BrowsingTest#test_search_for_hash:
Capybara::ExpectationNotMet: expected to find css ".selectable[data-object-uuid]" but there were no matches
    test/performance/browsing_test.rb:43:in `block (2 levels) in <class:BrowsingTest>'
    test/performance/browsing_test.rb:40:in `block in <class:BrowsingTest>'
    test/test_helper.rb:299:in `run'

Error:
BrowsingTest#test_search_for_hash:
Capybara::ExpectationNotMet: expected to find css ".selectable[data-object-uuid]" but there were no matches
    test/performance/browsing_test.rb:43:in `block (2 levels) in <class:BrowsingTest>'
    test/performance/browsing_test.rb:40:in `block in <class:BrowsingTest>'
    test/test_helper.rb:299:in `run'

Error:
BrowsingTest#test_search_for_hash:
Capybara::ExpectationNotMet: expected to find css ".selectable[data-object-uuid]" but there were no matches
    test/performance/browsing_test.rb:43:in `block (2 levels) in <class:BrowsingTest>'
    test/performance/browsing_test.rb:40:in `block in <class:BrowsingTest>'
    test/test_helper.rb:299:in `run'

Error:
BrowsingTest#test_search_for_hash:
Capybara::ExpectationNotMet: expected to find css ".selectable[data-object-uuid]" but there were no matches
    test/performance/browsing_test.rb:43:in `block (2 levels) in <class:BrowsingTest>'
    test/performance/browsing_test.rb:40:in `block in <class:BrowsingTest>'
    test/test_helper.rb:299:in `run'

Error:
BrowsingTest#test_search_for_hash:
Capybara::ExpectationNotMet: expected to find css ".selectable[data-object-uuid]" but there were no matches
    test/performance/browsing_test.rb:43:in `block (2 levels) in <class:BrowsingTest>'
    test/performance/browsing_test.rb:40:in `block in <class:BrowsingTest>'
    test/test_helper.rb:299:in `run'

2 runs, 0 assertions, 0 failures, 1 errors, 0 skips
Coverage report generated for Unit Tests to /data/1/jenkins/workspace/qr1hi-automated-performance-suite/apps/workbench/coverage. 198 / 473 LOC (41.86%) covered.
Coverage report Rcov style generated for Unit Tests to /data/1/jenkins/workspace/qr1hi-automated-performance-suite/apps/workbench/coverage/rcov
rake aborted!
Command failed with status (1): [ruby -I"lib:test" -I"/usr/local/rvm/gems/ruby-2.1.5/gems/rake-10.4.2/lib" "/usr/local/rvm/gems/ruby-2.1.5/gems/rake-10.4.2/lib/rake/rake_test_loader.rb" "test/performance/**/*_test.rb" ]
/usr/local/rvm/gems/ruby-2.1.5/bin/ruby_executable_hooks:15:in `eval'
/usr/local/rvm/gems/ruby-2.1.5/bin/ruby_executable_hooks:15:in `<main>'
Tasks: TOP => test:benchmark
(See full trace by running task with --trace)
Previous HEAD position was 85f8ab2... Bump up rails package iteration.
Switched to branch 'master'
Your branch is behind 'origin/master' by 4870 commits, and can be fast-forwarded.
  (use "git pull" to update your local branch)

The error about websockets is a bit mysterious but also probably a red herring. It may be related to the older version of phantomjs (1.9.8) we are pinned to, at least according to this page: http://www.howtobuildsoftware.com/index.php/how-do/7o0/amazon-web-services-meteor-websocket-phantomjs-error-during-websocket-handshake-location-mismatch-wss-ws-

I verified in the logs on qr1hi that the wss:// service is the one being contacted during the test (so it's not trying to connect to the non-tls version).

The first diagnostics test that fails (BrowsingTest#test_search_for_hash) could be a legitimate failure: doing the same thing (searching for the word 'hash' in the search box) takes 1.3 minutes according to my browser's debug console.

Suggested actions:

a) upgrade workbench's bundle to see if that helps with the wss vs ws issue
b) fix search performance on qr1hi (for reference, the same search takes just 33s on su92l)


Subtasks

Task #11205: Review branch 11071-fts-perf-testResolvedRadhika Chippada


Related issues

Copied to Arvados - Bug #11261: [API] full text search on pipeline_instances table is ext... New

Associated revisions

Revision b1aa6c85
Added by Radhika Chippada 4 months ago

closes #11071
Merge branch '11071-fts-perf-test'

History

#1 Updated by Ward Vandewege 6 months ago

  • Description updated (diff)

#2 Updated by Ward Vandewege 6 months ago

  • Description updated (diff)

#3 Updated by Tom Morris 5 months ago

  • Target version set to 2017-03-15 sprint

#4 Updated by Tom Morris 5 months ago

  • Description updated (diff)

Here are the first few commits after the last successful pass (read from bottom up):

commit b27f1106beb0fb321ee123501afb5ccf2354a363
Author: Tom Clegg <>
Date: Wed Dec 28 10:21:25 2016 -0500

Load keep-balance config from default location if -config arg not given.
No issue #

commit 6cec281e8653731602ab871cc41ddd21ca8182ab
Author: Tom Clegg <>
Date: Wed Dec 28 00:52:29 2016 -0500

Fix use of unquoted user input in regexp.

commit 4ee5a6efb8b01afddddaa8ce5ed7d0de42a287d6
Author: Tom Clegg <>
Date: Wed Dec 28 00:26:42 2016 -0500

10538: Add /arvados/v1/collections/{uuid}/trash endpoint.

commit eebb0457f1bb69ec2084cbefcbff8c19e404d556
Author: Tom Clegg <>
Date: Tue Dec 27 14:01:11 2016 -0500

10538: Use include_trash in keep-balance to avoid premature block deletes.

commit 0aa01993bc8533d417dc510b1860a4a583e093b4
Author: Tom Clegg <>
Date: Tue Dec 27 13:57:53 2016 -0500

10538: Exchange expires_at for new trash_at and delete_at columns.

commit cdd8dc7bc4cca452e25c5b014e5f2bb592fb31ce
Author: Tom Clegg <>
Date: Tue Dec 27 15:30:03 2016 -0500

Close database connections when ending threads.

commit f6e68adfecb95e42e984951794c96f54166b7f13
Merge: ec600fb 90633a9
Author: Tom Clegg <>
Date: Tue Dec 27 11:16:18 2016 -0500

Merge branch '10273-check-superuser-token-scope' refs #10273

#5 Updated by Tom Morris 5 months ago

Actually the commits are likely a red herring since it doesn't look like qr1hi was deployed on or immediately after Dec. 26, 2016.

The success threshold for this test is 50 seconds (it'd be nice to have that in the log somewhere).

Times (X-Runtime in seconds) for the same search for "hash" from Workbench are:

qr1hi: 80.7 sec
su92l: 20.5
e51c5:  3.5

Since the search code appears to search each table separately in a fixed order and qr1hi only gets as far as jobs, while others get to pipeline instances, my working theory is that this is related to the size and/or performance of the jobs table.

The performance threshold is way too long for this test. We should get it fixed and lower the threshold to some reasonable starting point (20-30 sec?) and then work on lowering it more over time.

#6 Updated by Tom Morris 5 months ago

Another bit of evidence is this long API call from the logs on qr1hi which looks like it basically accounts for all of the ~80 seconds:

{"method":"POST",
"path":"/arvados/v1/groups/contents",
"format":"*/*",
"controller":"arvados/v1/groups",
"action":"contents",
"status":200,
"duration":79628.01,
"view":0.52,
"db":78694.26,
"params":{
 "api_token":"zzzz",
 "reader_tokens":"[\"xxxx\"]",
"current_request_id":"1487798758-607281570",
"_method":"GET",
"limit":"200",
"offset":"0",
"filters":"[[\"any\",\"@@\",\"hash:*\"]]"},
"@timestamp":"2017-02-22T21:27:18Z",
"@version":"1",
"message":"[200] POST /arvados/v1/groups/contents (arvados/v1/groups#contents)" 
}

#8 Updated by Tom Clegg 5 months ago

Looks like we need to use count=none for Workbench search...

#9 Updated by Tom Morris 5 months ago

  • Status changed from New to In Progress
  • Assignee set to Radhika Chippada

#10 Updated by Radhika Chippada 5 months ago

Branch 11071-fts-perf-test @ cad4856 takes a stab at using count='none' for full text search in workbench.

This needed API server updates because the contents method was not actually working correctly when "count = none" for multiple object types (works correctly only with the first object type or until first time @limit is reached). While at it I also enhanced this method to not process next object types once we have @limit number of items on hand in this context.

Still need to add tests and get some numbers to compare before and after performance.

#12 Updated by Lucas Di Pentima 4 months ago

This lgtm. Only one little question: Do you think that the line at services/api/app/controllers/arvados/v1/groups_controller.rb:99 could be improved in terms of readabilty?
Have you made some performance comparison tests?

Thanks!

#13 Updated by Radhika Chippada 4 months ago

Do you think that the line at services/api/app/controllers/arvados/v1/groups_controller.rb:99 could be improved in terms of readability?

Right. It was hard to follow. I updated the indentation and it is easier to follow now. Thanks.

Have you made some performance comparison tests?

I have current numbers from 4xphq and will add the "after" numbers once the environment gets updated. Will add them to this ticket afterwards. Thanks.

#14 Updated by Radhika Chippada 4 months ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:b1aa6c85c3db947f0963b1bddb2784c115f5c97d.

#15 Updated by Radhika Chippada 4 months ago

I searched using "hash" in our staging environment, and I saw about 10% improvement for the first page.

Before:

{"method":"POST","path":"/arvados/v1/groups/contents","format":"*/*","controller":"arvados/v1/groups","action":"contents","status":200,"duration":624.68,"view":0.28,"db":129.8,"params":{"api_token":"47vn13qu20hv5r772k7lsnm4b24042w0q8wsc9ljy9nm88rhdm","reader_tokens":"[\"2eic861cbmi1q3b91rl102yr1zli02dzwrfbo1up5fdnkqi5cy\"]","current_request_id":"1489510582-520279100","_method":"GET","limit":"200","offset":"0","filters":"[[\"any\",\"@@\",\"hash:*\"]]"},"@timestamp":"2017-03-14T16:56:22.983830111Z","@version":"1","message":"[200] POST /arvados/v1/groups/contents (arvados/v1/groups#contents)"}

====

After:

{"method":"POST","path":"/arvados/v1/groups/contents","format":"*/*","controller":"arvados/v1/groups","action":"contents","status":200,"duration":565.18,"view":0.23,"db":62.36,"params":{"api_token":"47vn13qu20hv5r772k7lsnm4b24042w0q8wsc9ljy9nm88rhdm","reader_tokens":"[\"2eic861cbmi1q3b91rl102yr1zli02dzwrfbo1up5fdnkqi5cy\"]","current_request_id":"1489529326-531454209","_method":"GET","limit":"200","offset":"0","count":"none","last_object_class":"","filters":"[[\"any\",\"@@\",\"hash:*\"]]"},"@timestamp":"2017-03-14T22:08:47.588090871Z","@version":"1","message":"[200] POST /arvados/v1/groups/contents (arvados/v1/groups#contents)"}

#16 Updated by Tom Clegg 4 months ago

Version query:

curl https://${ARVADOS_API_HOST}/discovery/v1/apis/arvados/v1/rest | jq .source_version

Test query:

time curl -H "Authorization: OAuth2 $ARVADOS_API_TOKEN" -d count=none -d filters='[["any","@@","hash:*"]]' -d limit=1 -d _method=GET https://${ARVADOS_API_HOST}/arvados/v1/groups/contents | jq '.items[0].href'

results:

version cluster count= real
7eda588-7 qr1hi none 1m19.801s
7eda588-7 qr1hi exact 1m20.374s
b1aa6c8-7 4xphq none 0m0.046s
b1aa6c8-7 4xphq exact 0m7.106s
b1aa6c8-7 qr1hi none 0m0.129s
b1aa6c8-7 qr1hi exact 1m18.449s

Another test query that ensures we query the collections table:

time curl -H "Authorization: OAuth2 $ARVADOS_API_TOKEN" -d count=none -d filters='[["any","@@","hash:*"],["uuid","is_a","arvados#collection"]]' -d limit=1 -d _method=GET https://${ARVADOS_API_HOST}/arvados/v1/groups/contents | jq '.items[0].href'

results:

version cluster count= real
7eda588-7 qr1hi none 0m0.134s
7eda588-7 qr1hi exact 0m0.143s
b1aa6c8-7 4xphq none 0m0.051s
b1aa6c8-7 4xphq exact 0m0.063s

Hm. Tried a few other queries on qr1hi, and all of the slowness seems to be the fault of the pipeline_instances table:

count= uuid is_a ... real
exact arvados#pipelineTemplate 0m0.059s
exact arvados#containerRequest 0m0.186s
exact arvados#job 0m0.106s
exact arvados#workflow 0m0.119s
exact arvados#group 0m0.068s
exact arvados#collection 0m0.143s
exact arvados#pipelineInstance 2m40.890s
none arvados#pipelineInstance 2m41.996s

#17 Updated by Nico C├ęsar 4 months ago

I updated the table above:

time curl -H "Authorization: OAuth2 $ARVADOS_API_TOKEN" -d count=none -d filters='[["any","@@","hash:*"]]' -d limit=1 -d _method=GET https://${ARVADOS_API_HOST}/arvados/v1/groups/contents | jq '.items[0].href'

results:

version cluster count= real
b1aa6c8-7 qr1hi none 0m0.129s
b1aa6c8-7 qr1hi exact 1m18.449s

also with version b1aa6c8-7:

time curl -H "Authorization: OAuth2 $ARVADOS_API_TOKEN" -d count=none -d filters='[["any","@@","hash:*"],["uuid","is_a","arvados#collection"]]' -d limit=1 -d _method=GET https://${ARVADOS_API_HOST}/arvados/v1/groups/contents | jq '.items[0].href'
count= uuid is_a ... real
none arvados#pipelineTemplate 1m18.454s
exact arvados#pipelineTemplate 2m37.067s

Also available in: Atom PDF