Bug #5614

[Workbench] 504 error returned when trying to create a large new collection

Added by Sarah Guthrie about 4 years ago. Updated about 4 years ago.

Status:
New
Priority:
Normal
Assigned To:
-
Category:
-
Target version:
Start date:
03/31/2015
Due date:
% Done:

100%

Estimated time:
(Total: 2.00 h)
Story points:
-

Description

URL creating the timeout:
tb05z-tpzed-1pg2t8r0nk0mxsa%22">https://workbench.tb05z.arvadosapi.com/combine_selected?action_data={%22current_project_uuid%22%3A%22tb05z-tpzed-1pg2t8r0nk0mxsa%22}

All files except '.listing' were selected to be put into the new collection.

Collection at this URL:
https://workbench.tb05z.arvadosapi.com/collections/tb05z-4zz18-exuu9wl8ft9s2js


Subtasks

Task #5686: Review 5614-workbench-optimize-combine-collections-wipResolvedRadhika Chippada


Related issues

Related to Arvados - Bug #4943: [Workbench] [Performance] Combining big collections should start returning a response faster (currently you can get a 502 proxy error even if the collection still combines)New01/08/2015

Associated revisions

Revision 43aa02f5
Added by Brett Smith about 4 years ago

Merge branch '5614-workbench-optimize-combine-collections-wip'

Refs #5614, #4943. Closes #5686.

Revision a8ef8be8 (diff)
Added by Brett Smith about 4 years ago

5614: Update Workbench Gemfile for new Ruby SDK.

Refs #5614.

History

#1 Updated by Tom Clegg about 4 years ago

  • Target version set to Bug Triage

#2 Updated by Ward Vandewege about 4 years ago

  • Subject changed from 504 error returned when trying to create a large new collection to [Workbench] 504 error returned when trying to create a large new collection

#3 Updated by Brett Smith about 4 years ago

  • Status changed from New to In Progress
  • Assigned To set to Brett Smith

#4 Updated by Brett Smith about 4 years ago

  • Target version changed from Bug Triage to 2015-04-29 sprint

#5 Updated by Radhika Chippada about 4 years ago

Review comments:

  • Is this update sufficient enough to address all the performance issues around combining large collections?

    I was unable to access the collection listed in the ticket (collections/tb05z-4zz18-exuu9wl8ft9s2js); Also, I was not sure if is correct or not to point to one of the other environment's API server with my local workbench (because the ruby sdk gem changed). I did point to qr1hi and tested for #4943 and this one still fails. Whether my test is legitimate or not, I doubt if the code updates in this branch go far enough to resolve the performance issue around combining large collections. I have a couple questions / comments around this.

    We have two specific combine collections cases. (1) where a number of files from a specific (single) collection are being combined, (2) two or more collections in their entirety are being combined.
    In case of (1): Assuming this is still not resolved with this code update or otherwise, I wonder: we already have the manifest text for the specific collection when the combine action is being executed. So, would it be possible to not get it again from the api server? (Line 145 of action_controller.rb — select([:uuid, :manifest_text]).each … ). Instead reuse the the manifest text that is already at hand?
    In case of (2): would it be possible to push this whole combining manifest texts business into keep instead of generating the combined manifest text in workbench? By doing so, we do not have to retrieve the manifest text of all the involved large collections, and then combine to get the new combined manifest text, and send it over the wire. Instead we can just tell keep which collections (in their entirety) need to be collected and hence it can be much better performance?
  • As for the current implementation: I think it would be nice to use the “distinct” feature added as part of #5365 at line #135 (select([:uuid, :portable_data_hash]).each …) to get just one uuid for the pdh from api server. Otherwise, we will still get all the uuids matching this pdh, which will be costly if we are working with any of the most popular pdhs.
  • Regarding generating provenance graph links after rendering the new collection: would there be a situation where we redirect user to the new collection (indicating successful processing), and then we experience errors while creating the provenance graph links? Also, would there be a situation where the user sees the new collection and clicks on the provenance graph link while the links are still being created? In light of these issues, I am thinking it is better to complete this (as before) before redirecting the user to the new collection.
  • test_collection.rb => test_copy_stream_over_file_raises_ENOTDIR
    Should line 229 use “source” instead of “./s1”?

#6 Updated by Brett Smith about 4 years ago

Radhika Chippada wrote:

Review comments:

  • Is this update sufficient enough to address all the performance issues around combining large collections?

It definitely isn't. Even after we merge this branch, I'm sure that it will be possible to make combination requests large enough to cause timeouts. I would not close this ticket when I merged this branch. As I said at standup, I think this is the most performance we can get out of Workbench's current architecture. That is the most I felt comfortable doing in the context of bug triage. The more involved solutions you discussed are all good ideas, but they require more architectural planning, and I don't think it would've been appropriate to implement something like that without putting it through the normal planning process first.

I think it's worth merging this to see how much real-world benefit it gets us (and to reduce code duplication by, e.g., using the Ruby SDK more). The feedback we get may help us decide which steps are best to take next.

I was unable to access the collection listed in the ticket (collections/tb05z-4zz18-exuu9wl8ft9s2js)

I can't either. As far as I can tell, it's gone. I tried to track it down through the logs, but didn't have any luck there either.

Also, I was not sure if is correct or not to point to one of the other environment's API server with my local workbench (because the ruby sdk gem changed).

That should be fine. Because the branch only uses the changes on the Workbench end, there's no problem talking to an older API server.

We have two specific combine collections cases. (1) where a number of files from a specific (single) collection are being combined, (2) two or more collections in their entirety are being combined.

In case of (1): Assuming this is still not resolved with this code update or otherwise, I wonder: we already have the manifest text for the specific collection when the combine action is being executed. So, would it be possible to not get it again from the api server? (Line 145 of action_controller.rb — select([:uuid, :manifest_text]).each … ). Instead reuse the the manifest text that is already at hand?

I don't believe the manifest text is already at hand in the context of handling the combine request. Because the route is not associated with any specific object, no object is loaded during the request's @before_filter@s. Here is a log from creating a new collection out of files from one existing collection in the test fixtures. It shows that the collection is only loaded once:

Started POST "/combine_selected?action_data=%7B%22current_project_uuid%22%3A%22zzzzz-tpzed-xurymjxw79nv3jz%22%7D" for ::1 at 2015-04-13 10:06:30 -0400
Processing by ActionsController#combine_selected_files_into_collection as HTML
  Parameters: {"authenticity_token"=>"f3k6NOmbD0UU/n2Swc+gqtAdfvGoGVp4D0skA+2u4wc=", "selection"=>["zzzzz-4zz18-duplicatenames1/dir2/alice.txt"], "action_data"=>"{\"current_project_uuid\":\"zzzzz-tpzed-xurymjxw79nv3jz\"}"}
API client: 0.000243524 Prepare request https://localhost:3030/arvados/v1/users/current
API client: 0.225274682 API transaction
API client: 0.000126819 Parse response
API client: 0.000512186 Prepare request https://localhost:3030/arvados/v1/collections  {"uuid":["zzzzz-4zz18-duplicatenames1"]}
API client: 0.088132181 API transaction
API client: 0.00014334 Parse response
API client: 0.000367625 Prepare request https://localhost:3030/arvados/v1/groups/zzzzz-tpzed-xurymjxw79nv3jz
API client: 0.081459805 API transaction
API client: 0.000302036 Prepare request https://localhost:3030/arvados/v1/collections
API client: 0.110496374 API transaction
API client: 0.000200767 Parse response
Redirected to http://localhost:6300/collections/bcsdv-4zz18-vx2vlzxizx8ozxh
API client: 0.000469982 Prepare request https://localhost:3030/arvados/v1/links
API client: 0.112106413 API transaction
API client: 0.000246148 Parse response
Completed 302 Found in 642ms (ActiveRecord: 0.0ms)

In case of (2): would it be possible to push this whole combining manifest texts business into keep instead of generating the combined manifest text in workbench?

We've discussed lots of options along these lines, and there are a lot of good ideas here. But I believe they need more planning before they can be implemented.

  • As for the current implementation: I think it would be nice to use the “distinct” feature added as part of #5365 at line #135 (select([:uuid, :portable_data_hash]).each …) to get just one uuid for the pdh from api server. Otherwise, we will still get all the uuids matching this pdh, which will be costly if we are working with any of the most popular pdhs.

distinct only removes completely duplicate records from the results. Because we're requesting object UUIDs, there are no duplicate records in the results, and distinct will have no effect. Here's an illustration using the Python SDK. The query is the same as the Workbench code makes; but even though it says distinct=True, the API server returns five results (the limit):

>>> cl = api.collections().list(
          filters=[['portable_data_hash', 'in', ['d41d8cd98f00b204e9800998ecf8427e+0']]],
          select=['uuid', 'portable_data_hash'],
          distinct=True,
          limit=5).execute()
>>> pprint.pprint(cl['items'])
[{u'kind': u'arvados#collection',
  u'portable_data_hash': u'd41d8cd98f00b204e9800998ecf8427e+0',
  u'uuid': u'qr1hi-4zz18-009q2812m93xtqj'},
 {u'kind': u'arvados#collection',
  u'portable_data_hash': u'd41d8cd98f00b204e9800998ecf8427e+0',
  u'uuid': u'qr1hi-4zz18-01ii7cw6q7l5ba1'},
 {u'kind': u'arvados#collection',
  u'portable_data_hash': u'd41d8cd98f00b204e9800998ecf8427e+0',
  u'uuid': u'qr1hi-4zz18-01j3h0jw9dtu1y4'},
 {u'kind': u'arvados#collection',
  u'portable_data_hash': u'd41d8cd98f00b204e9800998ecf8427e+0',
  u'uuid': u'qr1hi-4zz18-03e0as45xin25po'},
 {u'kind': u'arvados#collection',
  u'portable_data_hash': u'd41d8cd98f00b204e9800998ecf8427e+0',
  u'uuid': u'qr1hi-4zz18-040l3p2c98ats0o'}]
  • Regarding generating provenance graph links after rendering the new collection: would there be a situation where we redirect user to the new collection (indicating successful processing), and then we experience errors while creating the provenance graph links? Also, would there be a situation where the user sees the new collection and clicks on the provenance graph link while the links are still being created? In light of these issues, I am thinking it is better to complete this (as before) before redirecting the user to the new collection.

You are right about all the potential problems. But the old code has other problems: if there's an error creating a link, it will report that error to the user, even though the collection has already been made. So the old code can create situations where Workbench creates the collection, then tells the user, "Sorry, I couldn't handle your request"—but the new collection will appear in listings as the user continues to work. #4943 reports an issue along these lines.

In general, I feel like we don't have a good model for handling this kind of error reporting when we need to create multiple objects, and some of them fail. In this specific case, I feel like the collection is the focus, and the links are nice-to-have, so it makes sense to let the user know as soon as creating the collection has succeeded. But if you still feel that the problems with the old code are better to have, I can revert these changes.

  • test_collection.rb => test_copy_stream_over_file_raises_ENOTDIR
    Should line 229 use “source” instead of “./s1”?

Yes it should. Thank you very much for catching that. Now at b1aa428.

#7 Updated by Radhika Chippada about 4 years ago

  • Yes, unfortunately distinct does not help in this case. Please ignore that comment.
  • As for provenance graph related concerns: In the event that something went wrong after the user is redirected to the new collection page, what happens if the user wants to see the prov graph? Does it ever recover? If not, I think it is desirable that the user sees error during the provenance graph link creations, and starts all over again and creates yet another collection. I think this way the user will have a fully functioning collection as designed by the system. Since this update affects all combine cases, not just large collections, I think it is desirable to have it work correctly in all cases. What do you think?
  • Too bad we are not implementing a full solution to address the performance issue here. I am ok if you want to merge this code into master knowing that this is most likely not going to resolve the performance issues. Thanks.

#8 Updated by Brett Smith about 4 years ago

Radhika Chippada wrote:

  • As for provenance graph related concerns: In the event that something went wrong after the user is redirected to the new collection page, what happens if the user wants to see the prov graph? Does it ever recover? If not, I think it is desirable that the user sees error during the provenance graph link creations, and starts all over again and creates yet another collection. I think this way the user will have a fully functioning collection as designed by the system. Since this update affects all combine cases, not just large collections, I think it is desirable to have it work correctly in all cases. What do you think?

I have gone back to creating the links before redirecting the user. As a quick patch to try to address some of the problems I discussed, I added a flash to the page to let the user know if an error occurred creating these links. Ready for another look at 5ef491b.

#9 Updated by Brett Smith about 4 years ago

  • Status changed from In Progress to New
  • Target version changed from 2015-04-29 sprint to Bug Triage

Moving back to bug triage to encourage consideration and planning of some of the larger development efforts discussed above.

#10 Updated by Brett Smith about 4 years ago

  • Assigned To deleted (Brett Smith)
  • Target version changed from Bug Triage to Arvados Future Sprints

Also available in: Atom PDF