Story #6203

[API] [Performance] Optimize time spent on the API server side during a large collection creation.

Added by Radhika Chippada about 5 years ago. Updated about 5 years ago.

Assigned To:
Radhika Chippada
Target version:
Start date:
Due date:
% Done:


Estimated time:
(Total: 0.00 h)
Story points:


See for complete detail.

It seems we can reduce several seconds in the overall time taken to create a large collection by optimizing the following:

The workbench test takes 21 seconds to do create. Of these 21 seconds, 17seconds is being spent in the following areas on the API server side:

App 7227 stdout: *****  check_signatures start 2015-05-29 16:03:57 -0400
App 7227 stdout: *****  compute_pdh start 2015-05-29 16:03:57 -0400
App 7227 stdout: *****  compute_pdh end 2015-05-29 16:03:59 -0400
App 7227 stdout: *****  check_signatures end 2015-05-29 16:04:04 -0400
App 7227 stdout: *****  strip_manifest_text start 2015-05-29 16:04:04 -0400
App 7227 stdout: *****  strip_manifest_text end 2015-05-29 16:04:06 -0400
App 7227 stdout: *****  set_portable_data_hash start 2015-05-29 16:04:06 -0400
App 7227 stdout: *****  set_portable_data_hash end 2015-05-29 16:04:08 -0400
App 7227 stdout: *****  signed_manifest start 2015-05-29 16:04:09 -0400
App 7227 stdout: *****  signed_manifest end 2015-05-29 16:04:14 -0400

And, compute_pdh is being invoked twice.


Task #6113: Review branch: 6203-collection-perf-apiResolvedRadhika Chippada

Related issues

Related to Arvados - Story #6092: [TBD] Improve the performance of the worst-performing collections componentResolved

Related to Arvados - Story #6270: [SDKs] [Performance] Ruby SDK performance metricsClosed06/09/2015

Associated revisions

Revision bed1ec60
Added by Tom Clegg about 5 years ago

Merge branch '6203-locator-regexp' refs #6203 refs #6277

Revision 361e83e8 (diff)
Added by Tom Clegg about 5 years ago

Update arvados in bundle. refs #6203 refs #6277

Revision 1a98070d
Added by Radhika Chippada about 5 years ago

closes #6203
Merge branch '6203-collection-perf-api'


#1 Updated by Brett Smith about 5 years ago

  • Assigned To set to Radhika Chippada
  • Target version set to 2015-06-10 sprint

#2 Updated by Radhika Chippada about 5 years ago

  • Status changed from New to In Progress

#3 Updated by Radhika Chippada about 5 years ago

Below are the numbers after I tweaked the above mentioned methods:

Workbench tests: test/integration_performance/collection_unit_test.rb
In my branch:
0.158198771s build example
0.512019084s new (manifest size = 6MiB)
9.831454755s create
3.164605099s read
0.013937852s read(cached)
6.026941976s list
4.689271296s update(name-only)
13.405557288s update
3.584947671s delete
0.052028055s read(404)
BigCollectionTest#test_crud_cycle_for_collection_with_big_manifest_(compress=true) = 42.39 s = .

In master:
0.130952126s build example
0.525899681s new (manifest size = 6MiB)
14.075667663s create
4.108906145s read
0.012675329s read(cached)
8.084185956s list
6.344242344s update(name-only)
19.684942139s update
4.963485125s delete
0.039076585s read(404)
BigCollectionTest#test_crud_cycle_for_collection_with_big_manifest_(compress=true) = 58.95 s = .

API TESTS: test/integration/collections_performance_test.rb
In my branch:
CollectionsApiPerformanceTest#test_crud_cycle_for_a_collection_with_a_big_manifest = 0.28716071s make example
0.134842282s JSON encode 18MiB manifest
11.402078717s create
3.083956521s read
3.086899177s list
12.464819769s update
3.742708626s delete
35.18 s = .

In master:
CollectionsApiPerformanceTest#test_crud_cycle_for_a_collection_with_a_big_manifest = 0.296271169s make example
0.131979662s JSON encode 18MiB manifest
16.23883748s create
3.903416969s read
3.928377929s list
18.550129007s update
4.560380538s delete
48.61 s = .

#4 Updated by Tom Clegg about 5 years ago

6203-collection-perf-api @ 12afe0e

compute_pdh and caching

I think the new compute_pdh method is dangerous because it doesn't necessarily compute the pdh. For example, this looks like pdh2 will come from c2.manifest_text, but it doesn't:
  • pdh1 = c1.compute_pdh
    c1.manifest_text = c2.manifest_text
    pdh2 = c1.compute_pdh
I'm finding the following things more confusing than they deserve to be:
  • @need_pdh_validation
  • compute_pdh()
  • @computed_pdh
  • computed_pdh()
  • computed_pdh (local variable in check_signatures)
Perhaps we can make this correct+understandable+fast all at once by doing something like this:
  • compute_pdh method computes the pdh right now.
  • computed_pdh method implements a dependable cache of compute_pdh by doing something like this:
    • def computed_pdh
        if @computed_pdh_for_manifest_text == manifest_text
          return @computed_pdh
        @computed_pdh = compute_pdh
        @computed_pdh_for_manifest_text = manifest_text.dup
  • Get rid of @need_pdh_validation -- just call computed_pdh instead of compute_pdh. In fact, the whole set_portable_data_hash method seems to exist to [a] set a flag that validation is needed, which is irrelevant since validation is always fast now that we have computed_pdh, and [b] transparently modify the "size" part of the client-provided locator to the actual size of portable_manifest_text, which (if we need it at all) would make more sense as an explicit "auto-correct" step in ensure_hash_matches_manifest_text.
  • Get rid of the computed_pdh local variable in check_signatures and use the computed_pdh method instead.
  • After strip_manifest_text, take care of the special case where @computed_pdh is still valid even though we changed manifest_text (i.e., if @computed_pdh_for_manifest_text == manifest_text before stripping, then set @computed_pdh_for_manifest_text = stripped_manifest_text after stripping).

Aside: I think strip_manifest_text_and_... should be renamed to strip_signatures_and_.... The term "strip" is vague, and confusing if you've recently looked at the Python SDK where it refers to what portable_manifest_text does here.

Aside: ensure_hash_matches_manifest_text should probably be renamed to ensure_pdh_matches_manifest_text.

New munge method

The new version no longer modifies its argument, so it should be renamed to just munge_manifest_locators.

manifest.dup is unnecessary in sign_manifest now that munge doesn't modify its argument. In fact, it can now reduce to:

-    m = manifest.dup
-    m = munge_manifest_locators!(m) do |match|
-      Blob.sign_locator(locator_without_signature(match), signing_opts)
-    end
-    return m
+    munge_manifest_locators manifest do |match|
+      Blob.sign_locator(locator_without_signature(match), signing_opts)
+    end

Likewise, in portable_manifest_text:

-    portable_manifest = self[:manifest_text].dup
-    portable_manifest = self.class.munge_manifest_locators!(portable_manifest) do |match|
+    portable_manifest = self.class.munge_manifest_locators!(manifest_text) do |match|

Every non-zero manifest ends with "\n" so I don't think we need the "omit trailing \n if the input didn't have one" code. The only valid case this covers can be handled more simply by putting return "" if manifest=="" right at the top. Likewise, "return nil if manifest.nil?" should eliminate the .andand stuff.

Instead of appending "\n" to the string, you can do this, to avoid that last (and presumably most expensive) string concatenation:

-    if !new_lines.empty?
-      ends_with_newline = manifest.end_with?("\n")
-      manifest = new_lines.join("\n")
-      manifest += "\n" if ends_with_newline
-    end
+    (new_lines + ['']).join("\n")

In self.sign_manifest we now sign locator_without_signature(match). Is there ever a case where sign_manifest is called before the signatures are stripped? I don't think there should be, in which case this could be more efficient as just match[0], instead of locator_without_signature(match). AFAICT, we didn't need to remove signatures here before this branch...

There are two uses of "word.strip" here which seem to be superfluous, since word comes from line.split(' ').

Could we use Ruby's built-in iterators (manifest.each_line and line.each_line(' ')) instead of actually instantiating arrays with split()? Manifests and lines can be many megabytes long, and we don't actually use the third copy of the manifest in order to do what we need. Assuming this change doesn't make performance worse, it seems like a worthwhile memory-consumption optimization. Unfortunately each_line includes the trailing separator with each yielded item, which might make the blocks a bit ugly. We could punt on this -- but please do note if you already tried it and found performance was worse.


This method looks like it does a lot more work than it should. It only ever needs to do a single replacement. Is the split() and the N appends really faster than this?

  def self.locator_without_signature match
    match[0].sub(/\+A[^+]*/, '')

combined validation method

I guess you don't like "_maybe_" in the name, but how about "..._update_replication_confirmed"? I think the new name "..._and_clear_replication_confirmed" suggests it's always going to clear those attrs, making it less helpful than the previous name.

This explanatory comment should probably move to the appropriate place in the new code instead of being lost:
  • -      # If the new manifest_text contains locators whose hashes
    -      # weren't in the old manifest_text, storage replication is no
    -      # longer confirmed.

We should check self.replication_confirmed.nil? directly, instead of adding a new cleared_replication_confirmed variable to track it: the "not in_old_manifest" check is equally useless regardless of when/why replication_confirmed became nil. We should also skip building the in_old_manifest hash in the first place if self.replication_confirmed is already nil. We should also add something to the API unit tests to reveal how slow the replication_confirmed check is -- IIRC all existing performance tests will skip it once we make this optimization.

AIUI, the reason we use self[:manifest_text] instead of the usual manifest_text is to sneakily avoid setting the ActiveRecord "changed" flag when stripping signatures. I think it would be good form to avoid using it in any cases where it isn't necessary. Here, we need to use it when setting, but not when getting.

The part where we build the in_old_manifest hash used to use an each_manifest_locator method which was more efficient because it didn't build a new manifest: it just visited every locator, and ignored the block's return values. The new code loses this optimization. Instead of dropping the each_manifest_locator method, I think we should give it the same optimizations that munge got (i.e., use each or split instead of scan) and use it to build in_old_manifest. That should make it even faster (and use less memory) than munge.


This seems to be copied straight from the Ruby SDK. Can we require 'arvados/keep' and use Keep::Locator::LOCATOR_REGEXP in collection.rb, instead of copying & pasting?

It doesn't seem to have any unit tests. Can we add some in the Ruby SDK test suite?

We pass around "match" objects that depend on the position of regexp matches (match[4] etc). If this is meant to be a supported use of LOCATOR_REGEXP, we should write a test to ensure the right pieces show up in the right positions.

If the =~ operator is faster than match() then Keep::Locator::valid() should be updatd to use =~ too.

LOCATOR_REGEXP erroneously passes the following bogus locators, (I think) because of a spurious "+" in the "hint content" part:


It would be great if we could use the exact regexp documented on the Keep locator format wiki page -- possibly with only the () groupings added.

#5 Updated by Tom Clegg about 5 years ago

6203-collection-perf-ruby-sdk @ 77bf746


The splits variable is poorly named and never used anyway so this should be compressed back to

-        splits = line.split
-        splits.each do |token|
+        line.split.each do |token|


The colons have no apparent reason to be in regexp groups. How about:
  • -    FILE_REGEXP = /^(\d+)(:)(\d+)(:)(.*)$/
    +    FILE_REGEXP = /^(\d+):(\d+):(.*)$/


I don't see why it's a good idea to re-implement each_file_spec here.

The new method seems to do exactly the same amount of work as the old version, except that it saves work by introducing bugs (e.g., it doesn't unescape stream and file names, and it doesn't realize ["./foo","bar"] and [".","foo/bar"] are the same file).

One non-bug tweak I see here is keeping a counter instead of calling seen_files.size each time -- but a quick benchmark indicates the counter method is actually slower, especially if you include the extra hash.include? check before incrementing the counter.

require 'benchmark'
n=4000000 do |x|'Hash#size') { h={}; for i in 1..n; h[h.size.to_s]=true; end }'counter1') { h={}; c=0; for i in 1..n; c+=1; h[c.to_s]=true; end }'counter2') { h={}; c=0; for i in 1..n; c+=1 unless h.include? 'foo'; h[c.to_s]=true; end }

user system total real
Hash#size 6.200000 0.150000 6.350000 ( 6.360331)
counter1 8.740000 0.040000 8.780000 ( 8.801807)
counter2 9.750000 0.100000 9.850000 ( 9.878150)

The other non-bug tweak I see is that it uses FILE_REGEXP instead of not Locator.valid?. Is FILE_REGEXP faster than LOCATOR_REGEXP? If so, we should use that to make each_file_spec faster. (It might even turn out that the old version is the fastest if we fix valid? to use =~ instead of match()...)

(Meanwhile, files_size could conceivably benefit by doing something simpler than each_file_spec, because it returns the same number regardless of how the files are named, whether they're fragmented, etc. However, we could easily shoot ourselves in the foot this way by computing files_size a fast way and then having to run the slow loop anyway because the caller also wants to know files_count. So I don't think this is a worthwhile investment.)

Optimizing for the case where files_count and files_size are called, but files() isn't called, has the same problem -- but at least we know one case where it happens, and it does make it possible to get the file count and size without keeping the big @files array in memory, so it seems worth keeping.

But if we expect callers to invoke both files_count and files_size, we should avoid being sensitive to the order in which they're called, like this:

def files_count(stop_after=nil)
  @files_size = total_size
  @files_count = count_so_far

def files_size
  return @files_size if @files_size
  files.reduce(0) { |total, (_, _, size)| total + size }


There are various new code paths here that could return different results depending on what order the methods are called. If that's our strategy, we need to test accordingly.

Also, when we introduce more manifest-parsing/iterating code (which we should try to avoid) it should at least get tested on things like COLON_FILENAME_MANIFEST, MANY_ESCAPES_MANIFEST, and MULTILEVEL_MANIFEST_WITH_DIRS_IN_FILENAMES.

#6 Updated by Radhika Chippada about 5 years ago

Responses to

  • Addressed the comments in “compute_pdh and caching”
  • Updated munge method accordingly.
    • Also, removed “ends_with_newline” and instead corrected the failing unit test to include “\n” at the end of manifest text.
    • “In self.sign_manifest we now sign locator_without_signature(match) …” You are right. This is my mistake and corrected it.
    • manifest.each_line is indeed faster than split(“\n”). However this still includes the trailing new line character. Stripping it using rstrip! made it the same as split(“\n”). However, given the memory optimization, this is better and hence updated accordingly.
    • Continuing to use split instead of each_line(‘ ‘) since this is resulting in some performance loss. We can revisit this if in future we experience memory issues.
  • locator_without_signature: using sub (gsub in fact) as you suggested, resulted in another 2s gain in api integration test performance.
  • combined validation method
    • Added the suggested enhancements
    • Did not add tests in the interest of time + as you said no real point in doing this at this stage
    • Please check self[:manifest_text] usages; I could not see any misuses after the updates.
    • Switched to Keep::Locator::LOCATOR_REGEXP as you suggested
    • Did not add the tests; let’s do this as part of #6270
    • I also did not update the LOCATOR_REGEXP. Let’s try to do it as part of #6270 or #6271 rather than adding more to this story.


#7 Updated by Radhika Chippada about 5 years ago

Regarding "Review branch: 6203-collection-perf-ruby-sdk" and

I removed this review task and instead created #6270 and #6271. Based on what we learned from #6087 and #6203 (benchmarking API methods and improving performance), I think there are no short cuts and we need to do a thorough job of first measuring the performance of various methods and making updates to the ruby SDK based on those findings. Thanks for reviewing though.

#8 Updated by Tom Clegg about 5 years ago

Updates look good. I've pushed a few more fixes on 6203-collection-perf-api-TC. Mostly comments and clarity, but also saved a bit of time in "sign" by removing an unused split() left over from changing to each_line. Can you take a look, and merge if these tweaks look OK?

#9 Updated by Tom Clegg about 5 years ago

6203-collection-perf-api-TC @ 02156d7

#10 Updated by Radhika Chippada about 5 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:1a98070dcdd379ab6effdc3fc4301c9435a137b1.

Also available in: Atom PDF