Story #6218

[SDKs] [Performance] Add performance tests for Python SDK collection methods

Added by Brett Smith over 6 years ago. Updated over 6 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
SDKs
Target version:
Start date:
06/19/2015
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
0.5

Description

Short version: implement a separate test suite that reports timing profiles. Eventually we'll have performance tests for the entire SDK, but for this ticket, just write tests to exercise collection components:

  • Instantiating a Collection
  • Listing files in a Collection
  • Updating one Collection by copying files from another, and updating
  • Creating a Collection from scratch by instantiating a new Collection, copying files to it from an existing Collection, and saving
  • (Others? But again, the focus is on Collections performance. In particular, avoid wrapping Keep client activity into this.)

When in doubt, for implementation guidance, follow the pattern for the API server and Workbench established in #6087, unless that's grossly un-Pythonic. Use the large collection test fixtures created during that development.


Subtasks

Task #6371: Add performance profiling framework to python sdk (Resolved, Radhika Chippada)

Task #6372: Review branch: 6218-add-performance-profiling-to-python-sdk (Resolved, Tom Clegg)

Task #6293: Review 6218-python-sdk-perf @ commit:d4c6c7c (Resolved, Tom Clegg)

Associated revisions

Revision 69107d28
Added by Radhika Chippada over 6 years ago

refs #6218
Merge branch '6218-add-performance-profiling-to-python-sdk'

Revision 97f5239f
Added by Tom Clegg over 6 years ago

Merge branch '6218-python-sdk-perf' closes #6218

History

#1 Updated by Brett Smith over 6 years ago

  • Description updated (diff)
  • Category set to SDKs

#2 Updated by Radhika Chippada over 6 years ago

  • Subject changed from [SDKs] Add performance tests for Python SDK collection methods to [SDKs] [Performance] Add performance tests for Python SDK collection methods

#3 Updated by Tom Clegg over 6 years ago

  • Assigned To set to Tom Clegg

#4 Updated by Peter Amstutz over 6 years ago

See https://arvados.org/issues/3198#note-39 and https://arvados.org/issues/3198#note-40 for some benchmarking notes around FUSE and Collections.

#5 Updated by Brett Smith over 6 years ago

Here are some thoughts about how we might go about implementing this. These are just my suggestions based on familiarity with Python. They're not requirements or acceptance criteria for the story, unless Tom says, "Yes, implement it this way."

To me, the most important thing is that our profiling suites be consistent across modules: the Python SDK and FUSE should do it the same way if that's even a little doable. That will save us the most mental overhead in the long run.

With that said, here's my main idea:

Put the profiling tests in a dedicated subdirectory, profiling or performance or something. I'm happy for other people to bikeshed the name. ;)

From here, you have (at least) two options:

  1. If you want to use unittest.TestCase subclasses to organize the preparation and test code, great! Go ahead and write a module of tests just like we have under tests, where each file matches test_*.py. It's easy to run the entire module of test cases from the component root by running python -m unittest discover SUBDIR. Or we can rig up fancier loading and running if we want or need that—see pydoc unittest.TestLoader.discover for what you can do.
  2. If unittest doesn't provide a good organization strategy, no big deal. I don't see a problem with just writing our own script or module to do testing in the subdirectory. Just try to write something that's general enough that there's a clear way to add more benchmarks in the future—possibly for non-collections operations.
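As a sketch of what that discovery mechanism looks like when driven from code rather than the command line (the `performance` directory and `Demo` test class here are made up purely for illustration):

```python
import os
import unittest

# Build a throwaway "performance" package with one test module, just to
# demonstrate the discovery mechanics; the names are hypothetical.
os.makedirs('performance', exist_ok=True)
open(os.path.join('performance', '__init__.py'), 'w').close()
with open(os.path.join('performance', 'test_demo.py'), 'w') as f:
    f.write(
        'import unittest\n'
        'class Demo(unittest.TestCase):\n'
        '    def test_ok(self):\n'
        '        self.assertTrue(True)\n'
    )

# Programmatic equivalent of `python -m unittest discover performance`
suite = unittest.TestLoader().discover('performance', pattern='test_*.py')
result = unittest.TextTestRunner(verbosity=0).run(suite)
```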

However it's scaffolded, the profiling tests should use the standard profile module. profile.run() executes a statement and can write a file (specified by the optional second argument) with tons of information about the time spent in different functions while running that code. For each operation that we instrument, save the file to a dedicated place with a nice name. For example, if you're using unittest, it could match the test_ method name. Save the output somewhere under our dedicated subdirectory tree. Add a pattern for those files to .gitignore.

(Silly implementation detail: prefer cProfile when it's available for better performance:

try:
    import cProfile as profile
except ImportError:
    import profile
    # Maybe calibrate the profiler here, per the module docs.
    # I don't think that's critical for 1.0, though.

)

I think simply generating the file should be enough to give us the data we need to see if there are hotspots where we need to improve performance. However, if you want to parse out specific numbers (to report them directly from the tests, compare them against known benchmarks, or whatever), you can do that with the pstats module.
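A small sketch of pulling numbers back out with pstats (the workload is hypothetical; `total_tt` and `total_calls` are attributes the `Stats` object exposes after loading):

```python
import cProfile
import pstats

def hot_loop():
    # hypothetical workload, just to generate some profile data
    return sum(i * i for i in range(200000))

prof = cProfile.Profile()
prof.runcall(hot_loop)

# pstats can extract specific numbers from the profile data, e.g. to
# compare against a known benchmark threshold.
stats = pstats.Stats(prof)
stats.sort_stats('cumulative')
total_seconds = stats.total_tt   # total time recorded across all functions
```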

#6 Updated by Radhika Chippada over 6 years ago

Tom: I added performance profiling to the python sdk, which can be used from fuse as well. Sorry, I did not get to discuss it with you before I started working on it yesterday afternoon. Can you please take a look at the branch 6218-add-performance-profiling-to-python-sdk and give me your comments?

I will now start working on implementing the performance profiling tests in fuse on this foundation. Thanks.

#7 Updated by Tom Clegg over 6 years ago

Works for me.

I've pushed 6218-add-performance-profiling-to-python-sdk-TC with a couple of changes:
  • Profiling refactored as a decorator: putting @profiled on a test function is enough to enable profiling for that test
  • Save the result to tmp/profile/{function name} without the timestamp. I figure log/report rotation should be done out of band, since it's easier for (say) a jenkins script to grab the profile data it needs from a file (without wildcards) and archive as it pleases.
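The actual performance_profiler.py isn't shown in the ticket; a minimal sketch of a decorator along the lines described above (assuming cProfile and a tmp/profile output directory, with no timestamp in the filename) might look like:

```python
import functools
import os

try:
    import cProfile as profile
except ImportError:
    import profile

def profiled(fn):
    """Profile fn and dump stats to tmp/profile/<function name>."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        os.makedirs(os.path.join('tmp', 'profile'), exist_ok=True)
        outfile = os.path.join('tmp', 'profile', fn.__name__)
        prof = profile.Profile()
        try:
            return prof.runcall(fn, *args, **kwargs)
        finally:
            # dump even if the test raises, so partial data survives
            prof.dump_stats(outfile)
    return wrapper

@profiled
def test_sum_small():
    return sum(range(1000))
```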

The regular test suite seems to include the performance tests as well, but the comments in performance_profiler.py seem to suggest that you need to do something special to run them. Which way did you intend it to work?

The symlink services/fuse/tests/performance/performance_profiler.py should use a relative path -- currently it points to /home/radhika/... which won't work for most setups. :P

#8 Updated by Radhika Chippada over 6 years ago

Tom, thanks for noticing the symlink issue; I corrected it. Please check.

Also, regarding the comment in performance_profiler.py: it is true the perf tests run as part of the entire suite. But given that it is a subdirectory, I felt the comment might come in handy for someone looking for a hint to run only those tests. I updated it as such. Please let me know if you would like me to remove this comment.

Also, noticed that the exception handling code is throwing an error in performance_profiler.py. Updated it as follows:

-        except e:
-            caught = e
+        except:
+            caught = sys.exc_info()[0]
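For reference, the two fixes capture different things: the bare `except` with `sys.exc_info()[0]` yields the exception class, while `except Exception as e` binds the raised instance (including its message). A quick illustration:

```python
import sys

def catch_via_exc_info():
    try:
        raise ValueError('boom')
    except:
        # sys.exc_info()[0] is the exception *class*, not the instance
        return sys.exc_info()[0]

def catch_via_as():
    try:
        raise ValueError('boom')
    except Exception as e:
        # binds the exception instance itself, message included
        return e
```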

Thanks.

#9 Updated by Tom Clegg over 6 years ago

Radhika Chippada wrote:

Tom, thanks for noticing the symlink issue; I corrected it. Please check.

Looks good, thanks. (I think this and "run test servers" should really be in a separate package, so we don't have to use symlinks and other references across the source tree, but that's an adventure for another day.)

Also, regarding the comment in performance_profiler.py: it is true the perf tests run as part of the entire suite. But given that it is a subdirectory, I felt the comment might come in handy for someone looking for a hint to run only those tests. I updated it as such. Please let me know if you would like me to remove this comment.

No, I just wanted to clarify. The updated comment is better, thanks.

Also, noticed that the exception handling code is throwing an error in performance_profiler.py. Updated it as follows:

Whoops, it looks like I thought I was writing Ruby or something.

But I think this is the right idiom:

    except Exception as e:
        caught = e

#10 Updated by Tom Clegg over 6 years ago

  • Status changed from New to In Progress

#11 Updated by Tom Clegg over 6 years ago

6218-python-sdk-perf @ d4c6c7c

  • This branch isn't a complete solution, but it provides some data points.
  • It adds ~30 seconds to the test suite in my dev environment. (Should we reduce this by using a smaller example? Or stick with the same example used in Workbench and API tests?)

#12 Updated by Brett Smith over 6 years ago

  • Target version changed from 2015-07-08 sprint to 2015-07-22 sprint

#13 Updated by Tom Clegg over 6 years ago

  • Story points changed from 2.0 to 0.5

#14 Updated by Brett Smith over 6 years ago

Reviewing 6218-python-sdk-perf @ d4c6c7c.

Code issues:

  • It looks like the benchmark class expects a real API server. If that's right, it needs to set MAIN_SERVER = {} in the class body to tell TestCaseWithServers to start the API server. Refer to that class' docstring for details. Right now, trying to run the test case in isolation fails, apparently because of this oversight.
  • I admit this was not very well-specified, but I expected profile_new_collection_from_collection_files to use the Collection copy method. That keeps work at the manifest level and out of Keep, which is more in line with this story's scope.
  • In the line data_loc = '%s+%d'.format(hashlib.md5(datablip).hexdigest(), bytes_per_block) I'm afraid you've gotten your C-style and new-style formatting mixed up here. This evaluates to '%s+%d'. The string here should be something like '{}+{:d}'.
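The mixup is easy to demonstrate in isolation (the variable values here are made up; the point is the formatting call):

```python
import hashlib

datablip = b'x' * 1024          # hypothetical block contents
bytes_per_block = 1024

# C-style placeholders are inert under .format(): with no {} fields,
# the extra arguments are ignored and the string comes back unchanged.
broken = '%s+%d'.format(hashlib.md5(datablip).hexdigest(), bytes_per_block)
assert broken == '%s+%d'

# New-style fields actually substitute, yielding a locator-style string.
data_loc = '{}+{:d}'.format(hashlib.md5(datablip).hexdigest(), bytes_per_block)
```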

If you're fine with making straightforward fixes for those, I think the branch is good to merge after that.

Is there a specific reason you didn't write one test method per benchmarked thing? (Even just, "It was faster that way?") I see that you need to keep the setup code out of the profiled methods; I guess I'm wondering if there's some architectural issue that will make that sort of organization difficult in the future.

And then, as a curiosity: I noticed that the results for listings were dominated by the list manipulation in the test code itself, rather than anything in Collections. To try to reduce that overhead, I wrote the attached patch to make the list method an iterator rather than building and returning a list. With that implementation, here are the results:

         2 function calls in 0.000 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 /home/brett/repos/arvados/sdk/python/tests/test_benchmark_collections.py:54(profile_collection_list_files)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

Apparently iteration happens at a low enough level that it becomes invisible to the profiler? Which is kind of interesting in its own right. I honestly can't decide if this approach and these results are "better"—there's less measured, but if the whole method runs so quickly the profiler can't even measure it, who cares? I'll leave the decision about whether or not to use this patch up to you. I more thought it was interesting to observe the result.

Thanks.

#15 Updated by Tom Clegg over 6 years ago

Brett Smith wrote:

Reviewing 6218-python-sdk-perf @ d4c6c7c.

Code issues:

  • It looks like the benchmark class expects a real API server. If that's right, it needs to set MAIN_SERVER = {} in the class body to tell TestCaseWithServers to start the API server. Refer to that class' docstring for details. Right now, trying to run the test case in isolation fails, apparently because of this oversight.

Whoops, fixed. (I didn't notice because I was testing with "run-tests.sh --retry sdk/python_test=...", which means I had an API server whether or not the test case asked for it.)

  • I admit this was not very well-specified, but I expected profile_new_collection_from_collection_files to use the Collection copy method. That keeps work at the manifest level and out of Keep, which is more in line with this story's scope.

Ah, I actually tried to do that but didn't find the right copy() method. I've added that; it takes 10 seconds here, compared to 17 for the "copy data" version. (It seems to me both are interesting so I'm leaving the "copy data" version in too.)

  • In the line data_loc = '%s+%d'.format(hashlib.md5(datablip).hexdigest(), bytes_per_block) I'm afraid you've gotten your C-style and new-style formatting mixed up here. This evaluates to '%s+%d'. The string here should be something like '{}+{:d}'.

Whoops, fixed.

If you're fine with making straightforward fixes for those, I think the branch is good to merge after that.

Done.

Is there a specific reason you didn't write one test method per benchmarked thing? (Even just, "It was faster that way?") I see that you need to keep the setup code out of the profiled methods; I guess I'm wondering if there's some architectural issue that will make that sort of organization difficult in the future.

No, separate tests would work fine. I did it this way just to avoid needlessly repeating the test setup work ("build a big manifest").

The caveat for the profiling code is that you wouldn't want to do this

@profiled
def p1(self):
    pass

@profiled
def test_foo(self):
    self.p1()

@profiled
def test_bar(self):
    self.p1()

...because the second p1() would overwrite the results for the first p1().

But this would work fine:

def setUp(self):
    pass

def p1(self):
    pass

@profiled
def test_foo(self):
    self.p1()

@profiled
def test_bar(self):
    self.p1()

And then, as a curiosity: I noticed that the results for listings were dominated by the list manipulation in the test code itself, rather than anything in Collections. To try to reduce that overhead, I wrote the attached patch to make the list method an iterator rather than building and returning a list. With that implementation, here are the results:

OK, I like the generator refactor. I've applied your patch.

Apparently iteration happens at a low enough level that it becomes invisible to the profiler? Which is kind of interesting in its own right. I honestly can't decide if this approach and these results are "better"—there's less measured, but if the whole method runs so quickly the profiler can't even measure it, who cares? I'll leave the decision about whether or not to use this patch up to you. I more thought it was interesting to observe the result.

I think your version measures only the time to get to the first "yield", though: after that, the profiled method returns, and profiling stops. But SDK code runs later when the generator is consumed, and (according to me) that code is worth measuring. So I moved the sum(1 for ...) bit into the profiled method. Now the sum() part is being measured too, but now we're measuring "file entry generator + simplest imaginable consumer". And it reports sensible results:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
40101/10001    0.207    0.000    0.672    0.000 /home/tom/src/arvados/sdk/python/tests/test_benchmark_collections.py:15(list_recursive)
    10201    0.093    0.000    0.443    0.000 /home/tom/src/arvados/sdk/python/arvados/arvfile.py:233(synchronized_wrapper)
    10201    0.088    0.000    0.146    0.000 /usr/lib/python2.7/threading.py:121(acquire)

The test function is still the most time-consuming part, and it doesn't look like file listing is a big optimization target. But we have a number to stick on it, which is basically what we're after here, right...
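Tom's point about where the measurement happens can be reproduced in miniature (the generator below is a hypothetical stand-in for the file-listing method):

```python
import cProfile
import pstats

def list_entries():
    # stand-in for a generator-based file listing
    for n in range(100000):
        yield str(n)

# Profiling only the call that creates the generator records almost nothing:
# the body runs later, outside the profiler, when the generator is consumed.
p1 = cProfile.Profile()
gen = p1.runcall(list_entries)
count_outside = sum(1 for _ in gen)   # iteration happens unprofiled

# Moving the simplest imaginable consumer inside the profiled call
# captures the iteration work as well.
p2 = cProfile.Profile()
count_inside = p2.runcall(lambda: sum(1 for _ in list_entries()))

time_outside = pstats.Stats(p1).total_tt
time_inside = pstats.Stats(p2).total_tt
```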

#16 Updated by Tom Clegg over 6 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 67 to 100

Applied in changeset arvados|commit:97f5239f053b1691d7f2cb56230386921f8ea4d4.

#17 Updated by Brett Smith over 6 years ago

Tom Clegg wrote:

Brett Smith wrote:

Apparently iteration happens at a low enough level that it becomes invisible to the profiler? Which is kind of interesting in its own right. I honestly can't decide if this approach and these results are "better"—there's less measured, but if the whole method runs so quickly the profiler can't even measure it, who cares? I'll leave the decision about whether or not to use this patch up to you. I more thought it was interesting to observe the result.

I think your version measures only the time to get to the first "yield", though: after that, the profiled method returns, and profiling stops. But SDK code runs later when the generator is consumed, and (according to me) that code is worth measuring. So I moved the sum(1 for ...) bit into the profiled method. Now the sum() part is being measured too, but now we're measuring "file entry generator + simplest imaginable consumer". And it reports sensible results:

Ohhhhhh, of course you're right. I should've caught that, that makes a lot more sense. Thanks for spelling it out and finishing up where I left off.

The test function is still the most time-consuming part, and it doesn't look like file listing is a big optimization target. But we have a number to stick on it, which is basically what we're after here, right...

Yes, definitely. Knowing that it's not worth optimizing this is valuable information.
