Bug #5037

[SDKs] Improve Python SDK thread safety and document pitfalls

Added by Brett Smith over 4 years ago. Updated over 4 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
SDKs
Target version:
Start date:
01/20/2015
Due date:
% Done:

100%

Estimated time:
(Total: 2.00 h)
Story points:
0.5

Description

One of our users wrote a Crunch script that, in part, does something like this:

import arvados
import os

def loc_size(coll_loc):
    # This function just needs to read the entire collection.
    # Details aren't important.
    cr = arvados.CollectionReader(coll_loc)
    return sum(len(data) for cf in cr.all_files() for data in cf.readall())

cr = arvados.CollectionReader(COLL_LOC1)
print list(cr.all_streams())

child_pid = os.fork()
if not child_pid:
    print "Child size:", loc_size(COLL_LOC2)
else:
    print "Parent size:", loc_size(COLL_LOC2)
    print os.waitpid(child_pid, 0)

This breaks horribly because all the CollectionReaders end up implicitly sharing a cached API object, and that object is not threadsafe. (It specifically manifests as SSL record errors like "wrong version number" and "decryption failed or bad record mac".)

We need to help users avoid these pitfalls. This could be a combination of SDK code changes and documentation noting the issues. We need more discussion to figure out what's worth doing.


Subtasks

Task #5117: Review 5037-python-sdk-thread-safeResolvedTom Clegg

Task #5116: Remove connection pooling feature from arvados.api()ResolvedTom Clegg

Task #5166: Remove cache=False from Python SDK clients (bump version requirements instead)ResolvedTom Clegg

Task #5170: Review 5037-nonocacheResolvedTom Clegg

Associated revisions

Revision 9af0808e
Added by Tom Clegg over 4 years ago

Merge branch '5037-python-sdk-thread-safe' refs #5037

Revision 1128f6e0
Added by Tom Clegg over 4 years ago

Merge branch '5037-nonocache' closes #5037

History

#1 Updated by Brett Smith over 4 years ago

One issue: things like CollectionReader currently get their own API object with just arvados.api('v1'). They don't use cache=False because that would download the API definition document every time, which is needlessly expensive. But this is partly because cache=False is semantically overloaded: it means both "don't use a cached object" and "don't use a cached API document." If we could get the former behavior without the latter, that would go a long way to avoiding surprises like this without too much additional overhead.

#2 Updated by Brett Smith over 4 years ago

I think I've convinced myself that we should just do away with the global API client cache. cache=False should just refer to reading+writing the cached discovery document, and should rarely be used.

Why:

  • Compared to the size of data Crunch script are generally working with, the API object is cheap. Especially once we've got the discovery document on disk, caching the object itself provides relatively little benefit.
  • Meanwhile, the benefits of being more threadsafe out of the box would be substantial for users, especially in cases like the example script where it's not clear that you have anything to worry about. (Your threads/children don't share the objects you're building, so what could go wrong?)
  • For users that know ahead of time that they're going to be building a lot of objects and want to share an API client, we now have the infrastructure to let them do that in a way that's safe for them. For example, see arv-mount. If we want to make this easier, we could even introduce an "API client cacher" base class to the SDK that people could extend to say when an API client object can and can't be reused.

#3 Updated by Peter Amstutz over 4 years ago

The SafeApi class in the fuse driver handles this by allocating a separate api object per thread. This generally addresses multithreading issues, but fork() is a different beast entirely.

#4 Updated by Peter Amstutz over 4 years ago

Proposal:

  1. Move SafeApi into the SDK
  2. arvados.api() returns/caches SafeApi objects
  3. Add a SafeApi.clone() method which returns a new SafeApi with the same credentials but without the "local" dict empty
  4. Document that if you need to use os.fork(), the child process needs to .clone() its api object before using it.

#5 Updated by Brett Smith over 4 years ago

Peter Amstutz wrote:

Proposal:

  1. Move SafeApi into the SDK
  2. arvados.api() returns/caches SafeApi objects
  3. Add a SafeApi.clone() method which returns a new SafeApi with the same credentials but without the "local" dict empty
  4. Document that if you need to use os.fork(), the child process needs to .clone() its api object before using it.

Right now a SafeApi object acts a lot like the underlying Google API client object, but doesn't look like it. Users who try to introspect this object by looking at __dict__, dir(), or hasattr() will get very different results between these two. If we want to take this approach, I think it would be prudent to extend it to imitate the original API client object much more closely.

#6 Updated by Tom Clegg over 4 years ago

SafeApi is only safe in a very specific way: you can call its methods from various threads and get a different underlying object per thread, but you get back objects that aren't threadsafe at all. I'm not convinced this should be advertised as "thread safe".

Turning off connection caching by default gives us a thread-safe module that returns thread-unsafe objects, which is a reasonable level of thread safety for everyone to understand/expect.

(In the example shown here, the developer knew not to share objects across threads/processes, and was bitten only because sneaky-but-not-sneaky-enough connection pool features caused them to be shared anyway.)

#7 Updated by Peter Amstutz over 4 years ago

Tom Clegg wrote:

SafeApi is only safe in a very specific way: you can call its methods from various threads and get a different underlying object per thread, but you get back objects that aren't threadsafe at all. I'm not convinced this should be advertised as "thread safe".

Turning off connection caching by default gives us a thread-safe module that returns thread-unsafe objects, which is a reasonable level of thread safety for everyone to understand/expect.

Yes, at minimum we should stop caching the api object returned by arvados.api() and always return a new one.

There is still a problem with classes like KeepClient that are otherwise thread safe but sometimes need an API object. These classes should not store a single API client object. In this case it needs one of:

  1. Require passing in an apiclient object on every method that uses one
  2. Take a configuration object in init and use that to initialize a new apiclient object every time one is needed.
  3. Add a clone() method to the api object, and use that to initialize a new apiclient object every time one is needed.
  4. Use something like SafeApi that wraps the configuration object and acts as an apiclient factory.

#8 Updated by Peter Amstutz over 4 years ago

Brett Smith wrote:

Right now a SafeApi object acts a lot like the underlying Google API client object, but doesn't look like it. Users who try to introspect this object by looking at __dict__, dir(), or hasattr() will get very different results between these two. If we want to take this approach, I think it would be prudent to extend it to imitate the original API client object much more closely.

Are there any critical use cases where we need to be able to introspect the object this way?

#9 Updated by Tom Clegg over 4 years ago

I think it's better to stay simple: offer a simple client object, and admit that it's not thread-safe, rather than offer an object that is sort-of-thread-safe (as long as you don't expect its methods to return objects that are themselves thread-safe). This is the approach taken by Google fwiw.

Perhaps SafeApi would be clearer and simpler (thus safer) if rephrased to explicitly act as a pool rather than a transparent proxy?

class ClientPool(object):
  """A pool of Arvados API client objects.

  Never returns the same client object to callers on different threads.
  The pool is thread-safe, but the client objects returned by getClient()
  are not thread-safe.
  """ 

  def __init__(self, *initargs):
    self.local = threading.local()
    self.initargs = initargs

  def getClient(self):
    if 'client' not in self.local.__dict__:
      self.local.client = arvados.api(*initargs)
    return self.local.client

clientPool = ClientPool('v1', False, ...)
user = clientPool.getClient().users().current().execute()

This would provide an easy way for callers to get the magic-connection-pooling facility we want to remove from arvados.api(), but only when they want it -- and, fwiw, it lets the caller control when getClient() happens.

In the case of arv-mount, it would make it obvious that a configured client pool is what's being passed around -- not a single client object. Obvious is good!

Are there any critical use cases where we need to be able to introspect the object this way?

For a client library I think we should have an attitude that
  • consumers will expect our interfaces to be like normal Python interfaces, and
  • consumers will want to do things we haven't deemed necessary to do ourselves.

In this case I think a Python programmer would expect hasattr(client,'specimens') to be True if client.specimens() is supported.

#10 Updated by Brett Smith over 4 years ago

Tom Clegg wrote:

Are there any critical use cases where we need to be able to introspect the object this way?

For a client library I think we should have an attitude that
  • consumers will expect our interfaces to be like normal Python interfaces, and
  • consumers will want to do things we haven't deemed necessary to do ourselves.

In this case I think a Python programmer would expect hasattr(client,'specimens') to be True if client.specimens() is supported.

This is what I meant to get at when I made my earlier, too-technical point.

I agree with everything in Tom's last comment. I would rather have the simple, clear semantics of "arvados.api() always returns a new object, that's not threadsafe" over something like "arvados.api() returns a wrapper object that's threadsafe as long as you're using Python's native threads." The former would've prevented trouble for the forking Crunch script; the latter would not.

#11 Updated by Tom Clegg over 4 years ago

Suggest:
  • Ignore cache=False or cache=True
  • Add new flag discoveryCache=False to turn off discovery doc caching (who would want this?)
  • Remove connection-pooling behavior from api(). Return a new object every time.
  • Document "value returned from api() is not thread-safe. DIY."
Future, maybe:
  • Emit a deprecation warning if cache=False provided

#12 Updated by Tom Clegg over 4 years ago

  • Subject changed from [SDKs] [DRAFT] Improve thread safety and document pitfalls to [SDKs] Improve Python SDK thread safety and document pitfalls
  • Story points set to 0.5

#13 Updated by Tom Clegg over 4 years ago

  • Subject changed from [SDKs] Improve Python SDK thread safety and document pitfalls to [SDKs] [DRAFT] Improve thread safety and document pitfalls
  • Story points deleted (0.5)

#14 Updated by Tom Clegg over 4 years ago

  • Subject changed from [SDKs] [DRAFT] Improve thread safety and document pitfalls to [SDKs] Improve Python SDK thread safety and document pitfalls
  • Story points set to 0.5

#15 Updated by Tom Clegg over 4 years ago

  • Target version changed from Bug Triage to 2015-02-18 sprint

#16 Updated by Tom Clegg over 4 years ago

  • Assigned To set to Brett Smith

#17 Updated by Tom Clegg over 4 years ago

  • Assigned To changed from Brett Smith to Tom Clegg

#18 Updated by Brett Smith over 4 years ago

We just got a report that the GATK-merge-call Crunch script, included with Arvados, is failing because of this issue. I think getting this fixed is a priority—it would unblock users, and save us the hassle of trying to track down and fix individual instances of this everywhere. It also convinces me that we need to return wholly new objects every time. In order for this to be effective, the solution needs to be fork-safe as well as thread-safe.

#19 Updated by Tom Clegg over 4 years ago

  • Status changed from New to In Progress

#20 Updated by Tom Clegg over 4 years ago

ae7a6c9 on 5037-python-sdk-thread-safe

#21 Updated by Brett Smith over 4 years ago

Now we're at 1eeb668. My commits add a test for the new behavior, fix a Websockets test that broke for me (although I'm 99% sure that's just coincidental bad luck, but it was easy enough to fix—see commit message), and remove an unneeded cache=False from arv-copy.

My only comment about the commits so far is that the new docstring wording seems a little unwieldy. It seems like we could just say something like "Use a cache for the discovery document"—I think we can trust docstring readers to understand that we implicitly mean a well-behaved cache. If you're cool with my commits, I'm happy for you to merge with any good change there.

Thanks.

#22 Updated by Peter Amstutz over 4 years ago

5037-nonocache LGTM

#23 Updated by Anonymous over 4 years ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados|commit:1128f6e0d62f71f4ee91ab609c918ae5bb291edd.

Also available in: Atom PDF