Bug #6600 (closed)

[SDKs] All PySDK functions that make API calls must support a num_retries argument

Added by Brett Smith over 9 years ago. Updated about 9 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Bryan Cosca
Category:
SDKs
Target version:
2015-10-28 sprint
Story points:
2.0

Description

Most of the convenience functions in __init__.py currently don't accept a num_retries argument: current_job, current_task, one_task_per_input_file, task_set_output. Find these methods and extend them to support a num_retries keyword argument with a sensible default.

Resolved edit:
current_job, current_task, task_set_output were all updated.

one_task_per_input_file will be updated in #7660


Subtasks (1 total: 0 open, 1 closed)

Task #6891: Review branch 6600-pysdk-execute-retries (Resolved, Tom Clegg, 08/05/2015)

Related issues (1 total: 0 open, 1 closed)

Precedes (1 day): Arvados - Bug #7660: [SDKs] PySDK one_task_per_input_file should support a num_retries argument (Closed, 08/07/2015 to 08/07/2015)
Actions #1

Updated by Brett Smith over 9 years ago

  • Target version changed from Arvados Future Sprints to 2015-08-19 sprint
Actions #3

Updated by Brett Smith over 9 years ago

  • Assigned To set to Bryan Cosca
  • Story points set to 3.0
Actions #4

Updated by Bryan Cosca over 9 years ago

So my first idea on how to approach this is to install this package: https://pypi.python.org/pypi/retrying.

Then, in arvados/sdk/python/arvados/__init__.py, I can add @retry(stop_max_attempt_number=num_retries) over all methods that make an API call. For example:

def task_set_output(self,s):
    api('v1').job_tasks().update(uuid=self['uuid'],
                                 body={
            'output':s,
            'success':True,
            'progress':1.0
            }).execute()

will become

from retrying import retry  # decorator provided by the 'retrying' package mentioned above

num_retries = 5

...

@retry(stop_max_attempt_number=num_retries)
def task_set_output(self,s):
    api('v1').job_tasks().update(uuid=self['uuid'],
                                 body={
            'output':s,
            'success':True,
            'progress':1.0
            }).execute()

I'm not sure how many retries we should do; I'm just throwing out maybe 5?

Actions #5

Updated by Tom Clegg over 9 years ago

We already have some @retry_method machinery in the Python SDK, so it's probably better to continue using that rather than introduce two different systems. See arvfile.py for some examples.

That should also install the appropriate magic so the caller can override like this:

task_set_output('foo', num_retries=7)

5 sounds like a reasonable default.

The one_task_per_* methods will need to do something a little more subtle: retrying all of the create() tasks every time one fails will result in madness. E.g., make a private _add_task() method for one_task_* to call, and decorate that with @retry_method.
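
A rough sketch of that shape (illustrative only: the class and helper names here are made up, and it follows the SDK's usual pattern of passing num_retries through to execute()):

from arvados.retry import retry_method

class job_setup_sketch(object):            # hypothetical name, for illustration only
    def __init__(self, api_client, num_retries=5):
        self.api_client = api_client
        self.num_retries = num_retries     # default picked up by @retry_method

    @retry_method
    def _add_task(self, task_attrs, num_retries=None):
        # Only this single create() call is retried; the one_task_per_* loop calls
        # _add_task() once per input file, so one flaky request doesn't cause the
        # already-created tasks to be created again.
        return self.api_client.job_tasks().create(
            body=task_attrs).execute(num_retries=num_retries)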

I think it would be ideal to get the num_retries logic into the execute() method. But let's save that until we've figured out how to write tests for the "easy" way.

Actions #6

Updated by Brett Smith over 9 years ago

Tom Clegg wrote:

I think it would be ideal to get the num_retries logic into the execute() method.

At least for the methods that make a single API call, this is the implementation I anticipated when I wrote the story.

I think this is the pattern used elsewhere in the SDK even where we make multiple API calls. I can see the argument that it's not semantically correct, but I do think there's real utility in the naive implementation. If you figure the point of num_retries is to cope with temporary API server connection failures, you're probably willing to make a small number of retries every time you make a new task, even if you're doing that thousands of times. Presumably at some point a few of those requests will hiccup and need to retry, but the rest will work smoothly. The odds that you retry all of them 0 < N < num_retries times seem low in most real-life scenarios.

Actions #7

Updated by Bryan Cosca over 9 years ago

  • Status changed from New to In Progress
Actions #8

Updated by Bryan Cosca over 9 years ago

Just checking if my logic is correct:

For task_set_output, current_task, current_job, it should be like this:

@retry_method
def task_set_output(self,s,num_retries=5):
    api('v1').job_tasks().update(uuid=self['uuid'],
                                 body={
            'output':s,
            'success':True,
            'progress':1.0
            }).execute()

We can do it in the function definition because we don't care if the whole function retries 5 times; that isn't too computationally expensive, since most of these just update the API or get a parameter from the API and then execute.

But for the one_task_per_* functions, we do something like this instead:

    @retry_method
    def _add_task(self, num_retries=5):
        return                 

    @staticmethod
    def one_task_per_input_file(if_sequence=0, and_end_task=True, input_as_path=False, api_client=None):
        if if_sequence != current_task()['sequence']:
            return

        if not api_client:
            api_client = api('v1')

        job_input = current_job()['script_parameters']['input']
        cr = CollectionReader(job_input, api_client=api_client)
        cr.normalize()
        for s in cr.all_streams():
            for f in s.all_files():
                if input_as_path:
                    task_input = os.path.join(job_input, s.name(), f.name())
                else:
                    task_input = f.as_manifest()
                new_task_attrs = {
                    'job_uuid': current_job()['uuid'],
                    'created_by_job_task_uuid': current_task()['uuid'],
                    'sequence': if_sequence + 1,
                    'parameters': {
                        'input':task_input
                        }
                    }
                api_client.job_tasks().create(body=new_task_attrs).execute()._add_task()
        if and_end_task:
            api_client.job_tasks().update(uuid=current_task()['uuid'],
                                       body={'success':True}
                                       ).execute()._add_task()
            exit(0)

Is this what you were looking for Tom? I did see the pattern of having num_retries in the execute method, for example:

    @retry_method
    def update(self, other=None, num_retries=None):

          ...

            response = self._my_api().collections().get(uuid=self._manifest_locator).execute(num_retries=num_retries)

but I opted to try your way instead. I'm not exactly sure why we'd choose one over the other, though; if you could clarify that, it would be great. I'm also unsure about my _add_task method: I was debating between pass and return, because I'm not sure what is actually happening with the api_client. I think I understand how decorators work now, but I'm just unsure whether what I'm doing is actually right.

Actions #9

Updated by Tom Clegg over 9 years ago

The "retry_method" decorator is really just another way to set a default num_retries: it doesn't actually retry anything.

class Foo(object):
    def __init__(self, num_retries=2):
        self.num_retries = num_retries

    def func1(self, num_retries=None):
        if num_retries is None:
            num_retries = self.num_retries
        print num_retries

    @retry_method
    def func2(self, num_retries=None):
        print num_retries

foo = Foo(num_retries=3)
foo.func1()              # prints 3
foo.func2()              # prints 3
foo.func1(num_retries=4) # prints 4
foo.func2(num_retries=4) # prints 4

In order to actually do anything with the num_retries argument, we can use the RetryLoop device, whose features include:
  • Backoff timer: wait longer each time we retry (you probably want backoff_start=1, or something, for this case -- not zero)
  • Additional mechanisms for determining success (you probably don't need this: just don't call save_result if the result isn't successful)

http://doc.arvados.org/sdk/python/arvados/arvados.retry.RetryLoop-class.html

Actions #10

Updated by Bryan Cosca over 9 years ago

In 84cc9874760687a061dc489f5a19db66cb06108e:

Comments:

Do we want a backoff_growth? Maybe we want a quick retry at first and then check again in 10 seconds or so (without doing 10 retries). So if we use a backoff_growth of, let's say, 2x, then the second try happens 1 second later, the third after 2 seconds, the fourth after 4, etc. The con I see is that we don't want this to waste a lot of time, but it could be helpful. Could the API go down for 10 seconds and then magically come back up? Or, if the API is down for 10 seconds, is it probably going to stay down for a while?

This is an example of what I put together with your comments.

@retry_method
def task_set_output(self,s):
    output_retry_loop = RetryLoop(num_retries=5, backoff_start=1)
    for tries_left in output_retry_loop:
        try:
            # capture the response so it can be saved below
            result = api('v1').job_tasks().update(uuid=self['uuid'],
                                                  body={
                                                         'output':s,
                                                         'success':True,
                                                         'progress':1.0
                                                         }).execute()
        except TemporaryError as error:
            logger.debug("Error in task_set_output api call: {} ({} tries left)".format(error,tries_left))
        else:
            output_retry_loop.save_result(result)
    if output_retry_loop.success():
        return output_retry_loop.last_result()

I added RetryLoop to current_task, current_job, and task_set_output, with logging and retrying, waiting 1 second between retries. I also added an __init__ method to the job_setup class to deal with num_retries and pass it into execute(). Waiting for verification to see if that works, or whether to move to the RetryLoop model.

Actions #11

Updated by Brett Smith over 9 years ago

  • Target version changed from 2015-08-19 sprint to 2015-09-02 sprint
Actions #12

Updated by Brett Smith over 9 years ago

  • Target version changed from 2015-09-02 sprint to 2015-09-16 sprint
Actions #13

Updated by Bryan Cosca over 9 years ago

Branch 6600-pysdk-execute-retries commit c45afe0bb9b7edcfb09f2ef75097159e63a08251 is up for review.

After talking with Tom, we decided that users should be able to retry task_set_output soon, so the story is cut to adding the retry features to task_set_output, current_job, and current_task in init.py.

The tests for these are in
tests/test_task_output_retry.py
tests/test_crunch_job_retry.py
tests/test_crunch_task_retry.py

Actions #14

Updated by Tom Clegg over 9 years ago

Bryan Cosca wrote:

...so the story is cut to adding the retry features to task_set_output, current_job, and current_task in init.py.

IMO this should say "first merge", but not necessarily "story".

At c45afe0

task_set_output() and current_task() should just say num_retries=5 in their argument lists, and delete the "if not num_retries then num_retries=5" bit (unless there's some reason it has to be this way?)

There's a lot of code duplication here: we have two nearly identical retry_loop implementations, each with two nearly identical error handling blocks. Some of the following stuff should help...
  • ApiError is a subclass of HttpError, so it seems like we could drop the "except errors.ApiError as error:" section and still do exactly the same thing. The only difference I see is that the debug log message has the class name in the format text, but I think we can make Python fill in the appropriate class by saying ....format(repr(error)) instead of format(error).
  • The "except" blocks all start with something like if not retry.check_http_response_success(error.resp.status):. If the HTTP response was a success, but we got an exception anyway (perhaps we couldn't parse the response?) the error doesn't get reported. It looks like what we want is to retry in a specific set of circumstances and raise the original exception in all other cases. So perhaps:
    try:
        result = ...
        current_task_retry_loop.save_result(result)
    except errors.HttpError as error:
        if retry.check_http_response_success(error.resp.status) is None and tries_left > 0:
            logger.debug("current_task: job_tasks().get() raised {}, retrying with {} tries left".format(repr(error),tries_left)
        else:
            raise
    

The "else: raise" at the end of the functions looks unreachable: we always re-raise errors on the last iteration of the retry_loop.

The "current_task_retry_loop.last_result()" bits also look like no-ops -- if we succeeded, last_result() returns the value that is already in result.

And if we don't need to use the RetryLoop after we fall out of it, we can get rid of the current_task_retry_loop variable entirely and start the loop like this:
  • for tries_left in RetryLoop(num_retries=num_retries):

(TBC, haven't looked at the test cases yet)

Actions #15

Updated by Bryan Cosca over 9 years ago

At a8688fd

Tom Clegg wrote:

Bryan Cosca wrote:

...so the story is cut to adding the retry features to task_set_output, current_job, and current_task in init.py.

IMO this should say "first merge", but not necessarily "story".

Agreed.

At c45afe0

task_set_output() and current_task() should just say num_retries=5 in their argument lists, and delete the "if not num_retries then num_retries=5" bit (unless there's some reason it has to be this way?)

No reason why it has to be this way, I made the change.

There's a lot of code duplication here: we have two nearly identical retry_loop implementations, each with two nearly identical error handling blocks. Some of the following stuff should help...
  • ApiError is a subclass of HttpError, so it seems like we could drop the "except errors.ApiError as error:" section and still do exactly the same thing. The only difference I see is that the debug log message has the class name in the format text, but I think we can make Python fill in the appropriate class by saying ....format(repr(error)) instead of format(error).
  • The "except" blocks all start with something like if not retry.check_http_response_success(error.resp.status):. If the HTTP response was a success, but we got an exception anyway (perhaps we couldn't parse the response?) the error doesn't get reported. It looks like what we want is to retry in a specific set of circumstances and raise the original exception in all other cases. So perhaps: [...]

OK, I've changed my tests to reflect this.

The "else: raise" at the end of the functions looks unreachable: we always re-raise errors on the last iteration of the retry_loop.

It looks like my tests never raise there, so I think it's OK to remove that.

The "current_task_retry_loop.last_result()" bits also look like no-ops -- if we succeeded, last_result() returns the value that is already in result.

So, if I use save_result(result), does that mean the result variable gets saved and passed when the loop succeeds?

And if we don't need to use the RetryLoop after we fall out of it, we can get rid of the current_task_retry_loop variable entirely and start the loop like this:
  • [...]

There are a couple uses for this variable, for example, determining success and save_result. If you think it is cleaner to use something else, I'm all ears.

(TBC, haven't looked at the test cases yet)

Actions #16

Updated by Tom Clegg over 9 years ago

Bryan Cosca wrote:

The "else: raise" at the end of the functions looks unreachable: we always re-raise errors on the last iteration of the retry_loop.

It looks like my tests never raise there, so I think it's OK to remove that.

OK. I think we can take this further:
  • If loop.success() is guaranteed, then there's no need to say if loop.success().
  • If the only thing we do after the loop is "return result" then instead of saying "result = api_client...." inside the loop, we can probably reduce to this
        for tries_left in RetryLoop(num_retries=num_retries, backoff_start=0):
            try:
                return api_client.job_tasks().update(uuid=self['uuid'],
                                                     body={
                                                           'output':s,
                                                           'success':True,
                                                           'progress':1.0
                                                          }).execute()
            except errors.HttpError as error:
                if retry.check_http_response_success(error.status_code) is None and tries_left > 0:
                    logger.debug("task_set_output: job_tasks().update() raised {}, retrying with {} tries left".format(repr(error),tries_left))
                else:
                    raise
    

    ...and that's it, because the only possible paths through the last loop iteration (i.e., when tries_left <= 0) end in "return {...}" (if the update succeeds), throwing an uncaught exception, or re-raising a caught exception after noticing tries_left <= 0.

I think we want backoff_start=2, not zero. According to retry.py, backoff_start=0 (which is the default) disables all waiting, so we'd just try 6 times in quick succession, which wouldn't be long enough for a short API outage. keep.py uses 2 so that's probably reasonable here too. Total sleep time before giving up would be 2+4+8+16+32=62 seconds.
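
As a quick sanity check of that arithmetic (hypothetical snippet, assuming backoff_start=2 and RetryLoop's default doubling growth factor):

# Illustration only, not SDK code: with backoff_start=2 and the delay doubling on
# each retry, the five retries sleep for 2, 4, 8, 16 and 32 seconds respectively.
delays = [2 * (2 ** i) for i in range(5)]  # [2, 4, 8, 16, 32]
total_wait = sum(delays)                   # 62 seconds of waiting before giving up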

The "current_task_retry_loop.last_result()" bits also look like no-ops -- if we succeeded, last_result() returns the value that is already in result.

So, if I use save_result(result), does that mean the result variable gets saved and passed when the loop succeeds?

Yes, save_result(x) makes loop.success() return True and makes loop.last_result() return x. So this code:
  • loop.save_result(x)
    x = loop.last_result()
    if loop.success():
        print x
    

    ...does nothing more than this code:
  • print x
    
In tests, this:
  •         try:
                arvados._current_job = None
            except:
                raise
    

    is equivalent to this:
            arvados._current_job = None
    
There's a lot of code duplication in the tests -- particularly the current_task and current_job suites are nearly identical. Look at KeepClientRetryTestMixin for a way to deal with this (a rough sketch follows below):
  • Make an "ApiRetryTestMixin" class (subclassing object) with a test_* method for every kind of test, and a setUp() method that mocks the API
  • Make a class for each retryable thing (subclassing ApiRetryTestMixin and unittest.TestCase) with a run() method that calls the retryable thing (e.g., current_job())
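
Not the branch's actual test code -- just a minimal sketch of the layout those two bullets describe; the class names, the mock.patch target, and the cache-reset lines are assumptions for illustration:

import unittest
import mock

import arvados


class ApiRetryTestMixin(object):
    # Shared test_* methods; concrete subclasses only supply run_method().
    def setUp(self):
        # Reset the module-level caches so each test starts from a clean slate.
        arvados._current_job = None
        arvados._current_task = None
        patcher = mock.patch('arvados.api')
        self.api_mock = patcher.start()
        self.addCleanup(patcher.stop)

    def test_api_error_propagates(self):
        # If the API client itself blows up, the helper should surface the error.
        self.api_mock.side_effect = IOError("simulated API outage")
        self.assertRaises(IOError, self.run_method)


class CurrentJobRetryTest(ApiRetryTestMixin, unittest.TestCase):
    def run_method(self):
        return arvados.current_job()


class CurrentTaskRetryTest(ApiRetryTestMixin, unittest.TestCase):
    def run_method(self):
        return arvados.current_task()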
Actions #17

Updated by Brett Smith over 9 years ago

  • Target version changed from 2015-09-16 sprint to 2015-09-30 sprint
Actions #18

Updated by Brett Smith over 9 years ago

  • Story points changed from 3.0 to 2.0

Adjusting story points for the sprint carry-over.

Actions #19

Updated by Brett Smith about 9 years ago

  • Target version changed from 2015-09-30 sprint to Arvados Future Sprints
Actions #20

Updated by Bryan Cosca about 9 years ago

At dbd30cba

All tests are now in tests/test_init.py

Tom Clegg wrote:

Make a "ApiRetryTestMixin" class (subclassing object) with a test_* method for every kind of test, and a setUp() method that mocks the API
Make a class for each retryable thing (subclassing ApiRetryTestMixin and unittest.Testcase) with a run() method that calls the retryable thing (e.g., current_job())
I think we want backoff_start=2

All done!

Tom Clegg wrote:

ApiError is a subclass of HttpError, so it seems like we could drop the "except errors.ApiError as error:" section and still do exactly the same thing.

It seems that ApiError is being raised when mocking calls using mock_responses and a real API. I've changed __init__.py to reflect this.

Actions #21

Updated by Brett Smith about 9 years ago

  • Target version changed from Arvados Future Sprints to 2015-10-28 sprint
Actions #22

Updated by Tom Clegg about 9 years ago

At dbd30cb...

Some whitespace errors to fix (try: "git diff --color --check master...6600-pysdk-execute-retries")

Remove "change this to 2 after tests are finished" comment

In current_task() and current_job(), result isn't a very helpful variable name. How about task?

In arvados_testutil.py you deleted a blank line that probably looked superfluous but should be there according to PEP-8. (Likewise the class defs in test_init.py should be separated by two blank lines.)

The added mock_api_responses() doesn't seem to be used (and it looks a bit dubious: API requests don't go through pycurl...)

test_init.py should be renamed, maybe test_retry_job_helpers.py? (If the best way to describe a test_*.py program is "test everything in foo.py" then it makes sense to call it test_foo.py... but here we're really just testing the retry capability of some specific functions.)

check_exception() accepts an error_class argument but I don't see how this could be used. Should we just replace the two uses of self.check_exception() with self.assertRaises(self.DEFAULT_EXCEPTION, self.run_method)?

check_success() seems to be equivalent to self.run_method (which does nothing, I think you meant self.run_method()), except that it converts an exception to an "assertion failed: True != False", which seems counterproductive. How about just replace self.check_success() with self.run_method()?

Appending a 200 to the 500, 500, 500, 500, 500, 500 responses would be a nice way to confirm that the number of retries is being limited to 5.

There are a few methods that accept (..., *args, **kwargs) but don't use them and (afaics) aren't given any extra args that they need to ignore. If these aren't necessary they should be removed.

These setup bits should happen in a def setUp(self). Also, def tearDown(cls): should be def tearDown(self): (it makes no functional difference, just a human-readability nit).
Actions #23

Updated by Bryan Cosca about 9 years ago

  • Description updated (diff)
Actions #24

Updated by Bryan Cosca about 9 years ago

  • Description updated (diff)
Actions #25

Updated by Brett Smith about 9 years ago

  • Status changed from In Progress to Resolved

Per the description: one_task_per_input_file will be handled in #7660. This is a sensible division, because one_task_per_input_file almost always gets called at the beginning of a job, before real compute work begins, meaning the consequences of failure are much smaller. We can prioritize it separately. This story added retry support everywhere it's needed to avoid "job did all its compute work, then suddenly failed trying to record results in Arvados."
