Story #2752

arv-put can quickly resume an interrupted transfer.

Added by Tom Clegg over 6 years ago. Updated over 6 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Brett Smith
Category:
-
Start date:
05/26/2014
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
2.0

Description

Two main approaches are available. (Eventually we want both. For now, we want whichever is most accessible/efficient from a development perspective.)

HEAD-before-PUT mode
  • Simple to implement reliably
  • Depends on HEAD actually working (might not be true yet when proxy use case is otherwise ready to go)
  • Does not depend on local filesystem features like inode/ctime
  • Does not depend on arv-put running in the same user account (or even host) each time
  • Still requires re-reading the local data
  • Interacts interestingly with Keep permission mechanism (needs some combination of storing partial manifests and caching permission signatures locally)
Local checkpoint
  • Save state (in $HOME/.cache?) while running.
    • Separate cache per arvados_api_host (don't get confused when uploading to two different sites)
    • Be attentive to race conditions (e.g., refuse to run two resumable transfers that would use the same cached data)
    • List of files in order written
    • For each file (at minimum) store name, inode, ctime, size
    • List of blobs successfully written to Keep (including size)
  • When resuming, skip what's already done (unless --no-resume is given).
    • If blob locators have permission signatures, check their expiry times before deciding to re-use. Warn the user (once per arv-put) if blobs are being re-uploaded for this reason.

Subtasks

Task #2864: Review 2752-arv-put-resume-wipResolvedPeter Amstutz


Related issues

Related to Arvados - Feature #2751: Python SDK behaves appropriately when API server advertises a Keep proxy instead of individual Keep storage serversResolved05/15/2014

Associated revisions

Revision 1a37f499 (diff)
Added by Brett Smith over 6 years ago

2752: arv-put ResumeCache constructor only accepts path argument.

Refs #2752. Using exceptions to accept parsed arguments in the
constructor as well was too clever by half.

Revision 13345f28 (diff)
Added by Brett Smith over 6 years ago

2752: arv-put displays a notice when it resumes.

Based on a suggestion from Peter in refs #2752.

Revision ce5813a5 (diff)
Added by Brett Smith over 6 years ago

2752: arv-put displays progress quickly after resuming.

Refs #2752.

Revision 891d6b27 (diff)
Added by Brett Smith over 6 years ago

2752: Adapt Go tests to the new Python test infrastructure.

Refs #2752. There's two pieces to this: pointing the Go tests at the
new home of run_test_server.py, and having run_test_server.py add the
arvados module source to sys.path as needed.

Revision cbd60788 (diff)
Added by Brett Smith over 6 years ago

2752: arv-put explains resumed uploads in more detail.

Wording suggested by Peter in refs #2752.

Revision 475f5ad5 (diff)
Added by Brett Smith over 6 years ago

2752: arv-put ResumeCache constructor only accepts path argument.

Refs #2752. Using exceptions to accept parsed arguments in the
constructor as well was too clever by half.

Revision f9eeccd5 (diff)
Added by Brett Smith over 6 years ago

2752: arv-put displays a notice when it resumes.

Based on a suggestion from Peter in refs #2752.

Revision aa3d77be (diff)
Added by Brett Smith over 6 years ago

2752: arv-put displays progress quickly after resuming.

Refs #2752.

Revision 3c68bb92 (diff)
Added by Brett Smith over 6 years ago

2752: arv-put explains resumed uploads in more detail.

Wording suggested by Peter in refs #2752.

Revision ffe3cdbc
Added by Brett Smith over 6 years ago

Merge branch '2752-arv-put-resume'

Closes #2752, #2864.

History

#1 Updated by Tom Clegg over 6 years ago

  • Description updated (diff)

#2 Updated by Brett Smith over 6 years ago

  • Assigned To set to Brett Smith

#3 Updated by Tom Clegg over 6 years ago

  • Description updated (diff)

#4 Updated by Brett Smith over 6 years ago

I just uploaded 2752-arv-put-resume-wip for review. This implements approach #2 in the simplest possible way. Every time arv-put writes data to Keep, it serializes its state to disk. If there's already state available, it will check that for freshness, and resume if everything checks out.

Files on disk are considered unchanged if their mtime and size stay the same. This is the same heuristic that rsync uses by default. arv-put saves every file's full stat data, so we have room to change the heuristic in the future if we wish.

A few other notes that might help the review:

  • Testing infrastructure for our Python modules was moved to a tests subdirectory for a cleaner file tree. Running git diff -M will report those renames without including their full diff. Run tests from a module directory with python -m unittest discover tests
  • a1e2086e moves arv-put's functionality to arvados.commands.put for improved testability. This is also mostly a rename, but I did move code into functions, so you get a huge diff even with -M.

#5 Updated by Peter Amstutz over 6 years ago

  1. General comment: could use more comments outlining what the code is supposed to be doing. The most useful comments capture intention -- "this is what the code is supposed to be doing" -- so that as a reviewer I can check the intention against what the code is actually doing.
  2. ResumeCache constructor -- bad form to use exceptions for expected control flow, much clearer to test isinstance(args, str)
  3. ResumeCache.make_path() -- I like the idea of hashing the filename or path to determine the checkpoint file. I think this could be expanded to accomplish the function of ResumableCollectionWriter.check_dependencies() by hashing the entire file list, with size and mtime, so it only resumes if nothing has changed. Alternately, ResumableCollectionWriter.check_dependencies() could be more subtle and properly detect which files are changed/added (+ queued) and only upload those, instead of raising StaleWriterError.
  4. expected_bytes_for(pathlist) has a section:
      if os.path.isdir(path):
                for filename in arvados.util.listdir_recursive(path):
                    bytesum += os.path.getsize(os.path.join(path, filename))
            elif not os.path.isfile(path):
                return None
            else:
                bytesum += os.path.getsize(path)
    

    Since the other code paths return int, the "None" code path seems likely to cause an exception (I don't know if you wrote this code or it was there before, but now would be a good time to fix it either way).

#6 Updated by Peter Amstutz over 6 years ago

  1. 0% bug:
    16M / 33M 0.0%
    
  2. When I interrupt it and then start it again, it sometimes takes a while before the progress line shows up (presumably because it is finishing the current block). It should print the starting progress state immediately.
  3. When resuming, it should print a message to the user like "resuming upload X, use --no-resume if you want to start over"
  4. Displaying bytes per second and estimated time remaining on the progress line would be very nice
  5. A --verbose option that prints each file name as it finishes uploading would be nice
  6. run_test_server.py is broken for external tests (sdk/go/src/arvados.org/keepclient and services/keep/src/arvados.org/keepproxy) that use it to help run integration tests.

#7 Updated by Brett Smith over 6 years ago

Peter Amstutz wrote:

  1. ResumeCache.make_path() -- I like the idea of hashing the filename or path to determine the checkpoint file. I think this could be expanded to accomplish the function of ResumableCollectionWriter.check_dependencies() by hashing the entire file list, with size and mtime, so it only resumes if nothing has changed. Alternately, ResumableCollectionWriter.check_dependencies() could be more subtle and properly detect which files are changed/added (+ queued) and only upload those, instead of raising StaleWriterError.

To capture some of our discussion for yesterday: the second idea is sort of what I started out trying to implement, but it's tricky because what CollectionWriter actually puts in Keep is affected not only by the contents of the files/directories, but also command-line arguments like --filename and --max-manifest-depth. So you have to cache results not only based on the contents of files, but how they're being stored in Keep as well. This seems like it ought to be doable, but figuring out how to do it in a way that would be most useful for the immediate use case in this story seemed like a bigger approach than was called for.

The current approach lets us generate a very specific reason why we're not reusing a cache, which are reflected in the different messages of StaleWriterStateErrors raised. arv-put isn't doing anything with that information currently, but the current architecture leaves our options open. If that work became part of the process of generating a cache filename, you have no way of knowing why an upload didn't resume—all you know is things hashed out differently.

  1. expected_bytes_for(pathlist) has a section:
    [...]
    Since the other code paths return int, the "None" code path seems likely to cause an exception (I don't know if you wrote this code or it was there before, but now would be a good time to fix it either way).

Ultimately this result makes it to the progress-reporting functions, which sometimes divide bytes_written by bytes_expected to generate a percentage. Any integer sentinel value we choose would have to be guarded against just as None is: 0 threatens to divide by zero, and anything negative would produce nonsense percentages. Given that there must be a guard, I think None most clearly expresses the intent of "we don't know how many bytes to expect."

  1. Displaying bytes per second and estimated time remaining on the progress line would be very nice
  2. A --verbose option that prints each file name as it finishes uploading would be nice

I think these features both sound cool, but they're not immediately related to this story and it would be preferable to task them out separately.

I've pushed more commits that address your other comments; the branch is currently at 891d6b2. Please take another look. Thanks.

#8 Updated by Brett Smith over 6 years ago

From in-office discussion:

  • Resumed uploads report progress past 100%.
  • Too much performance regression from master.

#9 Updated by Brett Smith over 6 years ago

Brett Smith wrote:

From in-office discussion:

  • Resumed uploads report progress past 100%.
  • Too much performance regression from master.

Both fixed as of 159f578. Ready for another look.

#10 Updated by Peter Amstutz over 6 years ago

$ arv-put golang-1.2.1/
Traceback (most recent call last):
File "/home/peter/work/arvados/sdk/cli/bin/arv-put", line 4, in <module>
main()
File "/home/peter/work/arvados/sdk/python/arvados/commands/put.py", line 354, in main
writer.cache_state()
UnboundLocalError: local variable 'writer' referenced before assignment

#11 Updated by Peter Amstutz over 6 years ago

Using ^C I get at least two different stack traces, we should treat ^C as expected and handle it gracefully.

The first time I run arv-put and hit ^C I get this:

$ arv-put golang-1.2.1/
10M / 33M 29.9% ^CTraceback (most recent call last):
  File "/home/peter/work/arvados/sdk/cli/bin/arv-put", line 4, in <module>
    main()
  File "/home/peter/work/arvados/sdk/python/arvados/commands/put.py", line 349, in main
    path, max_manifest_depth=args.max_manifest_depth)
  File "/home/peter/work/arvados/sdk/python/arvados/commands/put.py", line 286, in write_directory_tree
    path, stream_name, max_manifest_depth)
  File "/home/peter/work/arvados/sdk/python/arvados/collection.py", line 264, in write_directory_tree
    self._do_queued_work()
  File "/home/peter/work/arvados/sdk/python/arvados/collection.py", line 192, in _do_queued_work
    self._work_dirents()
  File "/home/peter/work/arvados/sdk/python/arvados/collection.py", line 218, in _work_dirents
    self.start_new_stream(stream_name)
  File "/home/peter/work/arvados/sdk/python/arvados/collection.py", line 315, in start_new_stream
    self.finish_current_stream()
  File "/home/peter/work/arvados/sdk/python/arvados/collection.py", line 329, in finish_current_stream
    self.flush_data()
  File "/home/peter/work/arvados/sdk/python/arvados/commands/put.py", line 265, in flush_data
    super(ArvPutCollectionWriter, self).flush_data()
  File "/home/peter/work/arvados/sdk/python/arvados/collection.py", line 280, in flush_data
    self._current_stream_locators += [Keep.put(data_buffer[0:self.KEEP_BLOCK_SIZE])]
  File "/home/peter/work/arvados/sdk/python/arvados/keep.py", line 126, in put
    return Keep.global_client_object().put(data, **kwargs)
  File "/home/peter/work/arvados/sdk/python/arvados/keep.py", line 485, in put
    t.join()
  File "/usr/lib/python2.7/threading.py", line 949, in join
    self.__block.wait()
  File "/usr/lib/python2.7/threading.py", line 339, in wait
    waiter.acquire()
KeyboardInterrupt

Each subsequent time I hit ^C I get this. In addition, each time it restarts at 10M instead of whatever it was interrupted at.

arv-put golang-1.2.1/
arv-put: Resuming previous upload.  Bypass with the --no-resume option.
10M / 33M 29.9% ^CTraceback (most recent call last):
  File "/home/peter/work/arvados/sdk/cli/bin/arv-put", line 4, in <module>
    main()
  File "/home/peter/work/arvados/sdk/python/arvados/commands/put.py", line 354, in main
    writer.cache_state()
UnboundLocalError: local variable 'writer' referenced before assignment

#12 Updated by Tom Clegg over 6 years ago

  • Target version changed from 2014-05-28 Pipeline Factory to 2014-06-17 Curating and Crunch

#13 Updated by Brett Smith over 6 years ago

  • Target version changed from 2014-06-17 Curating and Crunch to 2014-05-28 Pipeline Factory
  • arv-put now installs signal handlers to write state on SIGINT, SIGTERM, and SIGQUIT.
  • State is also cached approximately every 64MiB, so we can recover from pull-the-plug scenarios.

#14 Updated by Brett Smith over 6 years ago

  • Target version changed from 2014-05-28 Pipeline Factory to 2014-06-17 Curating and Crunch

#15 Updated by Peter Amstutz over 6 years ago

Looks good! Almost there. Just a few more tweaks.
  1. Change the resume text to say
    Resuming previous upload from last checkpoint.  Use the --no-resume option to start over.
    
  2. When it finished the upload, I get this:
    33M / 33M 99.9% fd18ec02e561035f1555c368e68d8b76+110336
    
    1. It should show 100%
    2. The manifest id should go on its own line, possibly with some text saying "saved to collection id XXX+N"
  3. It should retain the checkpoint file even for completed uploads, so if I try uploading the directory a second time it is smart enough to know it already did the work.

#16 Updated by Brett Smith over 6 years ago

Peter Amstutz wrote:

Looks good! Almost there. Just a few more tweaks.
  1. Change the resume text to say
    [...]

Done.

  1. When it finished the upload, I get this:
    [...]
    1. It should show 100%

Done.

  1. The manifest id should go on its own line, possibly with some text saying "saved to collection id XXX+N"

It's on its own line now. We can't change the output format because tools like crunch-job parse it as is.

  1. It should retain the checkpoint file even for completed uploads, so if I try uploading the directory a second time it is smart enough to know it already did the work.

Per discussion on IRC: this sort of caching would be a cool feature, but it's not what was specced for this story. This is about resuming interrupted uploads. Deleting the state file is helpful because it avoids messages about restarting uploads that successfully finished. It also follows the philosophy of "there was no interruption, so there's nothing to resume."

Changes pushed to cbd6078.

#17 Updated by Brett Smith over 6 years ago

  • Status changed from New to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:ffe3cdbc8c37e2b4a4e3ea4f67c1c9ca5d81e2ed.

Also available in: Atom PDF