Idea #2752 (Closed)
arv-put can quickly resume an interrupted transfer.
Description
Two main approaches are available. (Eventually we want both. For now, we want whichever is most accessible/efficient from a development perspective.)
Approach 1: HEAD-before-PUT mode
- Simple to implement reliably
- Depends on HEAD actually working (which might not be true yet when the proxy use case is otherwise ready to go)
- Does not depend on local filesystem features like inode/ctime
- Does not depend on arv-put running in the same user account (or even the same host) each time
- Still requires re-reading the local data
- Interacts interestingly with the Keep permission mechanism (needs some combination of storing partial manifests and caching permission signatures locally)
Approach 2: Save state (in $HOME/.cache?) while running
- Separate cache per arvados_api_host (don't get confused when uploading to two different sites)
- Be attentive to race conditions (e.g., refuse to run two resumable transfers that would use the same cached data)
- Store the list of files in the order written
- For each file, store (at minimum) name, inode, ctime, size
- Store the list of blobs successfully written to Keep (including size)
- When resuming, skip what's already done (unless --no-resume is given)
- If blob locators have permission signatures, check their expiry times before deciding to re-use them. Warn the user (once per arv-put run) if blobs are being re-uploaded for this reason.
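To make approach 2 concrete, here is a minimal sketch of what such a state cache might record. Everything here is illustrative, not a shipped format: the JSON layout, the arv-put subdirectory under $HOME/.cache, and the function names are all assumptions.

```python
import hashlib
import json
import os

# Hypothetical cache location, following the $HOME/.cache idea above.
CACHE_DIR = os.path.join(os.path.expanduser('~'), '.cache', 'arvados', 'arv-put')

def state_path(api_host, upload_paths):
    # One state file per (API host, argument list) pair, so uploads to
    # two different sites never share or clobber each other's cache.
    key = hashlib.md5('\0'.join([api_host] + sorted(upload_paths)).encode()).hexdigest()
    return os.path.join(CACHE_DIR, key + '.json')

def save_state(path, files_written, keep_locators):
    # files_written: [{'name': ..., 'inode': ..., 'ctime': ..., 'size': ...}, ...]
    # keep_locators: blob locators already written to Keep, with their sizes.
    if not os.path.isdir(CACHE_DIR):
        os.makedirs(CACHE_DIR)
    state = {'files': files_written, 'blobs': keep_locators}
    tmp = path + '.tmp'
    with open(tmp, 'w') as f:
        json.dump(state, f)
    os.rename(tmp, path)  # atomic replace: a crash never leaves half a file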
Updated by Brett Smith over 10 years ago
I just uploaded 2752-arv-put-resume-wip for review. This implements approach #2 in the simplest possible way. Every time arv-put writes data to Keep, it serializes its state to disk. If there's already state available, it will check that for freshness, and resume if everything checks out.
Files on disk are considered unchanged if their mtime and size stay the same. This is the same heuristic that rsync uses by default. arv-put saves every file's full stat data, so we have room to change the heuristic in the future if we wish.
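In outline, that freshness check could look like the following (a sketch; the branch stores full stat results rather than just these two fields):

```python
import os

def file_unchanged(cached, path):
    """Return True if path still matches the cached mtime and size.

    `cached` is assumed to be a previously saved os.stat() result.
    Only st_mtime and st_size are compared (the rsync default), but
    keeping the whole stat record leaves room to tighten this later.
    """
    try:
        current = os.stat(path)
    except OSError:
        return False  # file is gone or unreadable: treat as changed
    return (current.st_mtime == cached.st_mtime and
            current.st_size == cached.st_size)
```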
A few other notes that might help the review:
- Testing infrastructure for our Python modules was moved to a tests subdirectory for a cleaner file tree. Running git diff -M will report those renames without including their full diff. Run tests from a module directory with python -m unittest discover tests.
- a1e2086e moves arv-put's functionality to arvados.commands.put for improved testability. This is also mostly a rename, but I did move code into functions, so you get a huge diff even with -M.
Updated by Peter Amstutz over 10 years ago
- General comment: could use more comments outlining what the code is supposed to be doing. The most useful comments capture intention -- "this is what the code is supposed to be doing" -- so that as a reviewer I can check the intention against what the code is actually doing.
- ResumeCache constructor -- it's bad form to use exceptions for expected control flow; it would be much clearer to test isinstance(args, str)
- ResumeCache.make_path() -- I like the idea of hashing the filename or path to determine the checkpoint file. I think this could be expanded to accomplish the function of ResumableCollectionWriter.check_dependencies() by hashing the entire file list, with size and mtime, so it only resumes if nothing has changed (see the sketch after this list). Alternately, ResumableCollectionWriter.check_dependencies() could be more subtle and properly detect which files are changed/added (and queue them), uploading only those instead of raising StaleWriterStateError.
- expected_bytes_for(pathlist) has a section:

```python
if os.path.isdir(path):
    for filename in arvados.util.listdir_recursive(path):
        bytesum += os.path.getsize(os.path.join(path, filename))
elif not os.path.isfile(path):
    return None
else:
    bytesum += os.path.getsize(path)
```

Since the other code paths return int, the None code path seems likely to cause an exception (I don't know if you wrote this code or it was there before, but now would be a good time to fix it either way).
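A rough sketch of the hash-the-file-list idea from the ResumeCache.make_path() item above. This is hypothetical: the function name is invented, and it is not what the branch currently does.

```python
import hashlib
import os

def cache_key(paths):
    # Fold every file's path, size, and mtime into the cache filename, so
    # a checkpoint is only ever found again if nothing on disk has changed.
    md5 = hashlib.md5()
    for path in sorted(paths):
        st = os.stat(path)
        md5.update('{} {} {}\n'.format(path, st.st_size, int(st.st_mtime)).encode())
    return md5.hexdigest()
```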
Updated by Peter Amstutz over 10 years ago
- 0% bug: the progress line reports 0.0% even though 16M of 33M have already been written:

```
16M / 33M 0.0%
```
- When I interrupt it and then start it again, it sometimes takes a while before the progress line shows up (presumably because it is finishing the current block). It should print the starting progress state immediately.
- When resuming, it should print a message to the user like "resuming upload X, use --no-resume if you want to start over"
- Displaying bytes per second and estimated time remaining on the progress line would be very nice (see the sketch after this list)
- A --verbose option that prints each file name as it finishes uploading would be nice
- run_test_server.py is broken for external tests (sdk/go/src/arvados.org/keepclient and services/keep/src/arvados.org/keepproxy) that use it to help run integration tests.
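On the bytes-per-second/ETA suggestion above, the arithmetic is small; something like this sketch would do (all names are hypothetical):

```python
import time

class ProgressTimer(object):
    def __init__(self, bytes_expected):
        self.started = time.time()
        self.bytes_expected = bytes_expected

    def line(self, bytes_written):
        elapsed = max(time.time() - self.started, 1e-6)
        rate = bytes_written / float(elapsed)  # bytes per second
        if rate <= 0:
            return 'stalled'
        eta = (self.bytes_expected - bytes_written) / rate
        return '%.1f MiB/s, about %d seconds left' % (rate / 2.0**20, eta)
```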
Updated by Brett Smith over 10 years ago
Peter Amstutz wrote:
- ResumeCache.make_path() -- I like the idea of hashing the filename or path to determine the checkpoint file. I think this could be expanded to accomplish the function of ResumableCollectionWriter.check_dependencies() by hashing the entire file list, with size and mtime, so it only resumes if nothing has changed. Alternately, ResumableCollectionWriter.check_dependencies() could be more subtle and properly detect which files are changed/added (and queued) and only upload those, instead of raising StaleWriterStateError.
To capture some of our discussion from yesterday: the second idea is sort of what I started out trying to implement, but it's tricky because what CollectionWriter actually puts in Keep is affected not only by the contents of the files/directories, but also by command-line arguments like --filename and --max-manifest-depth. So you have to cache results based not only on the contents of files, but also on how they're being stored in Keep. This seems like it ought to be doable, but figuring out how to do it in a way that would be most useful for the immediate use case in this story seemed like a bigger approach than was called for.
The current approach lets us generate a very specific reason why we're not reusing a cache, reflected in the different messages of the StaleWriterStateErrors raised. arv-put isn't doing anything with that information currently, but the current architecture leaves our options open. If that work became part of the process of generating a cache filename, you'd have no way of knowing why an upload didn't resume; all you'd know is that things hashed out differently.
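In other words, if the cache filename folded in everything that affects the result, it might look like the sketch below (hypothetical; the function name is invented), and the only signal on any mismatch is "no cache file with this name exists":

```python
import hashlib

def opaque_cache_key(paths, filename_arg, max_manifest_depth):
    # Any change at all -- file list, --filename, --max-manifest-depth --
    # yields a different key. Resume silently fails to match, and there is
    # no way to tell the user which of those inputs changed.
    md5 = hashlib.md5()
    md5.update(repr((sorted(paths), filename_arg, max_manifest_depth)).encode())
    return md5.hexdigest()
```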
- expected_bytes_for(pathlist) has a section:
[...]
Since the other code paths return int, the "None" code path seems likely to cause an exception (I don't know if you wrote this code or it was there before, but now would be a good time to fix it either way).
Ultimately this result makes it to the progress-reporting functions, which sometimes divide bytes_written by bytes_expected to generate a percentage. Any integer sentinel value we choose would have to be guarded against just as None is: 0 threatens to divide by zero, and anything negative would produce nonsense percentages. Given that there must be a guard, I think None most clearly expresses the intent of "we don't know how many bytes to expect."
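With None as the sentinel, the guard sits in one obvious place at the point of use, e.g. (illustrative, not the shipped progress code):

```python
def human_progress(bytes_written, bytes_expected):
    # bytes_expected is None when the total can't be known up front
    # (for example, when reading from a pipe): fall back to a plain count.
    if bytes_expected:
        return '%dM / %dM %.1f%%' % (
            bytes_written >> 20, bytes_expected >> 20,
            100.0 * bytes_written / bytes_expected)
    return '%d bytes written' % bytes_written
```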
- Displaying bytes per second and estimated time remaining on the progress line would be very nice
- A --verbose option that prints each file name as it finishes uploading would be nice
I think these features both sound cool, but they're not immediately related to this story and it would be preferable to task them out separately.
I've pushed more commits that address your other comments; the branch is currently at 891d6b2. Please take another look. Thanks.
Updated by Brett Smith over 10 years ago
From in-office discussion:
- Resumed uploads report progress past 100%.
- Too much performance regression from master.
Updated by Brett Smith over 10 years ago
Brett Smith wrote:
From in-office discussion:
- Resumed uploads report progress past 100%.
- Too much performance regression from master.
Both fixed as of 159f578. Ready for another look.
Updated by Peter Amstutz over 10 years ago
```
$ arv-put golang-1.2.1/
Traceback (most recent call last):
  File "/home/peter/work/arvados/sdk/cli/bin/arv-put", line 4, in <module>
    main()
  File "/home/peter/work/arvados/sdk/python/arvados/commands/put.py", line 354, in main
    writer.cache_state()
UnboundLocalError: local variable 'writer' referenced before assignment
```
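The usual fix for this class of bug is to bind the name before the code that can fail, then guard the cleanup. A minimal self-contained illustration (the class and function names are hypothetical):

```python
class Writer(object):
    def cache_state(self):
        print('state cached')

def main(fail_early=False):
    writer = None  # bind the name before anything in the try block can raise
    try:
        if fail_early:
            raise RuntimeError('simulated failure before the writer exists')
        writer = Writer()  # stand-in for constructing the CollectionWriter
    finally:
        # Without this guard, a failure before the assignment would surface
        # as UnboundLocalError -- the traceback above -- instead of the
        # real error.
        if writer is not None:
            writer.cache_state()
```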
Updated by Peter Amstutz over 10 years ago
Using ^C I get at least two different stack traces; we should treat ^C as expected input and handle it gracefully.
The first time I run arv-put and hit ^C I get this:
```
$ arv-put golang-1.2.1/
10M / 33M 29.9% ^CTraceback (most recent call last):
  File "/home/peter/work/arvados/sdk/cli/bin/arv-put", line 4, in <module>
    main()
  File "/home/peter/work/arvados/sdk/python/arvados/commands/put.py", line 349, in main
    path, max_manifest_depth=args.max_manifest_depth)
  File "/home/peter/work/arvados/sdk/python/arvados/commands/put.py", line 286, in write_directory_tree
    path, stream_name, max_manifest_depth)
  File "/home/peter/work/arvados/sdk/python/arvados/collection.py", line 264, in write_directory_tree
    self._do_queued_work()
  File "/home/peter/work/arvados/sdk/python/arvados/collection.py", line 192, in _do_queued_work
    self._work_dirents()
  File "/home/peter/work/arvados/sdk/python/arvados/collection.py", line 218, in _work_dirents
    self.start_new_stream(stream_name)
  File "/home/peter/work/arvados/sdk/python/arvados/collection.py", line 315, in start_new_stream
    self.finish_current_stream()
  File "/home/peter/work/arvados/sdk/python/arvados/collection.py", line 329, in finish_current_stream
    self.flush_data()
  File "/home/peter/work/arvados/sdk/python/arvados/commands/put.py", line 265, in flush_data
    super(ArvPutCollectionWriter, self).flush_data()
  File "/home/peter/work/arvados/sdk/python/arvados/collection.py", line 280, in flush_data
    self._current_stream_locators += [Keep.put(data_buffer[0:self.KEEP_BLOCK_SIZE])]
  File "/home/peter/work/arvados/sdk/python/arvados/keep.py", line 126, in put
    return Keep.global_client_object().put(data, **kwargs)
  File "/home/peter/work/arvados/sdk/python/arvados/keep.py", line 485, in put
    t.join()
  File "/usr/lib/python2.7/threading.py", line 949, in join
    self.__block.wait()
  File "/usr/lib/python2.7/threading.py", line 339, in wait
    waiter.acquire()
KeyboardInterrupt
```
Each subsequent time I hit ^C I get this. In addition, each time it restarts at 10M instead of whatever it was interrupted at.
```
arv-put golang-1.2.1/
arv-put: Resuming previous upload. Bypass with the --no-resume option.
10M / 33M 29.9% ^CTraceback (most recent call last):
  File "/home/peter/work/arvados/sdk/cli/bin/arv-put", line 4, in <module>
    main()
  File "/home/peter/work/arvados/sdk/python/arvados/commands/put.py", line 354, in main
    writer.cache_state()
UnboundLocalError: local variable 'writer' referenced before assignment
```
Updated by Tom Clegg over 10 years ago
- Target version changed from 2014-05-28 Pipeline Factory to 2014-06-17 Curating and Crunch
Updated by Brett Smith over 10 years ago
- Target version changed from 2014-06-17 Curating and Crunch to 2014-05-28 Pipeline Factory
- arv-put now installs signal handlers to write state on SIGINT, SIGTERM, and SIGQUIT (sketched below).
- State is also cached approximately every 64MiB, so we can recover from pull-the-plug scenarios.
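A sketch of that arrangement (the handler body is illustrative; it assumes the writer object exposes the cache_state() method seen in the tracebacks above):

```python
import signal
import sys

def install_checkpoint_handlers(writer):
    def exit_signal_handler(sigcode, frame):
        # Persist progress before dying, then exit with the conventional
        # 128+N status for death-by-signal.
        writer.cache_state()
        sys.exit(128 + sigcode)
    for sigcode in (signal.SIGINT, signal.SIGTERM, signal.SIGQUIT):
        signal.signal(sigcode, exit_signal_handler)
```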
Updated by Brett Smith over 10 years ago
- Target version changed from 2014-05-28 Pipeline Factory to 2014-06-17 Curating and Crunch
Updated by Peter Amstutz over 10 years ago
- Change the resume text to say: "Resuming previous upload from last checkpoint. Use the --no-resume option to start over."
- When it finished the upload, I get this:

```
33M / 33M 99.9% fd18ec02e561035f1555c368e68d8b76+110336
```

- It should show 100% (see the sketch after this list)
- The manifest id should go on its own line, possibly with some text saying "saved to collection id XXX+N"
- It should retain the checkpoint file even for completed uploads, so if I try uploading the directory a second time it is smart enough to know it already did the work.
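On the 99.9% item: flooring the displayed values leaves a finished upload just short of 100. Comparing exact byte counts first avoids it; a sketch, not the shipped code:

```python
def percent_done(bytes_written, bytes_expected):
    # Decide "complete" on exact byte counts, so the final report says
    # 100.0 instead of a truncated 99.9.
    if bytes_written >= bytes_expected:
        return 100.0
    return 100.0 * bytes_written / bytes_expected
```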
Updated by Brett Smith over 10 years ago
Peter Amstutz wrote:
Looks good! Almost there. Just a few more tweaks.
- Change the resume text to say
[...]
Done.
- When it finished the upload, I get this:
[...]
- It should show 100%
Done.
- The manifest id should go on its own line, possibly with some text saying "saved to collection id XXX+N"
It's on its own line now. We can't add extra text around it, though, because tools like crunch-job parse the output as is.
- It should retain the checkpoint file even for completed uploads, so if I try uploading the directory a second time it is smart enough to know it already did the work.
Per discussion on IRC: this sort of caching would be a cool feature, but it's not what was specced for this story. This is about resuming interrupted uploads. Deleting the state file is helpful because it avoids messages about restarting uploads that successfully finished. It also follows the philosophy of "there was no interruption, so there's nothing to resume."
Changes pushed to cbd6078.
Updated by Brett Smith over 10 years ago
- Status changed from New to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:ffe3cdbc8c37e2b4a4e3ea4f67c1c9ca5d81e2ed.