Idea #2752

closed

arv-put can quickly resume an interrupted transfer.

Added by Tom Clegg almost 10 years ago. Updated almost 10 years ago.

Status: Resolved
Priority: Normal
Assigned To: Brett Smith
Category: -
Story points: 2.0

Description

Two main approaches are available. (Eventually we want both. For now, we want whichever is most accessible/efficient from a development perspective.)

HEAD-before-PUT mode
  • Simple to implement reliably
  • Depends on HEAD actually working (might not be true yet when proxy use case is otherwise ready to go)
  • Does not depend on local filesystem features like inode/ctime
  • Does not depend on arv-put running in the same user account (or even host) each time
  • Still requires re-reading the local data
  • Interacts interestingly with the Keep permission mechanism (needs some combination of storing partial manifests and caching permission signatures locally)
Local checkpoint (see the sketch after this list)
  • Save state (in $HOME/.cache?) while running.
    • Separate cache per arvados_api_host (don't get confused when uploading to two different sites)
    • Be attentive to race conditions (e.g., refuse to run two resumable transfers that would use the same cached data)
    • List of files in order written
    • For each file (at minimum) store name, inode, ctime, size
    • List of blobs successfully written to Keep (including size)
  • When resuming, skip what's already done (unless --no-resume is given).
    • If blob locators have permission signatures, check their expiry times before deciding to re-use. Warn the user (once per arv-put) if blobs are being re-uploaded for this reason.
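
A minimal sketch of the local-checkpoint bookkeeping in the list above, assuming Python 3 and a JSON state file under ~/.cache; every name here (state_path, save_state, load_state) is hypothetical, not the eventual arv-put code:

    import hashlib
    import json
    import os

    # Hypothetical layout: one JSON state file per (API host, upload path)
    # pair, so uploads to two different sites never share resume state.
    CACHE_DIR = os.path.join(os.path.expanduser('~'), '.cache', 'arvados', 'arv-put')

    def state_path(api_host, upload_path):
        key = '%s\0%s' % (api_host, os.path.abspath(upload_path))
        return os.path.join(CACHE_DIR, hashlib.md5(key.encode()).hexdigest())

    def save_state(api_host, upload_path, files_written, blobs_written):
        # files_written: {name, inode, ctime, size} dicts, in the order written.
        # blobs_written: {locator, size} dicts for blocks already in Keep.
        path = state_path(api_host, upload_path)
        os.makedirs(CACHE_DIR, exist_ok=True)
        with open(path + '.tmp', 'w') as f:
            json.dump({'files': files_written, 'blobs': blobs_written}, f)
        os.rename(path + '.tmp', path)  # atomic replace on POSIX

    def load_state(api_host, upload_path):
        try:
            with open(state_path(api_host, upload_path)) as f:
                return json.load(f)
        except FileNotFoundError:
            return None  # nothing to resume

A real implementation would also hold an exclusive flock() on the state file for the life of the transfer, so two resumable uploads of the same data refuse to run concurrently.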

Subtasks (0 open, 1 closed)

Task #2864: Review 2752-arv-put-resume-wip (Resolved, Peter Amstutz, 05/26/2014)

Related issues

Related to Arvados - Feature #2751: Python SDK behaves appropriately when API server advertises a Keep proxy instead of individual Keep storage servers (Resolved, Peter Amstutz, 05/15/2014)
#1

Updated by Tom Clegg almost 10 years ago

  • Description updated (diff)
#2

Updated by Brett Smith almost 10 years ago

  • Assigned To set to Brett Smith
#3

Updated by Tom Clegg almost 10 years ago

  • Description updated (diff)
#4

Updated by Brett Smith almost 10 years ago

I just uploaded 2752-arv-put-resume-wip for review. This implements approach #2 in the simplest possible way. Every time arv-put writes data to Keep, it serializes its state to disk. If state is already available, it checks it for freshness and resumes if everything checks out.

Files on disk are considered unchanged if their mtime and size stay the same. This is the same heuristic that rsync uses by default. arv-put saves every file's full stat data, so we have room to change the heuristic in the future if we wish.
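
For illustration, that heuristic only needs two fields of os.stat() output; this is a sketch with a hypothetical function name, not the branch's actual code:

    import os

    def file_unchanged(path, cached_mtime, cached_size):
        # rsync-style check: a file is presumed unchanged if its mtime and
        # size both match the values recorded at checkpoint time.
        try:
            st = os.stat(path)
        except OSError:
            return False  # missing or unreadable, so not resumable
        return st.st_mtime == cached_mtime and st.st_size == cached_size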

A few other notes that might help the review:

  • Testing infrastructure for our Python modules was moved to a tests subdirectory for a cleaner file tree. Running git diff -M will report those renames without including their full diff. Run tests from a module directory with python -m unittest discover tests.
  • a1e2086e moves arv-put's functionality to arvados.commands.put for improved testability. This is also mostly a rename, but I did move code into functions, so you get a huge diff even with -M.
#5

Updated by Peter Amstutz almost 10 years ago

  1. General comment: could use more comments outlining what the code is supposed to be doing. The most useful comments capture intention -- "this is what the code is supposed to be doing" -- so that as a reviewer I can check the intention against what the code is actually doing.
  2. ResumeCache constructor -- it's bad form to use exceptions for expected control flow; it's much clearer to test isinstance(args, str) (see the sketch after this list)
  3. ResumeCache.make_path() -- I like the idea of hashing the filename or path to determine the checkpoint file. I think this could be expanded to accomplish the function of ResumableCollectionWriter.check_dependencies() by hashing the entire file list, with size and mtime, so it only resumes if nothing has changed. Alternately, ResumableCollectionWriter.check_dependencies() could be more subtle and properly detect which files are changed/added (+ queued) and only upload those, instead of raising StaleWriterError.
  4. expected_bytes_for(pathlist) has a section:
      if os.path.isdir(path):
          for filename in arvados.util.listdir_recursive(path):
              bytesum += os.path.getsize(os.path.join(path, filename))
      elif not os.path.isfile(path):
          return None
      else:
          bytesum += os.path.getsize(path)
    

    Since the other code paths return int, the "None" code path seems likely to cause an exception (I don't know if you wrote this code or it was there before, but now would be a good time to fix it either way).
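
To make points 2 and 3 concrete, here is a sketch of an isinstance() test in the constructor and a hashed checkpoint path; the class and method names come from the review above, but the bodies are illustrative guesses, not the code under review:

    import hashlib
    import os

    class ResumeCache(object):
        CACHE_DIR = os.path.expanduser('~/.cache/arvados/arv-put')

        def __init__(self, file_spec):
            # Point 2: branch on the argument's type directly instead of
            # catching an exception from a failed attribute access.
            if isinstance(file_spec, str):
                self.filename = self.make_path(file_spec)
            else:
                self.filename = self.make_path(' '.join(file_spec))

        @classmethod
        def make_path(cls, file_spec):
            # Point 3: hash the upload specification so each distinct
            # upload gets its own checkpoint file.
            digest = hashlib.md5(file_spec.encode()).hexdigest()
            return os.path.join(cls.CACHE_DIR, digest)
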
#6

Updated by Peter Amstutz almost 10 years ago

  1. 0% bug:
    16M / 33M 0.0%
    
  2. When I interrupt it and then start it again, it sometimes takes a while before the progress line shows up (presumably because it is finishing the current block). It should print the starting progress state immediately.
  3. When resuming, it should print a message to the user like "resuming upload X, use --no-resume if you want to start over"
  4. Displaying bytes per second and estimated time remaining on the progress line would be very nice (see the sketch after this list)
  5. A --verbose option that prints each file name as it finishes uploading would be nice
  6. run_test_server.py is broken for external tests (sdk/go/src/arvados.org/keepclient and services/keep/src/arvados.org/keepproxy) that use it to help run integration tests.
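
For item 4, the arithmetic could be as small as the sketch below; everything here (names and output format) is hypothetical, not code from the branch:

    import time

    def progress_line(bytes_written, bytes_expected, start_time):
        # Throughput from wall-clock elapsed time; ETA from the bytes
        # still outstanding at that rate. Guard against a zero or
        # unknown denominator in both calculations.
        elapsed = time.time() - start_time
        rate = bytes_written / elapsed if elapsed > 0 else 0.0
        line = '%dM / %dM' % (bytes_written >> 20, (bytes_expected or 0) >> 20)
        if rate > 0:
            line += ' %.1f MiB/s' % (rate / (1 << 20))
            if bytes_expected:
                line += ' ETA %.0fs' % ((bytes_expected - bytes_written) / rate)
        return line
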
#7

Updated by Brett Smith almost 10 years ago

Peter Amstutz wrote:

  3. ResumeCache.make_path() -- I like the idea of hashing the filename or path to determine the checkpoint file. I think this could be expanded to accomplish the function of ResumableCollectionWriter.check_dependencies() by hashing the entire file list, with size and mtime, so it only resumes if nothing has changed. Alternately, ResumableCollectionWriter.check_dependencies() could be more subtle and properly detect which files are changed/added (+ queued) and only upload those, instead of raising StaleWriterError.

To capture some of our discussion from yesterday: the second idea is sort of what I started out trying to implement, but it's tricky because what CollectionWriter actually puts in Keep is affected not only by the contents of the files/directories, but also by command-line arguments like --filename and --max-manifest-depth. So you have to cache results not only based on the contents of files, but on how they're being stored in Keep as well. This seems like it ought to be doable, but figuring out how to do it in a way that would be most useful for the immediate use case in this story seemed like a bigger approach than was called for.

The current approach lets us generate a very specific reason why we're not reusing a cache, reflected in the different messages of the StaleWriterStateErrors raised. arv-put isn't doing anything with that information currently, but the architecture leaves our options open. If that check became part of the process of generating a cache filename, you would have no way of knowing why an upload didn't resume; all you would know is that things hashed out differently.

  4. expected_bytes_for(pathlist) has a section:
    [...]
    Since the other code paths return int, the "None" code path seems likely to cause an exception (I don't know if you wrote this code or it was there before, but now would be a good time to fix it either way).

Ultimately this result makes it to the progress-reporting functions, which sometimes divide bytes_written by bytes_expected to generate a percentage. Any integer sentinel value we choose would have to be guarded against just as None is: 0 threatens to divide by zero, and anything negative would produce nonsense percentages. Given that there must be a guard, I think None most clearly expresses the intent of "we don't know how many bytes to expect."
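
As a sketch of the guard in question (hypothetical names; the branch's real progress function may differ):

    def human_progress(bytes_written, bytes_expected):
        # bytes_expected is None when the total cannot be known up front;
        # test it before dividing so a percentage only appears when it is
        # meaningful.
        if bytes_expected:
            return '%dM / %dM %.1f%%' % (bytes_written >> 20, bytes_expected >> 20,
                                         100.0 * bytes_written / bytes_expected)
        return '%dM (total size unknown)' % (bytes_written >> 20)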

  4. Displaying bytes per second and estimated time remaining on the progress line would be very nice
  5. A --verbose option that prints each file name as it finishes uploading would be nice

I think these features both sound cool, but they're not immediately related to this story and it would be preferable to task them out separately.

I've pushed more commits that address your other comments; the branch is currently at 891d6b2. Please take another look. Thanks.

#8

Updated by Brett Smith almost 10 years ago

From in-office discussion:

  • Resumed uploads report progress past 100%.
  • Too much performance regression from master.
#9

Updated by Brett Smith almost 10 years ago

Brett Smith wrote:

From in-office discussion:

  • Resumed uploads report progress past 100%.
  • Too much performance regression from master.

Both fixed as of 159f578. Ready for another look.

#10

Updated by Peter Amstutz almost 10 years ago

$ arv-put golang-1.2.1/
Traceback (most recent call last):
File "/home/peter/work/arvados/sdk/cli/bin/arv-put", line 4, in <module>
main()
File "/home/peter/work/arvados/sdk/python/arvados/commands/put.py", line 354, in main
writer.cache_state()
UnboundLocalError: local variable 'writer' referenced before assignment

#11

Updated by Peter Amstutz almost 10 years ago

Using ^C I get at least two different stack traces; we should treat ^C as expected and handle it gracefully.

The first time I run arv-put and hit ^C I get this:

$ arv-put golang-1.2.1/
10M / 33M 29.9% ^CTraceback (most recent call last):
  File "/home/peter/work/arvados/sdk/cli/bin/arv-put", line 4, in <module>
    main()
  File "/home/peter/work/arvados/sdk/python/arvados/commands/put.py", line 349, in main
    path, max_manifest_depth=args.max_manifest_depth)
  File "/home/peter/work/arvados/sdk/python/arvados/commands/put.py", line 286, in write_directory_tree
    path, stream_name, max_manifest_depth)
  File "/home/peter/work/arvados/sdk/python/arvados/collection.py", line 264, in write_directory_tree
    self._do_queued_work()
  File "/home/peter/work/arvados/sdk/python/arvados/collection.py", line 192, in _do_queued_work
    self._work_dirents()
  File "/home/peter/work/arvados/sdk/python/arvados/collection.py", line 218, in _work_dirents
    self.start_new_stream(stream_name)
  File "/home/peter/work/arvados/sdk/python/arvados/collection.py", line 315, in start_new_stream
    self.finish_current_stream()
  File "/home/peter/work/arvados/sdk/python/arvados/collection.py", line 329, in finish_current_stream
    self.flush_data()
  File "/home/peter/work/arvados/sdk/python/arvados/commands/put.py", line 265, in flush_data
    super(ArvPutCollectionWriter, self).flush_data()
  File "/home/peter/work/arvados/sdk/python/arvados/collection.py", line 280, in flush_data
    self._current_stream_locators += [Keep.put(data_buffer[0:self.KEEP_BLOCK_SIZE])]
  File "/home/peter/work/arvados/sdk/python/arvados/keep.py", line 126, in put
    return Keep.global_client_object().put(data, **kwargs)
  File "/home/peter/work/arvados/sdk/python/arvados/keep.py", line 485, in put
    t.join()
  File "/usr/lib/python2.7/threading.py", line 949, in join
    self.__block.wait()
  File "/usr/lib/python2.7/threading.py", line 339, in wait
    waiter.acquire()
KeyboardInterrupt

Each subsequent time I hit ^C I get this. In addition, each time it restarts at 10M instead of whatever it was interrupted at.

$ arv-put golang-1.2.1/
arv-put: Resuming previous upload.  Bypass with the --no-resume option.
10M / 33M 29.9% ^CTraceback (most recent call last):
  File "/home/peter/work/arvados/sdk/cli/bin/arv-put", line 4, in <module>
    main()
  File "/home/peter/work/arvados/sdk/python/arvados/commands/put.py", line 354, in main
    writer.cache_state()
UnboundLocalError: local variable 'writer' referenced before assignment
#12

Updated by Tom Clegg almost 10 years ago

  • Target version changed from 2014-05-28 Pipeline Factory to 2014-06-17 Curating and Crunch
#13

Updated by Brett Smith almost 10 years ago

  • Target version changed from 2014-06-17 Curating and Crunch to 2014-05-28 Pipeline Factory
  • arv-put now installs signal handlers to write state on SIGINT, SIGTERM, and SIGQUIT (see the sketch below).
  • State is also cached approximately every 64MiB, so we can recover from pull-the-plug scenarios.
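
A sketch of the handler registration in the first bullet above, with hypothetical writer/cache names rather than the branch's real API:

    import signal
    import sys

    def install_exit_handlers(writer, cache):
        def save_and_quit(signum, frame):
            # Persist the writer's state so the next invocation can resume,
            # then exit with the conventional 128+signal status.
            cache.save(writer.dump_state())
            sys.exit(128 + signum)
        for signum in (signal.SIGINT, signal.SIGTERM, signal.SIGQUIT):
            signal.signal(signum, save_and_quit)

The second bullet would then be a periodic cache.save() call from the write path, e.g. once per 64MiB flushed, so pulling the plug costs at most one checkpoint interval.
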
#14

Updated by Brett Smith almost 10 years ago

  • Target version changed from 2014-05-28 Pipeline Factory to 2014-06-17 Curating and Crunch
#15

Updated by Peter Amstutz almost 10 years ago

Looks good! Almost there. Just a few more tweaks.
  1. Change the resume text to say
    Resuming previous upload from last checkpoint.  Use the --no-resume option to start over.
    
  2. When it finished the upload, I get this:
    33M / 33M 99.9% fd18ec02e561035f1555c368e68d8b76+110336
    
    1. It should show 100%
    2. The manifest id should go on its own line, possibly with some text saying "saved to collection id XXX+N"
  3. It should retain the checkpoint file even for completed uploads, so if I try uploading the directory a second time it is smart enough to know it already did the work.
#16

Updated by Brett Smith almost 10 years ago

Peter Amstutz wrote:

Looks good! Almost there. Just a few more tweaks.
  1. Change the resume text to say
    [...]

Done.

  2. When it finished the upload, I get this:
    [...]
    1. It should show 100%

Done.

  2. The manifest id should go on its own line, possibly with some text saying "saved to collection id XXX+N"

It's on its own line now. We can't change the output format because tools like crunch-job parse it as is.

  3. It should retain the checkpoint file even for completed uploads, so if I try uploading the directory a second time it is smart enough to know it already did the work.

Per discussion on IRC: this sort of caching would be a cool feature, but it's not what was specced for this story. This is about resuming interrupted uploads. Deleting the state file is helpful because it avoids messages about restarting uploads that successfully finished. It also follows the philosophy of "there was no interruption, so there's nothing to resume."

Changes pushed to cbd6078.

#17

Updated by Brett Smith almost 10 years ago

  • Status changed from New to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:ffe3cdbc8c37e2b4a4e3ea4f67c1c9ca5d81e2ed.
