Story #10383

Support incremental upload (rsync style) for arv-put

Added by Tom Morris almost 4 years ago. Updated almost 4 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
10/26/2016
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
0.5

Description

Be able to redo an upload of a directory which has changed and only upload the new and changed files.

Functionality:

  • If a local file doesn't exist in the remote collection, upload it.
  • If a local file is different from a corresponding remote file, replace the remote collection with the local file contents.
    • As a special case, if a local file has been partially uploaded, try to resume upload and append remote file contents instead of starting from 0.
  • If a file exists remotely and has been deleted locally, the remote file is left unchanged (left for future feature).

In "resume" mode, the data is uploaded to Keep and checkpointed to cache. When upload is complete, a new collection is created.

In "update" mode, the data is uploaded to Keep and checkpointed to both the cache and the remote collection.

In both modes, read and record a checkpoint file which stores metadata about files that have been partially or completely uploaded.

When performing a resume or update, compare checkpointed file metadata to the remote collection and skip files which are unchanged.

Use ArvFile.manifest_text() to get a valid manifest for each individual file for the purposes of comparing content.


Subtasks

Task #10384: Review 10383-arv-put-incremental-uploadResolvedPeter Amstutz

Associated revisions

Revision f2e2487d
Added by Lucas Di Pentima almost 4 years ago

Merge branch '10383-arv-put-incremental-upload'
Closes #10383

Revision ad0712cc
Added by Lucas Di Pentima almost 4 years ago

Merge branch '10383-arv-put-incremental-upload'
Refs #10383

History

#1 Updated by Lucas Di Pentima almost 4 years ago

  • Story points set to 1.0

#2 Updated by Lucas Di Pentima almost 4 years ago

  • Status changed from New to In Progress

#3 Updated by Lucas Di Pentima almost 4 years ago

Branch 10383-arv-put-incremental-upload at 391a1d8 is ready for review.

arv-put command accepts a new option --update-collection <collection locator>, and it will only upload those files that don't exist on the named collection.

Tests being run at: https://ci.curoverse.com/job/developer-run-tests/57/

#4 Updated by Peter Amstutz almost 4 years ago

  • If you use --update-collection with keep_locator_pattern (a hash instead of a uuid), you need to use collection.save_new().
  • I don't understand why we would want to skip a local file that is suspected to be different from the remote file? If the purpose of this feature is to run arv-put repeatedly against a directory that is being written to in real time, it may upload files which are only partially written?
                # If file already exist and we are updating the collection, ignore it
                # even if it's different from the local one.
                if self.update_collection:
                    self.bytes_skipped += file_in_collection.size()
                    return
  • I'm not sure the extra check is necessary, the file in the collection should already exist as an empty file when it is opened for writing:
            # Allow an empty file to be written
            if not data and not first_read:
                break
            if first_read:
                first_read = False
            output.write(data)

#5 Updated by Tom Morris almost 4 years ago

  • Description updated (diff)

#6 Updated by Peter Amstutz almost 4 years ago

  • Description updated (diff)

#7 Updated by Peter Amstutz almost 4 years ago

  • Description updated (diff)

#8 Updated by Peter Amstutz almost 4 years ago

  • Description updated (diff)

#9 Updated by Peter Amstutz almost 4 years ago

Additional notes:

  1. Read the manifest from the checkpoint file and create a Collection object. Also read file metadata.
  2. Walk the files on disk and get the size and timestamp for each one
  3. For each file, check if it is present in the checkpoint metadata. If not present in the checkpoint, add the file to the upload list.
  4. If the file is present in the checkpoint metadata, but the file metadata does not match the checkpoint metadata (size or mtime are different) then add it to the the upload list.
  5. If the file is present in the checkpoint metadata, and the file metadata does match the checkpoint metadata, but the file in the collection is smaller than the file on disk, then add the file to the resume list (append the missing remaining data)

In "update" mode, keep the same behavior as above, but also create a Collection object for the remote collection record. Compare each file between the collection object from the checkpoint (can just use == instead of manifest_text()) and the remote collection. Any file which is different between the local checkpoint manifest and the remote collection is added to the upload list.

#10 Updated by Lucas Di Pentima almost 4 years ago

  • Target version changed from 2016-11-09 sprint to 2016-11-23 sprint

#11 Updated by Lucas Di Pentima almost 4 years ago

  • Story points changed from 1.0 to 0.5

#12 Updated by Lucas Di Pentima almost 4 years ago

Updates pushed at 3885bc2e043ce123929acceaaa8dbbf20dbb8c12

Tests being run at: https://ci.curoverse.com/job/developer-run-tests/66/

  • The cache file saves the collection UUID when uploading for the first time.
  • On initial upload success, the cache file is not deleted, so it can be used for updating the collection.
  • Once the initial collection is done, the manifest saved in the cache file is used to detect any remote file changes, so the local file is then re-uploaded completely is necessary.
  • If there weren't remote changes from the last upload/update, new files are uploaded and partial uploads resumed.
  • If a new upload (to create a new collection) is attempted without deleting the previous cache file, a message will alert the user and the operation will be cancelled.

Because the remote collection UUID is saved on the cache file, we could allow the user to start an update without the need to specify it, and just using the --update-collection flag.

#13 Updated by Lucas Di Pentima almost 4 years ago

Some sdk/cli tests were failing because arv-put was called repeatedly on the same directory by different tests, so the same resume cache file was used. The problem is caused when arv-put finishes an upload with --manifest: it "binds" that cache file to the saved collection and won't run to upload to another one unless --no-resume is used or the cache file is deleted.

I added --no-resume to the failing tests, because they don't test that feature.

a3d4559

New test suite run at: https://ci.curoverse.com/job/developer-run-tests/67/

#14 Updated by Peter Amstutz almost 4 years ago

  • I'm not sure why it is necessary to save (and check) the collection uuid? The rsync behavior should work for any destination collection.
  • Ideally, the only difference between "update" and "resume" should be that the first one updates an existing collection, and the second one create a new collection.
  • Should --update-collection, --resume and --no-resume all be mutually exclusive options?
  • In _write_file is it necessary to wrap most of the behavior in if self.resume? When resume is False it should always write the file because the cache is empty.
  • Similarly, most of the if self.update tests should be unnecessary if the conditions for resume/no-resume/update are already set up.
  • (Counter argument, checking flags may avoid some unnecessary work, but it makes the code harder to follow)
  • I don't think we want to do this, this would mean any change to the file which increases the file size will be assumed to be an append (this defeats the purpose of using mtime to detect file changes):
    elif cached_file_data['mtime'] < os.path.getmtime(source) and cached_file_data['size'] < os.path.getsize(source):
    
  • In _write, it is not necessary to do a zero-length write in order to create an empty file (the first_read flag). (That may have been true for the old CollectionWriter but is not true for the current Collection class).
  • The Collection class is already thread safe, so some instances of with self._collection_lock may be unecessary (particularly when only calling a single method on the Collection object).
  • Why do deferred initialization with _my_collection() when the Collection only needs to be set up once, perhaps in _setup_state() ?
  • Suggest using os.walk instead of calling os.listdir recursively. Also, it needs to sort the list of files and directories so they are added in a deterministic order.
  • I think there should be a distinction between starting with an empty cache (--no-resume) and not saving a cache (--no-cache?)

If possible I would like to see the "resume" and "update" flags on ArvPutUploadJob go away, so there is just one code path determining what to upload based on the source directory, cache, and destination collection. I believe this will help simplify the code in a number of places and make it clearer what is supposed to happen.

#15 Updated by Lucas Di Pentima almost 4 years ago

Changes at: 2225be2
Tests at: https://ci.curoverse.com/job/developer-run-tests/78/

Sorry Peter about the previous mess, it got unnecessarily too complex because I modified the first version instead of re-writing it. Also, I got to understand more clearly the idea behind this story so this version is a lot simpler and more feature full.

  • Now --update-collection won't bind a cache file to an existing collection, the use will be able to update any existing collection.
  • The options --update-collection and --resume can now be used together as they offer independent features.
  • Added a new option --no-cache to avoid creating or using an existing cache file.
  • Enhanced the cache file index to use absolute paths so that using the command from different CWD won't affect its logic.
  • Removed unnecessary uses of self._collection_lock.
  • Simplified & fixed the upload/resume decision making logic.
  • Use of os.walk() to traverse directories, in a way that's deterministic even though os.walk() is a generator.
  • Eliminated superfluous code when writing an empty file.

#16 Updated by Lucas Di Pentima almost 4 years ago

  • Target version changed from 2016-11-23 sprint to 2016-12-14 sprint

#17 Updated by Peter Amstutz almost 4 years ago

Please use a logger instead of print >>stderr.

The option --max-manifest-depth is obsolete and should be marked as such (and also removed from the md5 computation in make_path). We don't want to remove it to avoid breaking backwards compatability, but it might make sense to hide it from the help text.

The update_time for checkpointing seems too frequent. Suggest larger period, maybe every 20 seconds.

In save_collection, before doing self._remote_collection.copy we need to check that signing tokens from the local manifest are not about to expire. If the tokens are expired, the files need to be re-uploaded (unless/until we have an API for refreshing tokens).

I started uploading a file, then hit ^C. I did arv-put again and it resumed. So far, so good. Then I did "touch" on the file and did arv-put again. I expected it would restart the upload, because a change in mtime is assumed to be a change in file content. However, it continued from where it left off. This is a bug.

It also resumes even when --no-resume is provided! It looks like it is failing to delete the cache when resume is False.

--no-cache also appears to be resuming (?)

I would expect that:

--resume means use the cache if available
--no-resume means delete the cache and start over
--cache means read and write a cache
--no-cache means don't read or write a cache

#18 Updated by Lucas Di Pentima almost 4 years ago

Tests at: https://ci.curoverse.com/job/developer-run-tests/96/
Updates: 5646f94

Peter Amstutz wrote:

Please use a logger instead of print >>stderr.

Done.

The option --max-manifest-depth is obsolete and should be marked as such (and also removed from the md5 computation in make_path). We don't want to remove it to avoid breaking backwards compatability, but it might make sense to hide it from the help text.

Updated. Argparse has a feature for cases like this, the argument is accepted but it isn't listed on the help message.

The update_time for checkpointing seems too frequent. Suggest larger period, maybe every 20 seconds.

Updated to 20 secs.

In save_collection, before doing self._remote_collection.copy we need to check that signing tokens from the local manifest are not about to expire. If the tokens are expired, the files need to be re-uploaded (unless/until we have an API for refreshing tokens).

Done. For this I added a permission_expired method to ArvadosFile class that check all its segments's expirations, with it's own test case.

I started uploading a file, then hit ^C. I did arv-put again and it resumed. So far, so good. Then I did "touch" on the file and did arv-put again. I expected it would restart the upload, because a change in mtime is assumed to be a change in file content. However, it continued from where it left off. This is a bug.
It also resumes even when --no-resume is provided! It looks like it is failing to delete the cache when resume is False.
--no-cache also appears to be resuming (?)

The issue was that the command was reporting its progress wrongly, and that was because instead of checking all files to be uploaded/resumed first, it was doing it file by file, so in case of a resume, the total amount of uploaded bytes were not available until the command finished. Because the real upload is done asynchronously, the bytes_written value is taken from the local_collection's size, so it was showing 100% uploaded for example when "touching" a file to re-upload it, because bytes_written was taken from the previous local collection before overwriting the changed file.

I changed this so the progress is shown correctly on upload and resume by going through the entire file list, deciding if it need to be skipped, uploaded or resumed and doing the proper accounting before starting the real upload process.

--no-resume means delete the cache and start over

Updated.

Also added some tests to confirm that --no-resume and --no-cache work as intended.

#19 Updated by Peter Amstutz almost 4 years ago

I'm getting an error every time I run it:

$ arv-put .
0 2016-12-12 16:01:59 arvados.arv_put[32544] ERROR: 

962eh-4zz18-f6k8xl2558krll0
$ find
.
./b1
./foo1
./foo1/blah
./foo2
./foo2/blah2

These files are all zero-length, which might be trigging an edge case.

It seems to be uploading the collection correctly, so the bug may just be in the non-error error being logged.

#20 Updated by Lucas Di Pentima almost 4 years ago

Test run: https://ci.curoverse.com/job/developer-run-tests/97/
Updates at: b5af634

  • Fixed the non-error error message.
  • Also added --dry-run argument so that it checks if there're files pending for upload (in which case it exits with return code 2 instead of 0).
  • Added some tests for this new argument.

#21 Updated by Peter Amstutz almost 4 years ago

112.0% ? oops.

$ arv-put .
7M / 6M 112.0% 962eh-4zz18-dkn2si9w0nic8rw

#22 Updated by Peter Amstutz almost 4 years ago

Also, --dry-run should not print anything.

$ arv-put --dry-run --update 962eh-4zz18-dkn2si9w0nic8rw .
7M / 6M 112.0% 962eh-4zz18-dkn2si9w0nic8rw

#23 Updated by Lucas Di Pentima almost 4 years ago

  • Target version changed from 2016-12-14 sprint to 2017-01-04 sprint

#24 Updated by Lucas Di Pentima almost 4 years ago

Peter Amstutz wrote:

112.0% ? oops.

[...]

The problem was happening when some uploaded file was deleted from the local disk previous to an arv-put re-run: the collection from the cache file manifest (aka: the local_collection) is the source of truth when asking how many bytes we're uploading, so in those cases when that collection is bigger than the local files, the error occurred. Now I added some code to delete those excess files from the local_collection, correcting the issue.
If your case was different, please explain how you got a >100% progress value.

Also, made some fixes to the dry-run code so it doesn't print anything to stdout and actually passes the args.dry_run to the ArvPutUploadJob class.

Test run: https://ci.curoverse.com/job/developer-run-tests/101/
Updates: 55e9587e140e1621dcd374afcf30ee0aec2c6d24

#25 Updated by Peter Amstutz almost 4 years ago

It is correctly reporting 100% now.

I have one more request and then LGTM:

arv-put --name "new name" --update-collection 962eh-4zz18-b41l73mftcoqpkm python

→ should update the collection name to "new name" (currently it doesn't change the collection name).

#26 Updated by Lucas Di Pentima almost 4 years ago

As we talked about avoiding further scope creep, I've made a small update so that arv-put don't allow the use of --name together with --update-collection, exiting with an error message.
Also I merged master into the branch and started a complete test run.

Test run: https://ci.curoverse.com/job/developer-run-tests/104/
Updates: dcca309

#27 Updated by Lucas Di Pentima almost 4 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:f2e2487ddba3944d3acd46fae3424a87fc624be9.

Also available in: Atom PDF