Idea #10383
Closed: Support incremental upload (rsync style) for arv-put
Description
Be able to redo an upload of a directory which has changed and only upload the new and changed files.
Functionality:
- If a local file doesn't exist in the remote collection, upload it.
- If a local file is different from the corresponding remote file, replace the remote file's contents in the collection with the local file contents.
- As a special case, if a local file has been partially uploaded, try to resume the upload and append to the remote file contents instead of starting from 0.
- If a file exists remotely and has been deleted locally, the remote file is left unchanged (left for future feature).
In "resume" mode, the data is uploaded to Keep and checkpointed to cache. When upload is complete, a new collection is created.
In "update" mode, the data is uploaded to Keep and checkpointed to both the cache and the remote collection.
In both modes, read and record a checkpoint file which stores metadata about files that have been partially or completely uploaded.
When performing a resume or update, compare checkpointed file metadata to the remote collection and skip files which are unchanged.
Use ArvFile.manifest_text() to get a valid manifest for each individual file for the purposes of comparing content.
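A minimal sketch of that comparison, assuming the SDK's Collection constructor accepts a manifest text, Collection.find() returns None when a path is absent, and the per-file manifest_text() mentioned above is available (all other names are illustrative):
import arvados.collection

def file_differs(checkpoint_manifest, remote_manifest, path):
    # Sketch only: decide whether one path differs between the local
    # checkpoint manifest and the remote collection.
    local = arvados.collection.Collection(checkpoint_manifest)
    remote = arvados.collection.Collection(remote_manifest)
    remote_file = remote.find(path)
    if remote_file is None:
        return True  # not present remotely: needs upload
    local_file = local.find(path)
    # Identical content and block layout produce identical per-file manifests.
    return local_file.manifest_text() != remote_file.manifest_text()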
Updated by Lucas Di Pentima about 8 years ago
- Status changed from New to In Progress
Updated by Lucas Di Pentima about 8 years ago
Branch 10383-arv-put-incremental-upload at 391a1d8 is ready for review.
The arv-put command accepts a new option --update-collection <collection locator>, and it will only upload those files that don't exist in the named collection.
Tests being run at: https://ci.curoverse.com/job/developer-run-tests/57/
Updated by Peter Amstutz about 8 years ago
- If you use --update-collection with keep_locator_pattern (a hash instead of a uuid), you need to use collection.save_new().
- I don't understand why we would want to skip a local file that is suspected to be different from the remote file? If the purpose of this feature is to run arv-put repeatedly against a directory that is being written to in real time, it may upload files which are only partially written?
# If file already exists and we are updating the collection, ignore it
# even if it's different from the local one.
if self.update_collection:
    self.bytes_skipped += file_in_collection.size()
    return
- I'm not sure the extra check is necessary, the file in the collection should already exist as an empty file when it is opened for writing:
# Allow an empty file to be written
if not data and not first_read:
    break
if first_read:
    first_read = False
output.write(data)
Updated by Peter Amstutz about 8 years ago
Additional notes:
- Read the manifest from the checkpoint file and create a Collection object. Also read file metadata.
- Walk the files on disk and get the size and timestamp for each one
- For each file, check if it is present in the checkpoint metadata. If not present in the checkpoint, add the file to the upload list.
- If the file is present in the checkpoint metadata, but the file metadata does not match the checkpoint metadata (size or mtime are different), then add it to the upload list.
- If the file is present in the checkpoint metadata, and the file metadata does match the checkpoint metadata, but the file in the collection is smaller than the file on disk, then add the file to the resume list (append the missing remaining data).
In "update" mode, keep the same behavior as above, but also create a Collection object for the remote collection record. Compare each file between the collection object from the checkpoint (can just use == instead of manifest_text()) and the remote collection. Any file which is different between the local checkpoint manifest and the remote collection is added to the upload list.
Updated by Lucas Di Pentima about 8 years ago
- Target version changed from 2016-11-09 sprint to 2016-11-23 sprint
Updated by Lucas Di Pentima about 8 years ago
- Story points changed from 1.0 to 0.5
Updated by Lucas Di Pentima about 8 years ago
Updates pushed at 3885bc2e043ce123929acceaaa8dbbf20dbb8c12
Tests being run at: https://ci.curoverse.com/job/developer-run-tests/66/
- The cache file saves the collection UUID when uploading for the first time.
- On initial upload success, the cache file is not deleted, so it can be used for updating the collection.
- Once the initial collection is done, the manifest saved in the cache file is used to detect any remote file changes, so the local file is then re-uploaded completely if necessary.
- If there weren't remote changes from the last upload/update, new files are uploaded and partial uploads resumed.
- If a new upload (to create a new collection) is attempted without deleting the previous cache file, a message will alert the user and the operation will be cancelled.
Because the remote collection UUID is saved in the cache file, we could allow the user to start an update without the need to specify it, just by using the --update-collection flag.
Updated by Lucas Di Pentima about 8 years ago
Some sdk/cli tests were failing because arv-put was called repeatedly on the same directory by different tests, so the same resume cache file was used. The problem is caused when arv-put finishes an upload with --manifest: it "binds" that cache file to the saved collection and won't upload to another one unless --no-resume is used or the cache file is deleted.
I added --no-resume to the failing tests, because they don't test that feature.
New test suite run at: https://ci.curoverse.com/job/developer-run-tests/67/
Updated by Peter Amstutz about 8 years ago
- I'm not sure why it is necessary to save (and check) the collection uuid? The rsync behavior should work for any destination collection.
- Ideally, the only difference between "update" and "resume" should be that the first one updates an existing collection, and the second one creates a new collection.
- Should --update-collection, --resume and --no-resume all be mutually exclusive options?
- In _write_file, is it necessary to wrap most of the behavior in if self.resume? When resume is False it should always write the file because the cache is empty.
- Similarly, most of the if self.update tests should be unnecessary if the conditions for resume/no-resume/update are already set up.
- (Counter argument: checking flags may avoid some unnecessary work, but it makes the code harder to follow.)
- I don't think we want to do this, because it would mean any change to the file which increases the file size will be assumed to be an append (this defeats the purpose of using mtime to detect file changes):
  elif cached_file_data['mtime'] < os.path.getmtime(source) and cached_file_data['size'] < os.path.getsize(source):
- In _write, it is not necessary to do a zero-length write in order to create an empty file (the first_read flag). (That may have been true for the old CollectionWriter but is not true for the current Collection class.)
- The Collection class is already thread safe, so some instances of with self._collection_lock may be unnecessary (particularly when only calling a single method on the Collection object).
- Why do deferred initialization with _my_collection() when the Collection only needs to be set up once, perhaps in _setup_state()?
- Suggest using os.walk instead of calling os.listdir recursively. Also, it needs to sort the list of files and directories so they are added in a deterministic order (see the sketch at the end of this note).
- I think there should be a distinction between starting with an empty cache (--no-resume) and not saving a cache (--no-cache?)
If possible I would like to see the "resume" and "update" flags on ArvPutUploadJob go away, so there is just one code path determining what to upload based on the source directory, cache, and destination collection. I believe this will help simplify the code in a number of places and make it clearer what is supposed to happen.
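A minimal sketch of the deterministic traversal suggested above; sorting dirnames in place controls the order in which os.walk() descends, and sorting filenames fixes the order in which files are added:
import os

def walk_sorted(root):
    """Yield file paths under root in a deterministic order."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # os.walk() honors in-place edits to dirnames
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)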
Updated by Lucas Di Pentima almost 8 years ago
Changes at: 2225be2
Tests at: https://ci.curoverse.com/job/developer-run-tests/78/
Sorry, Peter, about the previous mess; it got unnecessarily complex because I modified the first version instead of rewriting it. Also, I now understand the idea behind this story more clearly, so this version is a lot simpler and more fully featured.
- Now --update-collection won't bind a cache file to an existing collection; the user will be able to update any existing collection.
- The options --update-collection and --resume can now be used together, as they offer independent features.
- Added a new option --no-cache to avoid creating or using an existing cache file.
- Enhanced the cache file index to use absolute paths so that using the command from a different CWD won't affect its logic.
- Removed unnecessary uses of self._collection_lock.
- Simplified & fixed the upload/resume decision making logic.
- Use of os.walk() to traverse directories, in a way that's deterministic even though os.walk() is a generator.
- Eliminated superfluous code when writing an empty file.
Updated by Lucas Di Pentima almost 8 years ago
- Target version changed from 2016-11-23 sprint to 2016-12-14 sprint
Updated by Peter Amstutz almost 8 years ago
Please use a logger instead of print >>stderr.
The option --max-manifest-depth is obsolete and should be marked as such (and also removed from the md5 computation in make_path). We don't want to remove it, to avoid breaking backwards compatibility, but it might make sense to hide it from the help text.
The update_time for checkpointing seems too frequent. Suggest a larger period, maybe every 20 seconds.
In save_collection, before doing self._remote_collection.copy we need to check that signing tokens from the local manifest are not about to expire. If the tokens are expired, the files need to be re-uploaded (unless/until we have an API for refreshing tokens).
I started uploading a file, then hit ^C. I did arv-put again and it resumed. So far, so good. Then I did "touch" on the file and did arv-put again. I expected it would restart the upload, because a change in mtime is assumed to be a change in file content. However, it continued from where it left off. This is a bug.
It also resumes even when --no-resume is provided! It looks like it is failing to delete the cache when resume is False.
--no-cache also appears to be resuming (?)
I would expect that:
--resume means use the cache if available
--no-resume means delete the cache and start over
--cache means read and write a cache
--no-cache means don't read or write a cache
Updated by Lucas Di Pentima almost 8 years ago
Tests at: https://ci.curoverse.com/job/developer-run-tests/96/
Updates: 5646f94
Peter Amstutz wrote:
Please use a logger instead of print >>stderr.
Done.
The option --max-manifest-depth is obsolete and should be marked as such (and also removed from the md5 computation in make_path). We don't want to remove it, to avoid breaking backwards compatibility, but it might make sense to hide it from the help text.
Updated. Argparse has a feature for cases like this: the argument is still accepted but isn't listed in the help message.
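For illustration, the argparse feature referred to is argparse.SUPPRESS used as the help value: the option is still parsed but omitted from --help output. The option name and default below are taken from the discussion; the actual arv-put definition may differ.
import argparse

parser = argparse.ArgumentParser()
# Still accepted for backwards compatibility, but hidden from the help text.
parser.add_argument('--max-manifest-depth', type=int, default=-1,
                    help=argparse.SUPPRESS)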
The update_time for checkpointing seems too frequent. Suggest a larger period, maybe every 20 seconds.
Updated to 20 secs.
In save_collection, before doing self._remote_collection.copy we need to check that signing tokens from the local manifest are not about to expire. If the tokens are expired, the files need to be re-uploaded (unless/until we have an API for refreshing tokens).
Done. For this I added a permission_expired method to the ArvadosFile class that checks all its segments' expirations, with its own test case.
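As a rough illustration of what such a check has to look at, assuming Keep's "+A<signature>@<hex timestamp>" permission-hint format on signed locators; this is not the actual SDK method:
import re
import time

def locator_expired(locator, as_of=None):
    """Return True if a signed Keep locator's permission hint has expired."""
    match = re.search(r'\+A[0-9a-f]+@([0-9a-f]{8})', locator)
    if match is None:
        return False  # unsigned locator: nothing to expire
    expiry = int(match.group(1), 16)  # expiry is a hex Unix timestamp
    return expiry <= (as_of if as_of is not None else time.time())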
I started uploading a file, then hit ^C. I did arv-put again and it resumed. So far, so good. Then I did "touch" on the file and did arv-put again. I expected it would restart the upload, because a change in mtime is assumed to be a change in file content. However, it continued from where it left off. This is a bug.
It also resumes even when --no-resume is provided! It looks like it is failing to delete the cache when resume is False.
--no-cache also appears to be resuming (?)
The issue was that the command was reporting its progress wrongly, because instead of checking all files to be uploaded/resumed first, it was doing it file by file, so in the case of a resume, the total amount of uploaded bytes was not available until the command finished. Because the real upload is done asynchronously, the bytes_written value is taken from the local_collection's size, so it was showing 100% uploaded, for example, when "touching" a file to re-upload it, because bytes_written was taken from the previous local collection before overwriting the changed file.
I changed this so the progress is shown correctly on upload and resume by going through the entire file list, deciding if each file needs to be skipped, uploaded or resumed, and doing the proper accounting before starting the real upload process.
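A hedged sketch of that accounting pass, with skip_paths standing in for whatever set of files the command decides to skip (illustrative names, not the actual implementation):
import os

def precompute_progress(paths, skip_paths):
    """Total the byte counts before any data is actually uploaded."""
    bytes_expected = 0
    bytes_skipped = 0
    for path in paths:
        size = os.path.getsize(path)
        bytes_expected += size
        if path in skip_paths:
            bytes_skipped += size
    return bytes_expected, bytes_skipped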
--no-resume means delete the cache and start over
Updated.
Also added some tests to confirm that --no-resume and --no-cache work as intended.
Updated by Peter Amstutz almost 8 years ago
I'm getting an error every time I run it:
$ arv-put .
0
2016-12-12 16:01:59 arvados.arv_put[32544] ERROR: 962eh-4zz18-f6k8xl2558krll0
$ find .
./b1
./foo1
./foo1/blah
./foo2
./foo2/blah2
These files are all zero-length, which might be triggering an edge case.
It seems to be uploading the collection correctly, so the bug may just be in the non-error error being logged.
Updated by Lucas Di Pentima almost 8 years ago
Test run: https://ci.curoverse.com/job/developer-run-tests/97/
Updates at: b5af634
- Fixed the non-error error message.
- Also added a --dry-run argument so that it checks whether there are files pending for upload (in which case it exits with return code 2 instead of 0).
- Added some tests for this new argument.
Updated by Peter Amstutz almost 8 years ago
112.0% ? oops.
$ arv-put .
7M / 6M 112.0%
962eh-4zz18-dkn2si9w0nic8rw
Updated by Peter Amstutz almost 8 years ago
Also, --dry-run should not print anything.
$ arv-put --dry-run --update 962eh-4zz18-dkn2si9w0nic8rw .
7M / 6M 112.0%
962eh-4zz18-dkn2si9w0nic8rw
Updated by Lucas Di Pentima almost 8 years ago
- Target version changed from 2016-12-14 sprint to 2017-01-04 sprint
Updated by Lucas Di Pentima almost 8 years ago
Peter Amstutz wrote:
112.0% ? oops.
[...]
The problem was happening when an uploaded file was deleted from the local disk prior to an arv-put re-run: the collection built from the cache file manifest (aka the local_collection) is the source of truth when asking how many bytes we're uploading, so when that collection is bigger than the local files, the error occurred. I added some code to delete those excess files from the local_collection, correcting the issue.
If your case was different, please explain how you got a >100% progress value.
Also, made some fixes to the dry-run code so it doesn't print anything to stdout and actually passes the args.dry_run to the ArvPutUploadJob class.
Test run: https://ci.curoverse.com/job/developer-run-tests/101/
Updates: 55e9587e140e1621dcd374afcf30ee0aec2c6d24
Updated by Peter Amstutz almost 8 years ago
It is correctly reporting 100% now.
I have one more request and then LGTM:
arv-put --name "new name" --update-collection 962eh-4zz18-b41l73mftcoqpkm python
→ should update the collection name to "new name" (currently it doesn't change the collection name).
Updated by Lucas Di Pentima almost 8 years ago
As we talked about avoiding further scope creep, I've made a small update so that arv-put doesn't allow the use of --name together with --update-collection, exiting with an error message.
Also I merged master into the branch and started a complete test run.
Test run: https://ci.curoverse.com/job/developer-run-tests/104/
Updates: dcca309
Updated by Lucas Di Pentima almost 8 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:f2e2487ddba3944d3acd46fae3424a87fc624be9.