Story #9463

[SDKs] Change arv-put to use the Collection class under the hood

Added by Brett Smith over 4 years ago. Updated over 4 years ago.

Status: Resolved
Priority: Normal
Assigned To: Lucas Di Pentima
Category: -
Target version: 2016-10-12 sprint
Start date: 07/11/2016
Due date: -
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: 0.5

Description

Right now arv-put does uploads using the CollectionWriter class. Reimplement it so that it uploads using the Collection class rather than CollectionWriter.

There should be no UI or UX changes as part of this story. No new command line switches, no new output. The organization of files inside collections should remain as it is now. It is OK if the implementation divides blocks differently than the current version, as long as the end result is the same when you re-get the collection.

Doing this will enable further development we're interested in like #8910.

Acceptance criteria:

  • Must maintain existing command line behavior.
  • Must be able to resume from interrupted upload, including the middle of large files
  • On resume, must detect if previously uploaded files on disk have changed and re-upload
  • If saving a partial collection, must not save it under the user-provided destination name and project (either save the manifest locally, or alter the destination name when saving an incomplete collection, like rsync)

Subtasks

Task #9468: Review 9463-change-arvput-use-collection-class (Resolved, Lucas Di Pentima)

Task #9631: Write new tests on Python SDK arv_put.ArvPutUploadJob class (Resolved, Lucas Di Pentima)


Related issues

Blocks Arvados - Bug #8910: [SDK] arv-put should save manifest text on API error (Resolved)

Blocked by Arvados - Bug #9701: [SDKs] Python SDK Collection class should pack small files into large data blocks (Resolved, 08/03/2016)

Associated revisions

Revision 67c0bb9a (diff)
Added by Lucas Di Pentima over 4 years ago

9463: Finished first draft on arv-put command use of Collection class. Also, partial uploads resuming is working. Tests were written to validate this. Refs #9463 #8910

Revision 9b98829c
Added by Lucas Di Pentima over 4 years ago

Merge branch '9463-pysdk-amended'

Refs #9463

Revision 5af68e10 (diff)
Added by Lucas Di Pentima over 4 years ago

9701: Fixed arv-put to pass a couple of pending sdk/cli tests. refs #9463

Revision c98596a4
Added by Lucas Di Pentima over 4 years ago

Merge branch '9701-collection-pack-small-files-alt'
Closes #9701 #9463

Revision 255ad79d
Added by Lucas Di Pentima over 4 years ago

Merge branch '9463-revert-arv-put-commit'
Refs #9463 #9701

Revision 29389eac
Added by Lucas Di Pentima over 4 years ago

Merge branch '9463-revert-arv-put-commit'
Refs #9463 #9701

History

#1 Updated by Brett Smith over 4 years ago

  • Assigned To set to Lucas Di Pentima

#2 Updated by Brett Smith over 4 years ago

  • Description updated (diff)

#3 Updated by Lucas Di Pentima over 4 years ago

  • Status changed from New to In Progress

#4 Updated by Peter Amstutz over 4 years ago

Suggested design for upload / resume.

Initial setup:

  • Get list of files to be uploaded
  • Hash the list of files to be uploaded (something like ResumeCache.make_path)
  • Create new, empty collection record on the API server
  • Create an upload checkpoint storing the uuid of the collection being uploaded to
    along with the timestamp and size of each file at the time of upload

For each file to upload:

  • Use Collection.open() to get an ArvadosFileWriter for the file.
  • Use ArvadosFile.write() to write to the file
    • Use ArvadosFile.flush() to synchronously commit the block (you will need to count bytes and call flush() after writing exactly KEEP_BLOCK_SIZE bytes to avoid committing truncated blocks)
  • After flushing a block, check if we've saved the collection in the last N seconds. If not, save the collection using Collection.save().
  • Closing the file object will flush it.
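The byte-counting flush logic described above can be sketched in plain Python. KEEP_BLOCK_SIZE and the writer's flush() semantics here stand in for the real Arvados SDK equivalents, so treat this as an illustration rather than the actual implementation:

```python
KEEP_BLOCK_SIZE = 64 * 1024 * 1024  # Keep's 64 MiB block size

def upload_stream(reader, writer, block_size=KEEP_BLOCK_SIZE, chunk=1 << 20):
    """Copy reader to writer, flushing exactly at block boundaries."""
    bytes_in_block = 0
    while True:
        # Never read past the current block boundary in one go.
        data = reader.read(min(chunk, block_size - bytes_in_block))
        if not data:
            break
        writer.write(data)
        bytes_in_block += len(data)
        if bytes_in_block == block_size:
            writer.flush()          # commit a full, untruncated block
            bytes_in_block = 0
```

With the real 64 MiB block size, flush() fires exactly once per full block, so no truncated blocks are ever committed mid-file.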

To resume from checkpoint:

  • Get list of files to be uploaded
  • Hash the list of files to be uploaded (something like ResumeCache.make_path)
  • Read the checkpoint file
  • Get the collection record from the API server listed in the checkpoint file
  • Compare the list of files in the cache against the actual files on disk. If the size or timestamp of any file has changed, mark it as "dirty"
  • Any files marked as "dirty" should be deleted from the Collection record.
  • Compare the list of files against the remaining files in the Collection.
  • If the file is present in the collection and has the same size, it is complete and does not need to be uploaded.
  • If the file is present in the collection but the size is smaller than the size on disk, the file is incomplete and should be resumed where it left off.
  • If the file is present in the collection but the size is larger than the size on disk, that's an error, the file should be skipped (?)
  • If the file is missing from the collection, it needs to be uploaded.

(I think you can follow this same code path whether the checkpoint is new or resumed, because the logic is the same if the collection starts out empty.)
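The per-file decision table above could be sketched like this (the cached-state and collection-size dict shapes are hypothetical, not the SDK's actual structures):

```python
import os

def resume_action(path, cached, collection_sizes):
    """Decide what to do with one file when resuming from a checkpoint."""
    st = os.stat(path)
    entry = cached.get(path)
    if entry and (entry['mtime'] != st.st_mtime or entry['size'] != st.st_size):
        return 'reupload'      # dirty: delete from collection, start over
    done = collection_sizes.get(path)
    if done is None:
        return 'upload'        # missing from the collection
    if done == st.st_size:
        return 'skip'          # already complete
    if done < st.st_size:
        return 'resume'        # pick up where the upload left off
    return 'error'             # collection copy larger than the disk copy
```

Since the logic treats an empty collection the same as a fresh upload, the same code path covers both the new and resumed cases.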

#5 Updated by Peter Amstutz over 4 years ago

From IRC:

You could go through the file locally and split it into blocks, and compare those blocks to the blocks that make up the file in the Collection. This would enable more robust detection if a file has changed, or when a local file is longer than the file in the collection (possibly due to an interrupted transfer), allows for efficient append.
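A rough sketch of that idea, assuming the collection side can supply the md5 portion of each block locator (which is an assumption about how the locators would be obtained):

```python
import hashlib

def matching_prefix_bytes(local_path, collection_block_hashes, block_size):
    """Return the byte count of the longest block-aligned prefix both copies share."""
    matched = 0
    with open(local_path, 'rb') as f:
        for expected in collection_block_hashes:
            block = f.read(block_size)
            if not block or hashlib.md5(block).hexdigest() != expected:
                break          # diverged: stop comparing
            matched += len(block)
    return matched             # an efficient append could start here
```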

#6 Updated by Peter Amstutz over 4 years ago

  • Description updated (diff)

Acceptance criteria:

  • Must maintain existing command line behavior.
  • Must be able to resume from interrupted upload, including the middle of large files
  • On resume, must detect if previously uploaded files on disk have changed and re-upload

#7 Updated by Brett Smith over 4 years ago

Peter Amstutz wrote:

  • Hash the list of files to be uploaded (something like ResumeCache.make_path)

Note that other arv-put options can affect whether or not this is really the "same" upload, like which API host we're talking to, and the value of --max-manifest-depth when we're uploading a directory. These also need to be considered, as ResumeCache.make_path does now.
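A hypothetical sketch of folding those extra options into the cache key (the real ResumeCache.make_path may hash different fields, and in a different way):

```python
import hashlib
import json

def cache_key(file_list, api_host, options):
    """Digest everything that defines the 'same' upload into one key."""
    ident = json.dumps({'files': sorted(file_list),
                        'host': api_host,
                        'opts': options}, sort_keys=True)
    return hashlib.md5(ident.encode()).hexdigest()
```

Because the API host and options are part of the digest, changing either yields a different cache entry instead of a bogus resume.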

  • Create new, empty collection record on the API server

Presumably you only want to do this after determining that there's no collection to resume from?

To resume from checkpoint:

  • Get list of files to be uploaded
  • Hash the list of files to be uploaded (something like ResumeCache.make_path)
  • Read the checkpoint file
  • Get the collection record from the API server listed in the checkpoint file

The checkpoint file needs to include the last known portable_data_hash and etag of the collection. When we get the collection record, if those attributes don't match our record, we should start from a new collection, because apparently the collection has been modified since our upload was interrupted, and we shouldn't risk overwriting those changes.
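That check might look something like this (field names are illustrative):

```python
def can_resume(checkpoint, record):
    """Only resume if the collection is exactly as the checkpoint last saw it."""
    return (checkpoint.get('portable_data_hash') == record.get('portable_data_hash')
            and checkpoint.get('etag') == record.get('etag'))
```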

  • Compare the list of files in the cache against the actual files on disk. If the size or timestamp of any file has changed, mark it as "dirty"
  • Any files marked as "dirty" should be deleted from the Collection record.

You can extract this logic from CollectionWriter.check_dependencies. It also refuses to resume anything that isn't a regular file, because there's no way to know whether something like a FIFO or block device will return the same data when you read it a second time. Where check_dependencies makes a single decision based on the entire set of inputs, your branch will want to make this decision on a file-by-file basis.

  • Compare the list of files against the remaining files in the Collection.
  • If the file is present in the collection and has the same size, it is complete and does not need to be uploaded.
  • If the file is present in the collection but the size is smaller than the size on disk, the file is incomplete and should be resumed where it left off.
  • If the file is present in the collection but the size is larger than the size on disk, that's an error, the file should be skipped (?)

The checks for "same size" and "larger size" should be unnecessary after the portable_data_hash check suggested above.

#8 Updated by Radhika Chippada over 4 years ago

  • Target version changed from 2016-07-06 sprint to 2016-07-20 sprint

#9 Updated by Lucas Di Pentima over 4 years ago

  • Story points changed from 3.0 to 1.0

#10 Updated by Tom Clegg over 4 years ago

I haven't looked closely at the actual resumable-arv-put stuff yet. In the meantime here are comments about the updated Collection class:

In test_arvfile.py, the discovery doc might as well look a bit more realistic:

- ... {'replication_desired': 2}
+ ... {'replication_desired': {'type':'integer'}}

_BlockManager should use the same name for the constructor kwarg and the instance variable -- pick one of "copies" or "num_put_copies".

The "API called it 'redundancy' before #3410" stuff is 1.5 years old now. I think we can get rid of this and just use 'replication_desired'.

Can we put the "look up default replication if none provided" code in the Collection class instead of put.py? That way, other programs will also automatically get the site-wide default.

In Collection, suggest renaming "num_write_copies" to "replication" or "replication_desired". Using different values for "num_write_copies" and "replication_desired" is an unusual use case to cater to. I think we should make it convenient to use the same value both times, i.e.:
  • accept a "replication_desired" value via constructor or by setting an instance variable
  • when writing blocks, use replication_desired (or the site-wide default, if none provided)
  • in save() and save_new(), instead of requiring the caller to specify desired replication again, use self.replication_desired (even if it's None)

If a caller instantiates a Collection from an existing collection, and then saves it, is the original replication_desired value preserved? (Reading the code it looks like it will get reset to default, which seems wrong.) Also, if a caller loads a collection and adds/modifies some files, we should respect the existing replication_desired value when writing. IOW, when loading an existing collection, we should set self.replication_desired (unless the caller has already overridden it via kwargs).

Formatting / git stuff:

In put.py, just change main(), instead of adding main_new() beside it and leaving main() as dead code. (When we want to see old/dead code, we can use git...)

Clean up whitespace -- with "git config --global core.whitespace tab-in-indent,trailing-space", "git diff" should highlight whitespace errors in red. See Coding Standards

Branch commits should just reference the issue as "1234: ..." -- don't add the "refs #xxxx" until you actually merge the branch. Otherwise, the issue page can end up with a bunch of refs to commits that never got merged.

In case you weren't already planning to rebase/reword -- try to write commit messages that are helpful when they show up in git-blame -- e.g., "Resume cache implementation" is a clue about why the code got the way it is, but "Polishing the last details" really isn't. See Coding Standards

#11 Updated by Tom Clegg over 4 years ago

  • Description updated (diff)

#12 Updated by Tom Clegg over 4 years ago

Peter Amstutz wrote:

  • Use ArvadosFile.flush() to synchronously commit the block (you will need to count bytes and call flush() after writing exactly KEEP_BLOCK_SIZE bytes to avoid committing truncated blocks)

I don't think arv-put should second-guess block boundaries or demand lots of extra flush() points. I'd rather have arv-put just write() all the data to files and let the Collection class do its thing. Surely it's more important to let the Collection pack small files into big blocks, and write out data blocks asynchronously, than to minimize the number of overlapping bytes on resume.

#13 Updated by Tom Clegg over 4 years ago

Couple of early comments

Should use "finally" instead of "except ... raise"

"if isinstance(item, arvados.arvfile.ArvadosFile):" ... how about "if collection then X else Y" so we don't rely on as many assumptions about what's in a Collection? (E.g., if Collection starts using some other class than ArvadosFile, then it might be better to crash than to go ahead as if size==0.)

_datablocks_on_item looks like it knows way too much about ArvadosFile, and the "bufferblock" stuff looks like it could output locators for blocks that haven't actually been written, which isn't right. How about calling manifest_text() to ensure all blocks are committed, then return self._my_collection().data_locators()?

#14 Updated by Lucas Di Pentima over 4 years ago

e3a7370e35792c4cdfae441099d18d1b0d18e5b3

Addressed Tom comments:

  • Using finally: on except: clause to stop the updater thread on error or finished work.
  • When counting the collection size, stopped ignoring items that are not Collection or ArvadosFile, in case some other class is used in the future.
  • When getting block locators (for example when called with --raw), now make sure the blocks are flushed to Keep before asking for segments

#15 Updated by Lucas Di Pentima over 4 years ago

1c3f80531764931c241cd07a3cbc56892b645bce

Updated integration test to mock current implementation. All tests passing now.

#16 Updated by Lucas Di Pentima over 4 years ago

  • Target version changed from 2016-07-20 sprint to 2016-08-03 sprint
  • Story points changed from 1.0 to 0.5

#17 Updated by Lucas Di Pentima over 4 years ago

1487c0577921c88801c377d753d1322fd1485968

Finished writing the "upload a large file, fail in the middle of the process and resume uploading" test. It was a hard problem: the test has to ensure the upload writes at least one block before the partial manifest is saved in the cache, and then quit (simulating an error) only after the partial manifest has been saved. Because those two processes are asynchronous and both ultimately call the KeepClient.put() method, mocking that method caused issues.
I was finally able to write a reliable test by proxying the ArvadosFile.writeto() method and setting up a pair of locks to synchronize everything.

#18 Updated by Tom Clegg over 4 years ago

In _setup_state, there are three different code blocks for initializing self._state to empty values. How about using an EMPTY_STATE class constant, and saying self._state = copy.deepcopy(self.EMPTY_STATE) in any case where the cached state can't be loaded, or is incomplete?

_setup_state() allows loading a cache with only 'manifest_text' but no 'files' key, and vice versa. Is there a scenario where this is expected to happen? If the saved state is incomplete/corrupt it seems safer to initialize to empty state.
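The suggestion above could look roughly like this (class and key names are placeholders, not the actual arv-put code):

```python
import copy

class ArvPutUploadJob:
    # One canonical empty state; deep-copied so instances never share it.
    EMPTY_STATE = {'manifest': None, 'files': {}}

    def _setup_state(self, cached=None):
        # Fall back to a fresh empty state unless the cache has every key.
        if cached and all(k in cached for k in self.EMPTY_STATE):
            self._state = cached
        else:
            self._state = copy.deepcopy(self.EMPTY_STATE)
```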

This line in write_file() looks at first like it should be protected with _state_lock. But (if I'm reading correctly) the main thread is the only thread that writes, so it's safe for the main thread to _read state without locking it, right?
  • cached_file_data = self._state['files'][source]

In _write_file(), in if source not in self._state['files'].keys() I think keys() is superfluous.

In _write_file(), avoid some extra stat calls by using cached_file_data['size'] instead of os.path.getsize(source) in the "if mtime and size match" block -- we are already counting on them being equal.

In _write_file(), "if self.resume and resume_upload:" could be just "if resume_upload" ... or it could even be "if resume_offset > 0" and the resume_upload variable wouldn't be needed at all.

The "Inconsistent cache, re-upload the file" case seems fishy. This seems to mean the file grew during a previous arv-put attempt (thereby allowing us to upload more bytes than os.path.getsize(source)) and then shrank to its original size and got backdated before we began the current attempt (thereby allowing us to pass the "mtime and size match" test). I can imagine scenarios where this could happen (you could make this happen with a series of "mv", for example) and the way we handle it seems reasonable, but perhaps we should also log a warning, in case the user is under some illusion that the source dir is stable enough for arv-put to have well-defined results?

In _save_state, it looks like ResumeCacheConflict could only happen if someone else locks the new tempfile before we even rename it into place. This seems like it should never happen, even if someone starts/resumes thousands of identical uploads at once, right? In that case I think we shouldn't silently ignore ResumeCacheConflict exceptions. Also: if other errors happen here (can't create temp file, can't rename it into place, etc) we should probably at least log a warning message. As it stands, conditions like "disk full" or "cache dir not writable" will allow the upload to continue (which seems reasonable) but the user won't even be told that progress isn't being saved, or given any clues about why it's not working (which seems cruel).

When we call _update() from _update_task() in order to report progress, I don't think we can call _my_collection().manifest_text(): that includes commit_all, so it blocks the main thread while it flushes all data blocks to Keep, and it will also interfere with Collection's block-packing scheme. There isn't yet a public API to get the manifest without flushing, so for now I think the lesser evil is to call _get_manifest_text(".", strip=False, normalize=False) from update(). Of course, at the end of the upload, we really do need to commit all data -- maybe just add a call to manifest_text() before _update()?
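A minimal sketch of such a periodic checkpointer, using a threading.Event so the main thread can stop it cleanly (names are illustrative, not the actual arv-put code):

```python
import threading

class Checkpointer:
    def __init__(self, save_fn, interval):
        self._save = save_fn
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, args=(interval,))
        self._thread.daemon = True      # don't outlive the main thread on ^C

    def _run(self, interval):
        # wait() returns False on timeout, True once stop() sets the event.
        while not self._stop.wait(interval):
            self._save()                # e.g. a non-flushing manifest snapshot

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()
```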

Putting destroy_cache in an "else" block is superfluous here:
  •     if status != 0:
            sys.exit(status)
        else:
            writer.destroy_cache()

Tests look like they will leak temp files when they fail. Suggest creating a tempdir in setUp() and removing it in tearDown(), so any temp files created during the test itself gets cleaned up even when a test fails/crashes.
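The suggested pattern, sketched with the standard library (test and path names are placeholders):

```python
import os
import shutil
import tempfile
import unittest

class UploadTestCase(unittest.TestCase):
    def setUp(self):
        self.tmpdir = tempfile.mkdtemp()

    def tearDown(self):
        shutil.rmtree(self.tmpdir)      # runs even if the test failed or crashed

    def test_something(self):
        path = os.path.join(self.tmpdir, 'data')
        with open(path, 'wb') as f:
            f.write(b'x' * 10)
        self.assertEqual(os.path.getsize(path), 10)
```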

Also, note to selves: don't forget note-10 above

#19 Updated by Lucas Di Pentima over 4 years ago

e69ce852e1cbbe5bab82e32ec5d1874ef5a768f3

Almost all points are ready for review, tried to split them in short commits so it's easier to follow. I'm working on the modified test case for the large file partial upload & resume case.

Note-10 comments

I haven't looked closely at the actual resumable-arv-put stuff yet. In the meantime here are comments about the updated Collection class:

In test_arvfile.py, the discovery doc might as well look a bit more realistic:

[...]

Updated.

_BlockManager should use the same name for the constructor kwarg and the instance variable -- pick one of "copies" or "num_put_copies".

Done: picked "copies" because that is how it's called everywhere else. The idea of using num_put_copies was to accommodate the naming convention of _BlockManager's other parameters.

The "API called it 'redundancy' before #3410" stuff is 1.5 years old now. I think we can get rid of this and just use 'replication_desired'.

Done.

Can we put the "look up default replication if none provided" code in the Collection class instead of put.py? That way, other programs will also automatically get the site-wide default.

In Collection, suggest renaming "num_write_copies" to "replication" or "replication_desired". Using different values for "num_write_copies" and "replication_desired" is an unusual use case to cater to. I think we should make it convenient to use the same value both times, i.e.:
  • accept a "replication_desired" value via constructor or by setting an instance variable
  • when writing blocks, use replication_desired (or the site-wide default, if none provided)
  • in save() and save_new(), instead of requiring the caller to specify desired replication again, use self.replication_desired (even if it's None)

Updated. The save() function updates the collection record, so I think it's not necessary to set the replication_desired parameter there because it would already have been set by save_new()

If a caller instantiates a Collection from an existing collection, and then saves it, is the original replication_desired value preserved? (Reading the code it looks like it will get reset to default, which seems wrong.) Also, if a caller loads a collection and adds/modifies some files, we should respect the existing replication_desired value when writing. IOW, when loading an existing collection, we should set self.replication_desired (unless the caller has already overridden it via kwargs).

Added relevant code inside _populate_from_api_server(), because I suppose it's the case when this information will be available. Wrote a couple of tests to validate this change.

Formatting / git stuff:

In put.py, just change main(), instead of adding main_new() beside it and leaving main() as dead code. (When we want to see old/dead code, we can use git...)

Yes, I've been getting used to running "git diff" a lot lately. Sorry for that! :)

Note-18 comments

In _setup_state, there are three different code blocks for initializing self._state to empty values. How about using an EMPTY_STATE class constant, and saying self._state = copy.deepcopy(self.EMPTY_STATE) in any case where the cached state can't be loaded, or is incomplete?

_setup_state() allows loading a cache with only 'manifest_text' but no 'files' key, and vice versa. Is there a scenario where this is expected to happen? If the saved state is incomplete/corrupt it seems safer to initialize to empty state.

Thanks for the idea, that code did smell somewhat funny. Now both 'manifest_text' and 'files' need to exist in the cache file to avoid starting from a new empty state.

This line in write_file() looks at first like it should be protected with _state_lock. But (if I'm reading correctly) the main thread is the only thread that writes, so it's safe for the main thread to _read state without locking it, right?

Correct, but I changed it anyway to lock before reading, just like the rest of the state-access code, to maintain a uniform strategy.

In _write_file(), in if source not in self._state['files'].keys() I think keys() is superfluous.

Done.

In _write_file(), avoid some extra stat calls by using cached_file_data['size'] instead of os.path.getsize(source) in the "if mtime and size match" block -- we are already counting on them being equal.

Corrected, thanks!

In _write_file(), "if self.resume and resume_upload:" could be just "if resume_upload" ... or it could even be "if resume_offset > 0" and the resume_upload variable wouldn't be needed at all.

Yes, you're right. I removed the "resume_upload" variable entirely. Less is more!

The "Inconsistent cache, re-upload the file" case seems fishy. This seems to mean the file grew during a previous arv-put attempt (thereby allowing us to upload more bytes than os.path.getsize(source)) and then shrank to its original size and got backdated before we began the current attempt (thereby allowing us to pass the "mtime and size match" test). I can imagine scenarios where this could happen (you could make this happen with a series of "mv", for example) and the way we handle it seems reasonable, but perhaps we should also log a warning, in case the user is under some illusion that the source dir is stable enough for arv-put to have well-defined results?

In _save_state, it looks like ResumeCacheConflict could only happen if someone else locks the new tempfile before we even rename it into place. This seems like it should never happen, even if someone starts/resumes thousands of identical uploads at once, right? In that case I think we shouldn't silently ignore ResumeCacheConflict exceptions. Also: if other errors happen here (can't create temp file, can't rename it into place, etc) we should probably at least log a warning message. As it stands, conditions like "disk full" or "cache dir not writable" will allow the upload to continue (which seems reasonable) but the user won't even be told that progress isn't being saved, or given any clues about why it's not working (which seems cruel).

Added logging in both cases. The ticket asked not to change the command's output, but I suppose using the logging module will redirect the messages to syslog by default, am I right?

When we call _update() from _update_task() in order to report progress, I don't think we can call _my_collection().manifest_text(): that includes commit_all, so it blocks the main thread while it flushes all data blocks to Keep, and it will also interfere with Collection's block-packing scheme. There isn't yet a public API to get the manifest without flushing, so for now I think the lesser evil is to call _get_manifest_text(".", strip=False, normalize=False) from update(). Of course, at the end of the upload, we really do need to commit all data -- maybe just add a call to manifest_text() before _update()?

Working on this; can't get the test case to work properly yet.

Putting destroy_cache in an "else" block is superfluous here:
  • [...]

Corrected!

Tests look like they will leak temp files when they fail. Suggest creating a tempdir in setUp() and removing it in tearDown(), so any temp files created during the test itself gets cleaned up even when a test fails/crashes.

Tests corrected to use precreated temp files and dirs, thanks!

#20 Updated by Tom Clegg over 4 years ago

Can we use the "with mock.patch(...):" form instead of swapping the mock in and out manually? I'm seeing seemingly unrelated test failures after the ArvadosPutTest tests run, and I'm guessing they're caused by stuff like this line in test_api_error_handling() that doesn't clean up after itself...

arvados.collection.Collection.save_new = coll_save_mock
======================================================================
ERROR: test_create (tests.test_arvfile.ArvadosFileWriterTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/tom/src/arvados/sdk/python/tests/test_arvfile.py", line 360, in test_create
    c.save_new("test_create")
  File "/tmp/arvados-build/VENVDIR/local/lib/python2.7/site-packages/mock/mock.py", line 1062, in __call__
    return _mock_self._mock_call(*args, **kwargs)
  File "/tmp/arvados-build/VENVDIR/local/lib/python2.7/site-packages/mock/mock.py", line 1118, in _mock_call
    raise effect 
ApiError: <HttpError 403 "Forbidden">

It looks like the Collection class is incapable of packing small files into big blocks, even if we add support for close(flush=False). But as far as I can tell, once Collection supports that and we add a close(flush=False) feature, everything else in arv-put will start DTRT. So we can address the Collection deficiency as a separate issue. The only question is whether it's OK for arv-put to lose its block-packing feature, or whether we should block this merge behind the Collection fix.

#21 Updated by Radhika Chippada over 4 years ago

  • Target version changed from 2016-08-03 sprint to 2016-08-17 sprint

#22 Updated by Lucas Di Pentima over 4 years ago

3a1f0b616382230c32439b8d7e17ff75a132a10d

Wrote a much cleaner test using mock.patch as requested. Mocking the ArvadosFileWriter.write() method was a much simpler solution, thanks Tom!
With this test working, I realized that the ArvPutUploadJob.start() method was not updating and closing the cache file upon failure, so I corrected that by moving the corresponding block to the finally: clause.

#23 Updated by Tom Clegg over 4 years ago

Is there a possibility that _get_manifest_text() will contain a "hash+size" entry for a block that hasn't actually been written yet? (I think this could result in arv-put resuming after that block, and then failing at the end when it tries to save the manifest, because that block doesn't have a permission token. Or, if permissions aren't being used, the problem won't be detected until someone actually tries to read the data and it's missing.)

If nothing else prevents this from happening, I think we can fix it in Collection._get_manifest_text() like this:

                for segment in arvfile.segments():
                    loc = segment.locator
                    if arvfile.parent._my_block_manager().is_bufferblock(loc):
-                       loc = arvfile.parent._my_block_manager().get_bufferblock(loc).locator()
+                       continue
                    if strip:
                        loc = KeepLocator(loc).stripped()
                    filestream.append(LocatorAndRange(loc, KeepLocator(loc).size,
                                         segment.segment_offset, segment.range_size))

Although I'm not certain this doesn't have consequences elsewhere, it does seem unlikely that the fake locator returned by _BufferBlock.locator() would ever be useful in any manifest...?

#24 Updated by Tom Clegg over 4 years ago

Other tests are still failing -- I suspect arvados.collection.Collection.save_new = coll_save_mock here is leaking into other tests and needs to be converted to a "with mock.patch":

     def test_api_error_handling(self):
-        collections_mock = mock.Mock(name='arv.collections()')
-        coll_create_mock = collections_mock().create().execute
-        coll_create_mock.side_effect = arvados.errors.ApiError(
+        coll_save_mock = mock.Mock(name='arv.collection.Collection().save_new()')
+        coll_save_mock.side_effect = arvados.errors.ApiError(
             fake_httplib2_response(403), '{}')
-        arv_put.api_client = arvados.api('v1')
-        arv_put.api_client.collections = collections_mock
+        arvados.collection.Collection.save_new = coll_save_mock
         with self.assertRaises(SystemExit) as exc_test:
             self.call_main_with_args(['/dev/null'])
         self.assertLess(0, exc_test.exception.args[0])
-        self.assertLess(0, coll_create_mock.call_count)
+        self.assertLess(0, coll_save_mock.call_count)
         self.assertEqual("", self.main_stdout.getvalue())

#25 Updated by Lucas Di Pentima over 4 years ago

Tom Clegg wrote:

Other tests are still failing -- I suspect arvados.collection.Collection.save_new = coll_save_mock here is leaking into other tests and needs to be converted to a "with mock.patch":

[...]

Corrected at 2c640b0e791d9997ff87f4ec2b33af7781286af5

#26 Updated by Lucas Di Pentima over 4 years ago

Tom Clegg wrote:

Is there a possibility that _get_manifest_text() will contain a "hash+size" entry for a block that hasn't actually been written yet? (I think this could result in arv-put resuming after that block, and then failing at the end when it tries to save the manifest, because that block doesn't have a permission token. Or, if permissions aren't being used, the problem won't be detected until someone actually tries to read the data and it's missing.)
[...]
Although I'm not certain this doesn't have consequences elsewhere, it does seem unlikely that the fake locator returned by _BufferBlock.locator() would ever be useful in any manifest...?

Corrected at 7b227c653e31449f6c23b4b8d933172bbfb2b172

Added a new parameter to _get_manifest_text() so that we can explicitly ask for a manifest text that includes only committed blocks, because the previous behaviour is relied on by portable_manifest_text() users.
Also added a test for this new feature.

#27 Updated by Tom Clegg over 4 years ago

This looks good but this assertion should surely be an assertEqual, and that would mean we have to decide what the manifest text should look like.

            self.assertNotEqual(
                c._get_manifest_text(".",
                                     strip=False,
                                     normalize=False,
                                     only_committed=True),
                ". 781e5e245d69b566979b86e28d23f2c7+10 0:10:count.txt\n")

I have a suspicion we'll get a horribly wrong manifest if we do something like this:

foo = c.open('foo')
foo.write('foo')

waz = c.open('waz')
waz.write('waz')
waz.close()

self.assertEqual(
    c._get_manifest_text(".",
                         strip=False,
                         normalize=False,
                         only_committed=True),
    ". 4d20280d5e516a0109768d49ab0f3318+3 0:0:foo 0:3:waz\n")
# (". 4d20280d5e516a0109768d49ab0f3318+3 0:3:waz\n" would also be OK)

I think that test will establish whether we're already DTRT. If not, we should figure out how to fix it, because (I think) otherwise arv-put will save bogus manifests in its resume cache.

#28 Updated by Lucas Di Pentima over 4 years ago

Updated test @ fe0d8506f80a7ce5f1d412823a09a5d7324b7161

Collection is doing the right thing when uncommitted blocks exist. The produced manifest text is ok.

#29 Updated by Tom Clegg over 4 years ago

Great. I see only two small things to fix before merging:
  1. Currently ^C seems to stop the upload, but leave the cache/progress updater running forever until SIGKILL. I think the fix is just:
    • +self._checkpointer.daemon = True
       self._checkpointer.start()
      
    • (We could go a bit further by catching interrupts and writing the resume cache one more time before exiting, but the easy fix should be nearly as good in practice, so I propose we stick with the easy fix for now.)
  2. whitespace
    • $ git diff --check master...
      sdk/python/arvados/commands/put.py:475: trailing whitespace.
      +            
      

(Summary of offline discussion) Merging now would mean switching to a new block-packing strategy which we know will be short-lived, so let's merge just the SDK changes (and related test updates) but hold off merging the new arv-put until we teach arvfile/collection how to pack small files into big blocks (#9701).

#30 Updated by Lucas Di Pentima over 4 years ago

  • Target version changed from 2016-08-17 sprint to 2016-08-31 sprint

#31 Updated by Lucas Di Pentima over 4 years ago

6d162bfdf0c3cbfc2d723ecff4153f85daa2d003

Applied the suggested change and tried cancelling a big upload with ctrl-c; the thread finished as expected.

#32 Updated by Lucas Di Pentima over 4 years ago

Created a new branch with only those changes to the Python SDK so we can merge it to master.
Please check branch '9463-pysdk-additions-only' @ 19693ea74cd034c4019899959b0f5e66c4fa08ba
Tried a Jenkins run with it (https://ci.curoverse.com/job/developer-test-job/146/) and only some unrelated CWL tests failed, so I think it's OK to merge to master.

#33 Updated by Lucas Di Pentima over 4 years ago

Updated CWL tests so they match Collection's new behaviour: 63f93d4 (9463-pysdk-amended branch)

#34 Updated by Tom Morris over 4 years ago

  • Target version changed from 2016-08-31 sprint to Arvados Future Sprints

#35 Updated by Radhika Chippada over 4 years ago

  • Target version changed from Arvados Future Sprints to 2016-09-14 sprint

#36 Updated by Radhika Chippada over 4 years ago

  • Target version deleted (2016-09-14 sprint)

#37 Updated by Tom Morris over 4 years ago

  • Target version set to 2016-09-28 sprint

#38 Updated by Radhika Chippada over 4 years ago

  • Target version changed from 2016-09-28 sprint to 2016-10-12 sprint

#39 Updated by Lucas Di Pentima over 4 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 50 to 100

Applied in changeset arvados c98596a46dc7c166acb2a26b742a09ae0493224a.
