https://dev.arvados.org/https://dev.arvados.org/favicon.ico?15576888422014-05-07T12:02:21ZArvadosArvados - Idea #2752: arv-put can quickly resume an interrupted transfer.https://dev.arvados.org/issues/2752?journal_id=102532014-05-07T12:02:21ZTom Cleggtom@curii.com
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/10253/diff?detail_id=8386">diff</a>)</li></ul> Arvados - Idea #2752: arv-put can quickly resume an interrupted transfer.https://dev.arvados.org/issues/2752?journal_id=103412014-05-07T15:33:35ZBrett Smithbrett.smith@curii.com
<ul><li><strong>Assigned To</strong> set to <i>Brett Smith</i></li></ul> Arvados - Idea #2752: arv-put can quickly resume an interrupted transfer.https://dev.arvados.org/issues/2752?journal_id=105022014-05-14T12:05:23ZTom Cleggtom@curii.com
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/10502/diff?detail_id=8694">diff</a>)</li></ul> Arvados - Idea #2752: arv-put can quickly resume an interrupted transfer.https://dev.arvados.org/issues/2752?journal_id=107572014-05-26T08:59:09ZBrett Smithbrett.smith@curii.com
<ul></ul><p>I just uploaded 2752-arv-put-resume-wip for review. This implements approach #2 in the simplest possible way. Every time arv-put writes data to Keep, it serializes its state to disk. If there's already state available, it will check that for freshness, and resume if everything checks out.</p>
<p>Files on disk are considered unchanged if their mtime and size stay the same. This is the same heuristic that rsync uses by default. arv-put saves every file's full stat data, so we have room to change the heuristic in the future if we wish.</p>
<p>A few other notes that might help the review:</p>
<ul>
<li>Testing infrastructure for our Python modules was moved to a <code>tests</code> subdirectory for a cleaner file tree. Running <code>git diff -M</code> will report those renames without including their full diff. Run tests from a module directory with <code>python -m unittest discover tests</code></li>
<li><a class="changeset" title="2752: Move arv-put functionality to arvados.commands.put. This will make it easier to unit test ..." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/a1e2086ee001f2aaa26244c76f1ebbe1921e5682">a1e2086e</a> moves arv-put's functionality to arvados.commands.put for improved testability. This is also mostly a rename, but I did move code into functions, so you get a huge diff even with -M.</li>
</ul> Arvados - Idea #2752: arv-put can quickly resume an interrupted transfer.https://dev.arvados.org/issues/2752?journal_id=108232014-05-27T17:10:08ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><ol>
<li>General comment: could use more comments outlining what the code is supposed to be doing. The most useful comments capture intention -- "this is what the code is supposed to be doing" -- so that as a reviewer I can check the intention against what the code is actually doing.</li>
<li>ResumeCache constructor -- bad form to use exceptions for expected control flow, much clearer to test isinstance(args, str)</li>
<li>ResumeCache.make_path() -- I like the idea of hashing the filename or path to determine the checkpoint file. I think this could be expanded to accomplish the function of ResumableCollectionWriter.check_dependencies() by hashing the entire file list, with size and mtime, so it only resumes if nothing has changed. Alternately, ResumableCollectionWriter.check_dependencies() could be more subtle and properly detect which files are changed/added (+ queued) and only upload those, instead of raising StaleWriterError.</li>
<li>expected_bytes_for(pathlist) has a section:<br /><pre>
if os.path.isdir(path):
for filename in arvados.util.listdir_recursive(path):
bytesum += os.path.getsize(os.path.join(path, filename))
elif not os.path.isfile(path):
return None
else:
bytesum += os.path.getsize(path)
</pre><br />Since the other code paths return int, the "None" code path seems likely to cause an exception (I don't know if you wrote this code or it was there before, but now would be a good time to fix it either way).</li>
</ol> Arvados - Idea #2752: arv-put can quickly resume an interrupted transfer.https://dev.arvados.org/issues/2752?journal_id=109132014-05-28T17:09:35ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><ol>
<li>0% bug: <br /><pre>
16M / 33M 0.0%
</pre></li>
<li>When I interrupt it and then start it again, it sometimes takes a while before the progress line shows up (presumably because it is finishing the current block). It should print the starting progress state immediately.</li>
<li>When resuming, it should print a message to the user like "resuming upload X, use --no-resume if you want to start over" </li>
<li>Displaying bytes per second and estimated time remaining on the progress line would be very nice</li>
<li>A --verbose option that prints each file name as it finishes uploading would be nice</li>
<li>run_test_server.py is broken for external tests (sdk/go/src/arvados.org/keepclient and services/keep/src/arvados.org/keepproxy) that use it to help run integration tests.</li>
</ol> Arvados - Idea #2752: arv-put can quickly resume an interrupted transfer.https://dev.arvados.org/issues/2752?journal_id=109232014-05-29T09:42:00ZBrett Smithbrett.smith@curii.com
<ul></ul><p>Peter Amstutz wrote:</p>
<blockquote>
<ol>
<li>ResumeCache.make_path() -- I like the idea of hashing the filename or path to determine the checkpoint file. I think this could be expanded to accomplish the function of ResumableCollectionWriter.check_dependencies() by hashing the entire file list, with size and mtime, so it only resumes if nothing has changed. Alternately, ResumableCollectionWriter.check_dependencies() could be more subtle and properly detect which files are changed/added (+ queued) and only upload those, instead of raising StaleWriterError.</li>
</ol>
</blockquote>
<p>To capture some of our discussion for yesterday: the second idea is sort of what I started out trying to implement, but it's tricky because what CollectionWriter actually puts in Keep is affected not only by the contents of the files/directories, but also command-line arguments like --filename and --max-manifest-depth. So you have to cache results not only based on the contents of files, but how they're being stored in Keep as well. This seems like it ought to be doable, but figuring out how to do it in a way that would be most useful for the immediate use case in this story seemed like a bigger approach than was called for.</p>
<p>The current approach lets us generate a very specific reason why we're not reusing a cache, which are reflected in the different messages of StaleWriterStateErrors raised. arv-put isn't doing anything with that information currently, but the current architecture leaves our options open. If that work became part of the process of generating a cache filename, you have no way of knowing why an upload didn't resume—all you know is things hashed out differently.</p>
<blockquote>
<ol>
<li>expected_bytes_for(pathlist) has a section:<br />[...]<br />Since the other code paths return int, the "None" code path seems likely to cause an exception (I don't know if you wrote this code or it was there before, but now would be a good time to fix it either way).</li>
</ol>
</blockquote>
<p>Ultimately this result makes it to the progress-reporting functions, which sometimes divide bytes_written by bytes_expected to generate a percentage. Any integer sentinel value we choose would have to be guarded against just as None is: 0 threatens to divide by zero, and anything negative would produce nonsense percentages. Given that there <strong>must</strong> be a guard, I think None most clearly expresses the intent of "we don't know how many bytes to expect."</p>
<blockquote>
<ol>
<li>Displaying bytes per second and estimated time remaining on the progress line would be very nice</li>
<li>A --verbose option that prints each file name as it finishes uploading would be nice</li>
</ol>
</blockquote>
<p>I think these features both sound cool, but they're not immediately related to this story and it would be preferable to task them out separately.</p>
<p>I've pushed more commits that address your other comments; the branch is currently at <a class="changeset" title="2752: Adapt Go tests to the new Python test infrastructure. Refs #2752. There's two pieces to t..." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/891d6b276f1e37d85d1a66496e045cd7ab6807e1">891d6b2</a>. Please take another look. Thanks.</p> Arvados - Idea #2752: arv-put can quickly resume an interrupted transfer.https://dev.arvados.org/issues/2752?journal_id=109422014-05-29T11:28:31ZBrett Smithbrett.smith@curii.com
<ul></ul><p>From in-office discussion:</p>
<ul>
<li>Resumed uploads report progress past 100%.</li>
<li>Too much performance regression from master.</li>
</ul> Arvados - Idea #2752: arv-put can quickly resume an interrupted transfer.https://dev.arvados.org/issues/2752?journal_id=109482014-05-29T13:11:57ZBrett Smithbrett.smith@curii.com
<ul></ul><p>Brett Smith wrote:</p>
<blockquote>
<p>From in-office discussion:</p>
<ul>
<li>Resumed uploads report progress past 100%.</li>
<li>Too much performance regression from master.</li>
</ul>
</blockquote>
<p>Both fixed as of <a class="changeset" title="Merge branch 'master' into 2752-arv-put-resume-wip" href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/159f578decf491f23961440d9b4e60f32f2fc231">159f578</a>. Ready for another look.</p> Arvados - Idea #2752: arv-put can quickly resume an interrupted transfer.https://dev.arvados.org/issues/2752?journal_id=109502014-05-29T13:14:52ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>$ arv-put golang-1.2.1/<br />Traceback (most recent call last):<br /> File "/home/peter/work/arvados/sdk/cli/bin/arv-put", line 4, in <module><br /> main()<br /> File "/home/peter/work/arvados/sdk/python/arvados/commands/put.py", line 354, in main<br /> writer.cache_state()<br />UnboundLocalError: local variable 'writer' referenced before assignment</p> Arvados - Idea #2752: arv-put can quickly resume an interrupted transfer.https://dev.arvados.org/issues/2752?journal_id=109522014-05-29T14:03:00ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>Using ^C I get at least two different stack traces, we should treat ^C as expected and handle it gracefully.</p>
<p>The first time I run arv-put and hit ^C I get this:<br /><pre>
$ arv-put golang-1.2.1/
10M / 33M 29.9% ^CTraceback (most recent call last):
File "/home/peter/work/arvados/sdk/cli/bin/arv-put", line 4, in <module>
main()
File "/home/peter/work/arvados/sdk/python/arvados/commands/put.py", line 349, in main
path, max_manifest_depth=args.max_manifest_depth)
File "/home/peter/work/arvados/sdk/python/arvados/commands/put.py", line 286, in write_directory_tree
path, stream_name, max_manifest_depth)
File "/home/peter/work/arvados/sdk/python/arvados/collection.py", line 264, in write_directory_tree
self._do_queued_work()
File "/home/peter/work/arvados/sdk/python/arvados/collection.py", line 192, in _do_queued_work
self._work_dirents()
File "/home/peter/work/arvados/sdk/python/arvados/collection.py", line 218, in _work_dirents
self.start_new_stream(stream_name)
File "/home/peter/work/arvados/sdk/python/arvados/collection.py", line 315, in start_new_stream
self.finish_current_stream()
File "/home/peter/work/arvados/sdk/python/arvados/collection.py", line 329, in finish_current_stream
self.flush_data()
File "/home/peter/work/arvados/sdk/python/arvados/commands/put.py", line 265, in flush_data
super(ArvPutCollectionWriter, self).flush_data()
File "/home/peter/work/arvados/sdk/python/arvados/collection.py", line 280, in flush_data
self._current_stream_locators += [Keep.put(data_buffer[0:self.KEEP_BLOCK_SIZE])]
File "/home/peter/work/arvados/sdk/python/arvados/keep.py", line 126, in put
return Keep.global_client_object().put(data, **kwargs)
File "/home/peter/work/arvados/sdk/python/arvados/keep.py", line 485, in put
t.join()
File "/usr/lib/python2.7/threading.py", line 949, in join
self.__block.wait()
File "/usr/lib/python2.7/threading.py", line 339, in wait
waiter.acquire()
KeyboardInterrupt
</pre></p>
<p>Each subsequent time I hit ^C I get this. In addition, each time it restarts at 10M instead of whatever it was interrupted at.</p>
<pre>
arv-put golang-1.2.1/
arv-put: Resuming previous upload. Bypass with the --no-resume option.
10M / 33M 29.9% ^CTraceback (most recent call last):
File "/home/peter/work/arvados/sdk/cli/bin/arv-put", line 4, in <module>
main()
File "/home/peter/work/arvados/sdk/python/arvados/commands/put.py", line 354, in main
writer.cache_state()
UnboundLocalError: local variable 'writer' referenced before assignment
</pre> Arvados - Idea #2752: arv-put can quickly resume an interrupted transfer.https://dev.arvados.org/issues/2752?journal_id=109612014-05-29T15:35:27ZTom Cleggtom@curii.com
<ul><li><strong>Target version</strong> changed from <i>2014-05-28 Pipeline Factory</i> to <i>2014-06-17 Curating and Crunch</i></li></ul> Arvados - Idea #2752: arv-put can quickly resume an interrupted transfer.https://dev.arvados.org/issues/2752?journal_id=109692014-05-29T17:00:33ZBrett Smithbrett.smith@curii.com
<ul><li><strong>Target version</strong> changed from <i>2014-06-17 Curating and Crunch</i> to <i>2014-05-28 Pipeline Factory</i></li></ul><ul>
<li>arv-put now installs signal handlers to write state on SIGINT, SIGTERM, and SIGQUIT.</li>
<li>State is also cached approximately every 64MiB, so we can recover from pull-the-plug scenarios.</li>
</ul> Arvados - Idea #2752: arv-put can quickly resume an interrupted transfer.https://dev.arvados.org/issues/2752?journal_id=109742014-05-30T09:08:27ZBrett Smithbrett.smith@curii.com
<ul><li><strong>Target version</strong> changed from <i>2014-05-28 Pipeline Factory</i> to <i>2014-06-17 Curating and Crunch</i></li></ul> Arvados - Idea #2752: arv-put can quickly resume an interrupted transfer.https://dev.arvados.org/issues/2752?journal_id=109752014-05-30T09:09:41ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul>Looks good! Almost there. Just a few more tweaks.
<ol>
<li>Change the resume text to say <br /><pre>
Resuming previous upload from last checkpoint. Use the --no-resume option to start over.
</pre></li>
<li>When it finished the upload, I get this:<br /><pre>
33M / 33M 99.9% fd18ec02e561035f1555c368e68d8b76+110336
</pre>
<ol>
<li>It should show 100%</li>
<li>The manifest id should go on its own line, possibly with some text saying "saved to collection id XXX+N" </li>
</ol>
</li>
<li>It should retain the checkpoint file even for completed uploads, so if I try uploading the directory a second time it is smart enough to know it already did the work.</li>
</ol> Arvados - Idea #2752: arv-put can quickly resume an interrupted transfer.https://dev.arvados.org/issues/2752?journal_id=109782014-05-30T09:53:22ZBrett Smithbrett.smith@curii.com
<ul></ul><p>Peter Amstutz wrote:</p>
<blockquote>
Looks good! Almost there. Just a few more tweaks.
<ol>
<li>Change the resume text to say <br />[...]</li>
</ol>
</blockquote>
<p>Done.</p>
<blockquote>
<ol>
<li>When it finished the upload, I get this:<br />[...]
<ol>
<li>It should show 100%</li>
</ol></li>
</ol>
</blockquote>
<p>Done.</p>
<blockquote>
<ol>
<li>The manifest id should go on its own line, possibly with some text saying "saved to collection id XXX+N"</li>
</ol>
</blockquote>
<p>It's on its own line now. We can't change the output format because tools like crunch-job parse it as is.</p>
<blockquote>
<ol>
<li>It should retain the checkpoint file even for completed uploads, so if I try uploading the directory a second time it is smart enough to know it already did the work.</li>
</ol>
</blockquote>
<p>Per discussion on IRC: this sort of caching would be a cool feature, but it's not what was specced for this story. This is about resuming interrupted uploads. Deleting the state file is helpful because it avoids messages about restarting uploads that successfully finished. It also follows the philosophy of "there was no interruption, so there's nothing to resume."</p>
<p>Changes pushed to <a class="changeset" title="2752: arv-put explains resumed uploads in more detail. Wording suggested by Peter in refs #2752." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/cbd6078842b5b2893d5eded02ee14b1d47432754">cbd6078</a>.</p> Arvados - Idea #2752: arv-put can quickly resume an interrupted transfer.https://dev.arvados.org/issues/2752?journal_id=109842014-05-30T10:50:12ZBrett Smithbrett.smith@curii.com
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>Resolved</i></li><li><strong>% Done</strong> changed from <i>0</i> to <i>100</i></li></ul><p>Applied in changeset arvados|commit:ffe3cdbc8c37e2b4a4e3ea4f67c1c9ca5d81e2ed.</p>