https://dev.arvados.org/https://dev.arvados.org/favicon.ico?15576888422016-08-03T14:41:10ZArvadosArvados - Bug #9701: [SDKs] Python SDK Collection class should pack small files into large data blockshttps://dev.arvados.org/issues/9701?journal_id=416072016-08-03T14:41:10ZTom Cleggtom@curii.com
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/41607/diff?detail_id=40303">diff</a>)</li></ul> Arvados - Bug #9701: [SDKs] Python SDK Collection class should pack small files into large data blockshttps://dev.arvados.org/issues/9701?journal_id=427652016-09-06T18:48:18ZTom Morristfmorris@veritasgenetics.com
<ul><li><strong>Target version</strong> set to <i>Arvados Future Sprints</i></li><li><strong>Story points</strong> set to <i>2.0</i></li></ul> Arvados - Bug #9701: [SDKs] Python SDK Collection class should pack small files into large data blockshttps://dev.arvados.org/issues/9701?journal_id=430982016-09-14T19:51:36ZTom Morristfmorris@veritasgenetics.com
<ul><li><strong>Target version</strong> changed from <i>Arvados Future Sprints</i> to <i>2016-09-28 sprint</i></li></ul> Arvados - Bug #9701: [SDKs] Python SDK Collection class should pack small files into large data blockshttps://dev.arvados.org/issues/9701?journal_id=431202016-09-14T20:01:47ZLucas Di Pentimalucas.dipentima@curii.com
<ul><li><strong>Assigned To</strong> set to <i>Lucas Di Pentima</i></li></ul> Arvados - Bug #9701: [SDKs] Python SDK Collection class should pack small files into large data blockshttps://dev.arvados.org/issues/9701?journal_id=433902016-09-22T20:29:52ZLucas Di Pentimalucas.dipentima@curii.com
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>In Progress</i></li></ul> Arvados - Bug #9701: [SDKs] Python SDK Collection class should pack small files into large data blockshttps://dev.arvados.org/issues/9701?journal_id=435392016-09-28T18:39:49ZRadhika Chippadaradhika@curoverse.com
<ul><li><strong>Target version</strong> changed from <i>2016-09-28 sprint</i> to <i>2016-10-12 sprint</i></li></ul> Arvados - Bug #9701: [SDKs] Python SDK Collection class should pack small files into large data blockshttps://dev.arvados.org/issues/9701?journal_id=436942016-10-05T18:40:21ZLucas Di Pentimalucas.dipentima@curii.com
<ul></ul><p>Branch <code>9701-collection-pack-small-files-alt</code> ready for preliminary review @ <a class="changeset" title="9701: Changes on the Python SDK to allow small file packing on Collection class: * Added optional..." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/480df25dc679998b53f9e7299244ac1ff3f90114">480df25dc679998b53f9e7299244ac1ff3f90114</a></p>
<p>Specific tests still pending. <br />Just want to check if the general idea where this is going is correct.<br />Thanks</p> Arvados - Bug #9701: [SDKs] Python SDK Collection class should pack small files into large data blockshttps://dev.arvados.org/issues/9701?journal_id=436992016-10-05T21:21:07ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>Instead of a "closed" flag on ArvadosFile, there should be a reference count of the number of ArvadosFileWriters are active.</p>
<p>This docstring is out of sync with the code:</p>
<pre>
def set_closed(self):
"""Set current block as pending and closed flag to False"""
self._closed = True
self.parent._my_block_manager().repack_small_blocks()
</pre>
<p>When ArvadosFile size > config.KEEP_BLOCK_SIZE/2 it should flush on close and skip repack_small_blocks(). Current behavior looks like it will not combine blocks when the owner file is greater than config.KEEP_BLOCK_SIZE/2, but it will not commit those blocks either.</p>
<p>In repack_small_blocks(), I think we can wait until pending_write_size > config.KEEP_BLOCK_SIZE unless force is true (the current behavior looks like it will tend to create a lot of blocks which are just over the config.KEEP_BLOCK_SIZE/2 limit when they could have been packed more.) The loop which packs new blocks should also continue to add data until size+new data > config.KEEP_BLOCK_SIZE.</p>
<p>repack_small_blocks() should take a sync parameter to pass to self.commit_bufferblock() instead of always calling with sync=True.</p> Arvados - Bug #9701: [SDKs] Python SDK Collection class should pack small files into large data blockshttps://dev.arvados.org/issues/9701?journal_id=437282016-10-06T20:41:45ZLucas Di Pentimalucas.dipentima@curii.com
<ul></ul><p>Peter Amstutz wrote:</p>
<blockquote>
<p>Instead of a "closed" flag on ArvadosFile, there should be a reference count of the number of ArvadosFileWriters are active.</p>
</blockquote>
<p>Added a set() to hold references to the ArvadosFileWriter objects that keep every ArvadosFile object open. I did it this way instead of just a simple counter so that every ArvadosFileWriter can remove itself once.</p>
<blockquote>
<p>When ArvadosFile size > config.KEEP_BLOCK_SIZE/2 it should flush on close and skip repack_small_blocks(). Current behavior looks like it will not combine blocks when the owner file is greater than config.KEEP_BLOCK_SIZE/2, but it will not commit those blocks either.</p>
<p>In repack_small_blocks(), I think we can wait until pending_write_size > config.KEEP_BLOCK_SIZE unless force is true (the current behavior looks like it will tend to create a lot of blocks which are just over the config.KEEP_BLOCK_SIZE/2 limit when they could have been packed more.) The loop which packs new blocks should also continue to add data until size+new data > config.KEEP_BLOCK_SIZE.</p>
<p>repack_small_blocks() should take a sync parameter to pass to self.commit_bufferblock() instead of always calling with sync=True.</p>
</blockquote>
<p>Updated to meet this requests. Also added a test to check for small blocks packing.</p>
<p>Branch ready at <a class="changeset" title="9701: Fixed a previous test to match new flush() behaviour. Added a new one to check for small bl..." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/bf6ef981cea7e923a085c0a9231cebb379c7560a">bf6ef98</a><br />Test suite running at: <a class="external" href="https://ci.curoverse.com/job/developer-test-job/206/">https://ci.curoverse.com/job/developer-test-job/206/</a></p> Arvados - Bug #9701: [SDKs] Python SDK Collection class should pack small files into large data blockshttps://dev.arvados.org/issues/9701?journal_id=437642016-10-07T17:29:24ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>Why does <code>add_writer()</code> check <code>isinstance(writer, ArvadosFileWriter)</code> but <code>remove_writer()</code> does not?</p>
<p><code>repack_small_blocks()</code> needs @synchronized.</p>
<p>We can simplify this:</p>
<pre>
small_blocks = [b for b in self._bufferblocks.values() if b.state() == _BufferBlock.WRITABLE and b.owner and b.owner.closed() and b.owner.size() <= (config.KEEP_BLOCK_SIZE / 2)]
</pre>
<p>We should be able to make the following assumptions:</p>
<ul>
<li>A WRITABLE block always has an owner (if it doesn't, that's a bug).</li>
<li>If the block is WRITABLE and b.owner.closed() is True, we can assume b.owner.size() <= (config.KEEP_BLOCK_SIZE / 2) because otherwise the block would have been put into PENDING state when it was flushed.</li>
</ul>
<p>So this can be:</p>
<pre>
small_blocks = [b for b in self._bufferblocks.values() if b.state() == _BufferBlock.WRITABLE and b.owner.closed()]
</pre>
<p>Change <code>self._bufferblocks = {}</code> to use <code>collections.OrderedDict</code>. This will pack the blocks in the order the buffer blocks were created, rather than the scrambled order when iterating over a regular dict.</p>
<p>This is noticeable in test_only_small_blocks_are_packed_together. The file "count.txt" is written first, but in the manifest "34603011:10:count.txt 34603008:3:foo.txt" we see that "foo.txt" was packed first (34603008 < 34603011). Without a stable packing order, this test could fail randomly.</p>
<p>In <code>repack_small_blocks()</code>, call <code>self.alloc_bufferblock()</code> instead of creating a new BufferBlock directly.</p>
<p>In <code>repack_small_blocks()</code>, instead of maintaining a separate "size" variable you can use new_bb.write_pointer.</p>
<p><code>alloc_bufferblock()</code> is using <code>len(self._bufferblocks)</code> to generate bufferblock ids, which seems like a bad idea as it could easily lead to block id reuse (this one is my fault). Change it to generate id with a random number or uuid.uuid4().</p>
<p>In <code>repack_small_blocks()</code>, it should recalculate <code>pending_write_size</code> at the bottom of the loop after <code>self.commit_bufferblock</code>.</p> Arvados - Bug #9701: [SDKs] Python SDK Collection class should pack small files into large data blockshttps://dev.arvados.org/issues/9701?journal_id=438062016-10-10T12:15:55ZLucas Di Pentimalucas.dipentima@curii.com
<ul></ul><p>Peter Amstutz wrote:</p>
<blockquote>
<p>Why does <code>add_writer()</code> check <code>isinstance(writer, ArvadosFileWriter)</code> but <code>remove_writer()</code> does not?</p>
</blockquote>
<p>In a way, that's was the idea of keeping object references instead of a counter: Do the validations on the "add" part of the code, and every ArvadosFileWriter has its "self" pointer to remove itself from the list. One use that I can think of validations, is to log relevant error messages instead of crashing with a KeyError, do you think that would be desirable?</p>
<blockquote>
<p><code>repack_small_blocks()</code> needs @synchronized.</p>
</blockquote>
<p>Updated! I didn't include that because I was getting locked up when called from commit_all(), but moved that call out of the <code>with self.lock:</code> statement, and all is well now.</p>
<blockquote>
We can simplify this:<br />[...]<br />We should be able to make the following assumptions:
<ul>
<li>A WRITABLE block always has an owner (if it doesn't, that's a bug).</li>
<li>If the block is WRITABLE and b.owner.closed() is True, we can assume b.owner.size() <= (config.KEEP_BLOCK_SIZE / 2) because otherwise the block would have been put into PENDING state when it was flushed.</li>
</ul>
</blockquote>
<p>Nice simplification, thanks! I updated as requested and added comments for future reviewers of that region to ease code understanding.</p>
<blockquote>
<p>Change <code>self._bufferblocks = {}</code> to use <code>collections.OrderedDict</code>. This will pack the blocks in the order the buffer blocks were created, rather than the scrambled order when iterating over a regular dict.<br />This is noticeable in test_only_small_blocks_are_packed_together. The file "count.txt" is written first, but in the manifest "34603011:10:count.txt 34603008:3:foo.txt" we see that "foo.txt" was packed first (34603008 < 34603011). Without a stable packing order, this test could fail randomly.</p>
</blockquote>
<p>Updated!</p>
<blockquote>
<p>In <code>repack_small_blocks()</code>, call <code>self.alloc_bufferblock()</code> instead of creating a new BufferBlock directly.</p>
</blockquote>
<p>This gets me locked up because <code>repack_small_blocks()</code> is using the same lock as <code>self.alloc_bufferblock()</code>. The same happens with <code>delete_bufferblock()</code>. I could transform those two into a wrapper of its private counterpart, do you agree?</p>
<blockquote>
<p>In <code>repack_small_blocks()</code>, instead of maintaining a separate "size" variable you can use new_bb.write_pointer.</p>
</blockquote>
<p>Updated.</p>
<blockquote>
<p><code>alloc_bufferblock()</code> is using <code>len(self._bufferblocks)</code> to generate bufferblock ids, which seems like a bad idea as it could easily lead to block id reuse (this one is my fault). Change it to generate id with a random number or uuid.uuid4().</p>
</blockquote>
<p>Updated.</p>
<blockquote>
<p>In <code>repack_small_blocks()</code>, it should recalculate <code>pending_write_size</code> at the bottom of the loop after <code>self.commit_bufferblock</code>.</p>
</blockquote>
<p><code>pending_write_size</code> is used only once per call. Do you mean to add an outer loop to try to get more repackable small blocks after the first one? I think it's not necessary because <code>repack_small_blocks()</code> is called on every file being closed, so there shouldn't be a lot of small blocks with a combined size of more than a <code>KEEP_BLOCK_SIZE</code> hanging around."</p>
<p>Updates @ <a class="changeset" title="9701: Added clarifying comments to the small block searching list comprehension." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/31568ea331306a574be758fc60b090ecee3bc005">31568ea</a></p> Arvados - Bug #9701: [SDKs] Python SDK Collection class should pack small files into large data blockshttps://dev.arvados.org/issues/9701?journal_id=438322016-10-11T17:24:32ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>Lucas Di Pentima wrote:</p>
<blockquote><blockquote>
<p>In <code>repack_small_blocks()</code>, call <code>self.alloc_bufferblock()</code> instead of creating a new BufferBlock directly.</p>
</blockquote>
<p>This gets me locked up because <code>repack_small_blocks()</code> is using the same lock as <code>self.alloc_bufferblock()</code>. The same happens with <code>delete_bufferblock()</code>. I could transform those two into a wrapper of its private counterpart, do you agree?</p>
</blockquote>
<p>Yes, I think that would be better. It avoids confusion over whether or not repack_small_blocks() is actually doing something different from alloc_bufferblock() and delete_bufferblock().</p>
<p>The rest LGTM.</p> Arvados - Bug #9701: [SDKs] Python SDK Collection class should pack small files into large data blockshttps://dev.arvados.org/issues/9701?journal_id=438352016-10-11T17:58:46ZLucas Di Pentimalucas.dipentima@curii.com
<ul></ul><p>Wrapped unsynchronized versions of <code>_alloc_bufferblock()</code> and <code>_delete_bufferblock()</code> into their synchronized counterparts.<br />Merged master into branch at: <a class="changeset" title="9701: Merge branch 'master' into 9701-collection-pack-small-files-alt" href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/e892c7ee96f28bef7d5b2a9314eb9549ee56634d">e892c7ee96f28bef7d5b2a9314eb9549ee56634d</a><br />Running all tests before merging into master: <a class="external" href="https://ci.curoverse.com/job/developer-test-job/208/">https://ci.curoverse.com/job/developer-test-job/208/</a></p> Arvados - Bug #9701: [SDKs] Python SDK Collection class should pack small files into large data blockshttps://dev.arvados.org/issues/9701?journal_id=438502016-10-11T23:25:05ZLucas Di Pentimalucas.dipentima@curii.com
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Resolved</i></li><li><strong>% Done</strong> changed from <i>0</i> to <i>100</i></li></ul><p>Applied in changeset arvados|commit:c98596a46dc7c166acb2a26b742a09ae0493224a.</p>