https://dev.arvados.org/https://dev.arvados.org/favicon.ico?15576888422016-03-24T17:47:13ZArvadosArvados - Idea #8791: [SDK] CollectionWriter file packing causes sub-optimal deduplicationhttps://dev.arvados.org/issues/8791?journal_id=369122016-03-24T17:47:13ZPeter Amstutzpeter.amstutz@curii.com
<ul><li><strong>Subject</strong> changed from <i>[SDK] arv-put file packing breaks deduplication</i> to <i>[SDK] CollectionWriter file packing breaks deduplication</i></li><li><strong>Description</strong> updated (<a title="View differences" href="/journals/36912/diff?detail_id=36112">diff</a>)</li></ul> Arvados - Idea #8791: [SDK] CollectionWriter file packing causes sub-optimal deduplicationhttps://dev.arvados.org/issues/8791?journal_id=374032016-04-05T16:44:57ZBrett Smithbrett.smith@curii.com
<ul><li><strong>Target version</strong> set to <i>Arvados Future Sprints</i></li></ul> Arvados - Idea #8791: [SDK] CollectionWriter file packing causes sub-optimal deduplicationhttps://dev.arvados.org/issues/8791?journal_id=375772016-04-07T20:56:15ZPeter Amstutzpeter.amstutz@curii.com
<p>Going a bit further, we could also implement a rolling hash function to choose block boundaries during file upload, to improve deduplication even more:</p>
<p><a class="external" href="https://en.wikipedia.org/wiki/Rolling_hash">https://en.wikipedia.org/wiki/Rolling_hash</a></p>
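<p>As a rough illustration of the idea (not Arvados code), the sketch below uses a polynomial rolling hash over a sliding window to pick block boundaries from the content itself, so that inserting bytes near the start of a file shifts only the first block rather than every later one. All names and constants here (<code>WINDOW</code>, <code>MASK</code>, <code>MIN_CHUNK</code>, <code>MAX_CHUNK</code>, <code>chunk_boundaries</code>, <code>chunk_digests</code>) are hypothetical and sized for a small demo, and the hash is not keyed or cryptographically secure.</p>

```python
import hashlib

# Demo-sized, assumed parameters; a real uploader would use far larger chunks.
WINDOW = 48             # bytes covered by the rolling hash
BASE = 257              # polynomial base
MOD = (1 << 61) - 1     # large prime modulus
MASK = (1 << 12) - 1    # cut when the low 12 bits of the hash are zero
MIN_CHUNK = 1024        # never cut before this many bytes
MAX_CHUNK = 64 * 1024   # always cut at this many bytes


def chunk_boundaries(data: bytes):
    """Yield (start, end) offsets of content-defined chunks of data."""
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    start = 0
    h = 0
    for i, byte in enumerate(data):
        pos = i - start  # offset of this byte within the current chunk
        if pos >= WINDOW:
            # Slide the window: drop the oldest byte's contribution.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + byte) % MOD
        length = pos + 1
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            yield start, i + 1
            start = i + 1
            h = 0
    if start < len(data):
        yield start, len(data)  # trailing partial chunk


def chunk_digests(data: bytes):
    """Return the MD5 hex digest of each chunk, in order."""
    return [hashlib.md5(data[s:e]).hexdigest() for s, e in chunk_boundaries(data)]
```

<p>Comparing <code>chunk_digests(data)</code> with <code>chunk_digests(prefix + data)</code> shows how many blocks would still deduplicate after an insertion. With fixed-size blocks, the same insertion would typically change every block.</p>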
<p><a class="external" href="https://crypto.stackexchange.com/questions/16082/cryptographically-secure-keyed-rolling-hash-function">https://crypto.stackexchange.com/questions/16082/cryptographically-secure-keyed-rolling-hash-function</a></p> Arvados - Idea #8791: [SDK] CollectionWriter file packing causes sub-optimal deduplicationhttps://dev.arvados.org/issues/8791?journal_id=378632016-04-13T19:29:10ZTom Cleggtom@curii.com
<ul><li><strong>Subject</strong> changed from <i>[SDK] CollectionWriter file packing breaks deduplication</i> to <i>[SDK] CollectionWriter file packing causes sub-optimal deduplication</i></li></ul> Arvados - Idea #8791: [SDK] CollectionWriter file packing causes sub-optimal deduplicationhttps://dev.arvados.org/issues/8791?journal_id=378852016-04-14T13:43:55ZPeter Grandipg@arvados.for.sabi.co.uk
<p>I have written a related feature request with a different motivation (and I know some of you don't think it is important...) than deduplication here, arguing that per-collection or per-file "streams" should be an option to <code>arv-put</code>:</p>
<p><a class="external" href="https://dev.arvados.org/issues/8992">https://dev.arvados.org/issues/8992</a></p> Arvados - Idea #8791: [SDK] CollectionWriter file packing causes sub-optimal deduplicationhttps://dev.arvados.org/issues/8791?journal_id=379122016-04-14T22:54:43ZBrett Smithbrett.smith@curii.com
<p>Peter Grandi wrote:</p>
<blockquote>
<p>I have written a related feature request with a different motivation (and I know some of you don't think it is important...) than deduplication here, arguing that per-collection or per-file "streams" should be an option to <code>arv-put</code>:</p>
<p><a class="external" href="https://dev.arvados.org/issues/8992">https://dev.arvados.org/issues/8992</a></p>
</blockquote>
<p>I understand the current story description is technical enough that it's not easy to tell, but what this ticket proposes is to have the default behavior be at least close to the per-file behavior you request in <a class="issue tracker-2 status-5 priority-4 priority-default closed" title="Feature: arv-put: option to create a &quot;stream&quot; per-file (default) or per-collection (Closed)" href="https://dev.arvados.org/issues/8992">#8992</a>. We might continue to combine "small" files into a single block (where the definition of "small" is TBD, and may be configurable, but the default threshold would probably be between 4 and 16 MiB), but we would make sure larger files get stored as an independent set of blocks, for the deduplication benefits you've talked about there and on IRC.</p>
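<p>A minimal sketch of that packing policy, under stated assumptions: <code>plan_blocks</code> and <code>SMALL_FILE_LIMIT</code> are hypothetical names, the 8 MiB threshold is just one value inside the 4 to 16 MiB range mentioned above, and 64 MiB is used as the block size. It only plans which file segments share a block; it does not call the Arvados SDK.</p>

```python
BLOCK_SIZE = 64 * 2**20        # assumed maximum block size (64 MiB)
SMALL_FILE_LIMIT = 8 * 2**20   # hypothetical threshold for a "small" file


def plan_blocks(files, small_limit=SMALL_FILE_LIMIT, block_size=BLOCK_SIZE):
    """Plan block contents for an iterable of (name, size) pairs.

    Returns a list of blocks; each block is a list of
    (name, offset_in_file, length) segments.
    """
    blocks = []
    pending = []       # small-file segments waiting to share one block
    pending_size = 0

    def flush():
        nonlocal pending, pending_size
        if pending:
            blocks.append(pending)
            pending, pending_size = [], 0

    for name, size in files:
        if size <= small_limit:
            # Small file: pack it with its neighbours, starting a new
            # shared block if this one would overflow.
            if pending_size + size > block_size:
                flush()
            pending.append((name, 0, size))
            pending_size += size
        else:
            # Large file: close the shared block first so the file starts
            # on a block boundary, then give it blocks of its own.
            flush()
            for offset in range(0, size, block_size):
                length = min(block_size, size - offset)
                blocks.append([(name, offset, length)])
    flush()
    return blocks
```

<p>Because a large file's blocks depend only on that file's contents, uploading the same file again in a different collection, or next to different neighbours, would produce the same blocks. Flushing the shared block before each large file is one simple choice; keeping it open across large files would pack small files more tightly at the cost of reordering data.</p>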
<p>If that was the default behavior of <code>arv keep put</code>, would that be enough to obviate the need for <a class="issue tracker-2 status-5 priority-4 priority-default closed" title="Feature: arv-put: option to create a &quot;stream&quot; per-file (default) or per-collection (Closed)" href="https://dev.arvados.org/issues/8992">#8992</a>?</p> Arvados - Idea #8791: [SDK] CollectionWriter file packing causes sub-optimal deduplicationhttps://dev.arvados.org/issues/8791?journal_id=379132016-04-14T22:57:15ZBrett Smithbrett.smith@curii.com
<p>Peter Amstutz wrote:</p>
<blockquote>
<p>CollectionWriter should ensure that large files are aligned to block boundaries</p>
</blockquote>
<p>Do we actually have to change the behavior of CollectionWriter? Or would it be sufficient to change the behavior of <code>arv keep put</code>, either by having it "flush" CollectionWriter more often or by upgrading it to use the Collection class instead?</p>
<p>I'd rather just change <code>arv keep put</code>, and not muck with CollectionWriter. If there's some reason that's not good enough from a user perspective, it'd be good to know.</p> Arvados - Idea #8791: [SDK] CollectionWriter file packing causes sub-optimal deduplicationhttps://dev.arvados.org/issues/8791?journal_id=427662016-09-06T18:50:10ZTom Morristfmorris@veritasgenetics.com
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>Duplicate</i></li></ul><p>Addressed by <a class="issue tracker-1 status-3 priority-4 priority-default closed parent" title="Bug: [SDKs] Python SDK Collection class should pack small files into large data blocks (Resolved)" href="https://dev.arvados.org/issues/9701">#9701</a></p> Arvados - Idea #8791: [SDK] CollectionWriter file packing causes sub-optimal deduplicationhttps://dev.arvados.org/issues/8791?journal_id=472872017-01-18T06:07:47ZTom Morristfmorris@veritasgenetics.com
<ul><li><strong>Target version</strong> deleted (<del><i>Arvados Future Sprints</i></del>)</li></ul>