https://dev.arvados.org/https://dev.arvados.org/favicon.ico?15576888422015-02-26T16:33:24ZArvadosArvados - Bug #5319: [API] Portable data hash mismatches for collections with location hints (+K)https://dev.arvados.org/issues/5319?journal_id=216462015-02-26T16:33:24ZBryan Coscabcosca@curii.com
<ul><li><strong>File</strong> <a href="/attachments/497">portablehash-2.png</a> <a class="icon-only icon-download" title="Download" href="/attachments/download/497/portablehash-2.png">portablehash-2.png</a> added</li></ul><p>My current workaround is to select all the files and then create a new collection, which i guess makes a new hash. and I can use that new collection.</p>
<p><img src="https://dev.arvados.org/attachments/download/497/portablehash-2.png" alt="" /></p> Arvados - Bug #5319: [API] Portable data hash mismatches for collections with location hints (+K)https://dev.arvados.org/issues/5319?journal_id=216472015-02-26T18:18:59ZBrett Smithbrett.smith@curii.com
<ul><li><strong>Subject</strong> changed from <i>Portable Data hash does not match computed hash</i> to <i>[Workbench] Can't copy collection between projects due to portable data hash mismatch</i></li><li><strong>Category</strong> set to <i>Workbench</i></li><li><strong>Target version</strong> set to <i>Bug Triage</i></li></ul> Arvados - Bug #5319: [API] Portable data hash mismatches for collections with location hints (+K)https://dev.arvados.org/issues/5319?journal_id=216522015-02-26T18:29:41ZBrett Smithbrett.smith@curii.com
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>In Progress</i></li><li><strong>Assigned To</strong> set to <i>Brett Smith</i></li></ul> Arvados - Bug #5319: [API] Portable data hash mismatches for collections with location hints (+K)https://dev.arvados.org/issues/5319?journal_id=216532015-02-26T18:29:55ZBrett Smithbrett.smith@curii.com
<ul><li><strong>Target version</strong> changed from <i>Bug Triage</i> to <i>2015-03-11 sprint</i></li></ul> Arvados - Bug #5319: [API] Portable data hash mismatches for collections with location hints (+K)https://dev.arvados.org/issues/5319?journal_id=216562015-02-26T18:40:49ZBryan Coscabcosca@curii.com
<ul></ul><p><a href="https://arvadosapi.com/qr1hi-4zz18-yrocuypi31jo3ru">qr1hi-4zz18-yrocuypi31jo3ru</a> and <a href="https://arvadosapi.com/qr1hi-4zz18-h03wet37z45z2c9">qr1hi-4zz18-h03wet37z45z2c9</a> are known culprits</p> Arvados - Bug #5319: [API] Portable data hash mismatches for collections with location hints (+K)https://dev.arvados.org/issues/5319?journal_id=216582015-02-26T19:21:43ZBrett Smithbrett.smith@curii.com
<ul></ul><p>The affected collections have manifests with Keep location hints in them (<code>+Kqr1hi</code>), and the portable data hash corresponds to a manifest text including those hints. Now that API server disregards all hints when calculating a portable data hash, a straightforward copy fails.</p> Arvados - Bug #5319: [API] Portable data hash mismatches for collections with location hints (+K)https://dev.arvados.org/issues/5319?journal_id=216702015-02-26T22:52:55ZBrett Smithbrett.smith@curii.com
<ul></ul><p>I believe <a href="https://arvados.org/projects/arvados/repository/revisions/75a07d8175c9ee9a1c212cd1abe1c0d64a91cfc7/diff/services/api/app/models/collection.rb" class="external">this change is our culprit</a>, specifically on old line 87/new line 106. That's validation-related code that changed from using self.manifest_text to portable_manifest_text. portable_manifest_text strips all hints, so that would explain the difference.</p>
<p>Questions:</p>
<ul>
<li>What's the right policy here? Should the PDH correspond to the text without any hints, or just without access tokens? Tom made this change, so I'm guessing it's the former…</li>
<li>Once we figure that out, do we need a migration to correct the PDHs of older collections?</li>
</ul> Arvados - Bug #5319: [API] Portable data hash mismatches for collections with location hints (+K)https://dev.arvados.org/issues/5319?journal_id=217052015-02-27T20:05:41ZBrett Smithbrett.smith@curii.com
<ul><li><strong>Subject</strong> changed from <i>[Workbench] Can't copy collection between projects due to portable data hash mismatch</i> to <i>[API] Portable data hash mismatches for collections with location hints (+K)</i></li><li><strong>Category</strong> changed from <i>Workbench</i> to <i>API</i></li></ul> Arvados - Bug #5319: [API] Portable data hash mismatches for collections with location hints (+K)https://dev.arvados.org/issues/5319?journal_id=217232015-03-02T15:07:53ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>+1 to doing a migration. It should probably include updating the PDH of the collection in pipelines templates.</p> Arvados - Bug #5319: [API] Portable data hash mismatches for collections with location hints (+K)https://dev.arvados.org/issues/5319?journal_id=217512015-03-02T19:16:19ZTom Cleggtom@curii.com
<ul></ul><p>Yes, PDH should stay the same when the hints change (e.g., when the data comes and goes from different storage systems, and when the manifest is copied between instances).</p>
<p>As for cleaning up the existing incorrect PDH values:</p>
<p>A migration that <em>changes</em> the PDH values would make the old/broken PDH unfindable, which means either editing evidence about existing jobs and pipeline instances (yuck) or making them un-repeatable (also yuck, although less so). Perhaps we can make copies of the affected collections instead, so we use the new PDH for future work but the old ones can still be looked up. The remaining challenge would be to prevent the old/bogus copies from showing up as confusing duplicates in searches.</p>
<p>(A "copy" API call would help address this (and, perhaps more importantly, <a class="issue tracker-1 status-3 priority-4 priority-default closed parent" title="Bug: [Workbench] Avoid passing manifest_text around the network too much when updating and merging lar... (Resolved)" href="https://dev.arvados.org/issues/5096">#5096</a>), but seems like excessive creep to do here, and doesn't address the need for a migration to fix the bogus PDH values.)</p> Arvados - Bug #5319: [API] Portable data hash mismatches for collections with location hints (+K)https://dev.arvados.org/issues/5319?journal_id=217582015-03-02T20:12:43ZBrett Smithbrett.smith@curii.com
<ul></ul><p>Tom Clegg wrote:</p>
<blockquote>
<p>A migration that <em>changes</em> the PDH values would make the old/broken PDH unfindable, which means either editing evidence about existing jobs and pipeline instances (yuck) or making them un-repeatable (also yuck, although less so). Perhaps we can make copies of the affected collections instead, so we use the new PDH for future work but the old ones can still be looked up. The remaining challenge would be to prevent the old/bogus copies from showing up as confusing duplicates in searches.</p>
</blockquote>
<p>The longer-term plan is for Workbench to ignore objects that expire in the future (showing them only in a Trash), right? If that's the case, what if the migration set expires_at for the affected Collections far in the future, like 2038? That way they'd still be easy to find for jobs, and we have a way to hide them in Workbench that's consistent with other plans.</p> Arvados - Bug #5319: [API] Portable data hash mismatches for collections with location hints (+K)https://dev.arvados.org/issues/5319?journal_id=217772015-03-02T22:11:01ZTom Cleggtom@curii.com
<ul></ul><p>Brett Smith wrote:</p>
<blockquote>
<p>The longer-term plan is for Workbench to ignore objects that expire in the future (showing them only in a Trash), right? If that's the case, what if the migration set expires_at for the affected Collections far in the future, like 2038? That way they'd still be easy to find for jobs, and we have a way to hide them in Workbench that's consistent with other plans.</p>
</blockquote>
<p>Yes, that sounds like it should work nicely.</p> Arvados - Bug #5319: [API] Portable data hash mismatches for collections with location hints (+K)https://dev.arvados.org/issues/5319?journal_id=218772015-03-05T14:38:36ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>General comment: my understanding is that best practices for migrations to modify the tables directly using SQL, instead of using models, since the model code can change to no longer match the assumptions at the time the migration was written. However, since most of the time the migrations are run promptly, this hasn't been much of an issue in practice.</p>
<ul>
<li>Since existing Collection records are getting updated, perhaps it should be writing updates to the log table?</li>
<li>Why does each_bad_collection search for tags matching <code>\+[B-Z]</code> and not <code>\+[A-Z]</code>?</li>
<li>Have you tried this out on a test database?</li>
</ul> Arvados - Bug #5319: [API] Portable data hash mismatches for collections with location hints (+K)https://dev.arvados.org/issues/5319?journal_id=219752015-03-06T16:37:55ZBrett Smithbrett.smith@curii.com
<ul></ul><p>Peter Amstutz wrote:</p>
<blockquote>
<p>General comment: my understanding is that best practices for migrations to modify the tables directly using SQL, instead of using models, since the model code can change to no longer match the assumptions at the time the migration was written. However, since most of the time the migrations are run promptly, this hasn't been much of an issue in practice.</p>
</blockquote>
<p>I have reworked the migration to <a href="http://guides.rubyonrails.org/v3.2.21/migrations.html#using-models-in-your-migrations" class="external">follow the Rails Guide's advice about this</a>.</p>
<blockquote>
<ul>
<li>Since existing Collection records are getting updated, perhaps it should be writing updates to the log table?</li>
</ul>
</blockquote>
<p>Sure, added.</p>
<blockquote>
<ul>
<li>Why does each_bad_collection search for tags matching <code>\+[B-Z]</code> and not <code>\+[A-Z]</code>?</li>
</ul>
</blockquote>
<p>The API server has been pretty good for a while now about making sure that manifests don't get saved with permission hints. But it can't hurt to be sure, so changed.</p>
<blockquote>
<ul>
<li>Have you tried this out on a test database?</li>
</ul>
</blockquote>
<p>Yes. The easiest way I've found is to add this test collection fixture:</p>
<pre>with_location_hints:
owner_uuid: <a href="https://arvadosapi.com/zzzzz-tpzed-xurymjxw79nv3jz">zzzzz-tpzed-xurymjxw79nv3jz</a>
created_at: 2014-02-03 17:22:54.000000000 Z
modified_by_client_uuid: <a href="https://arvadosapi.com/zzzzz-ozdt8-brczlopd8u8d0jr">zzzzz-ozdt8-brczlopd8u8d0jr</a>
modified_by_user_uuid: <a href="https://arvadosapi.com/zzzzz-tpzed-d9tiejq69daie8f">zzzzz-tpzed-d9tiejq69daie8f</a>
modified_at: 2014-02-03 17:22:54.000000000 Z
portable_data_hash: 3eab909704896fd5e1c70cd135ab6b8d+52
updated_at: 2014-02-03 17:22:54.000000000 Z
uuid: <a href="https://arvadosapi.com/bcsdv-4zz18-6pxgxibl4vmyt09">bcsdv-4zz18-6pxgxibl4vmyt09</a>
manifest_text: |
. 37b51d194a7513e45b56f6524f2d51f2+3+Kfhqwh 0:3:bar
name: manifest with location hints
description:
properties: {}
</pre>
<p>With that in place, you can <code>db:fixtures:load</code>, then <code>db:migrate</code> and <code>db:rollback</code> and observe the results.</p>
<p>Now at <a class="changeset" title="5319: Fixup migration UUID generation code." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/cea7dc41b15938bd7d2fb3851b7b7c434fea2ed0">cea7dc4</a>. Thanks.</p> Arvados - Bug #5319: [API] Portable data hash mismatches for collections with location hints (+K)https://dev.arvados.org/issues/5319?journal_id=220712015-03-09T13:12:18ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>I noticed that it doesn't actually update the contents of <code>manifest_text</code>. Is that intentional or an oversight?</p>
<p><code>modified_at</code> also isn't updated, and the copied collection has <code>created_at</code> copied instead of set to now. This might be the right thing to do, but I'd like to hear your rationale.</p>
<p>Everything else looks good.</p> Arvados - Bug #5319: [API] Portable data hash mismatches for collections with location hints (+K)https://dev.arvados.org/issues/5319?journal_id=220792015-03-09T15:29:45ZBrett Smithbrett.smith@curii.com
<ul></ul><p>Peter Amstutz wrote:</p>
<blockquote>
<p>I noticed that it doesn't actually update the contents of <code>manifest_text</code>. Is that intentional or an oversight?</p>
</blockquote>
<p>Intentional: that's what the API server is currently doing. See <code>strip_manifest_text</code> in the Collection model: the only hints it strips are signatures. You can also see <a href="https://arvadosapi.com/qr1hi-4zz18-cs8ezglaomoym37">qr1hi-4zz18-cs8ezglaomoym37</a>, which is the collection created by creating a "new" collection with all the same files as the original: its manifest text still has the hints. And this makes sense to me. This way, the portable data hash stays the same as long as the data does, but we retain hint information as much as possible to help find that data faster.</p>
<blockquote>
<p><code>modified_at</code> also isn't updated, and the copied collection has <code>created_at</code> copied instead of set to now. This might be the right thing to do, but I'd like to hear your rationale.</p>
</blockquote>
<p>I guess the question boils down to, should those timestamps be true to the literal database record, or should they be true to the underlying collection? From the database perspective, I think your ideas are the right ones. I was thinking about it more at the collection level: the <em>collection</em> isn't new or changed, we've just patched up how we address it in the database. And so the logic reflects that.</p>
<p>To the extent that this is a bugfix migration that we're trying to keep invisible to the user (and admittedly, that's still not very invisible), I think this approach supports that goal. Imagine a user doing API requests for collections ordered by one of these timestamps descending. I think it would surprise them to see "old" collections suddenly spring up to the top of those results. We can't hide the second record from them, but I think this approach puts it in a context that can make it easier to see where it's coming from.</p>
<p>Thanks.</p> Arvados - Bug #5319: [API] Portable data hash mismatches for collections with location hints (+K)https://dev.arvados.org/issues/5319?journal_id=220812015-03-09T15:31:29ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>Thanks. LGTM</p> Arvados - Bug #5319: [API] Portable data hash mismatches for collections with location hints (+K)https://dev.arvados.org/issues/5319?journal_id=220822015-03-09T15:40:09ZBrett Smithbrett.smith@curii.com
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Resolved</i></li><li><strong>% Done</strong> changed from <i>0</i> to <i>100</i></li></ul><p>Applied in changeset arvados|commit:19656a41f019488120f950b06ecf9e19074b11a3.</p> Arvados - Bug #5319: [API] Portable data hash mismatches for collections with location hints (+K)https://dev.arvados.org/issues/5319?journal_id=222092015-03-11T16:51:40ZWard Vandewegeward@curii.com
<ul><li><strong>Story points</strong> set to <i>0.5</i></li></ul>