https://dev.arvados.org/https://dev.arvados.org/favicon.ico?15576888422016-04-05T07:47:07ZArvadosArvados - Bug #8878: Keep: sudden appearance of "missing" blockshttps://dev.arvados.org/issues/8878?journal_id=373782016-04-05T07:47:07ZPeter Grandipg@arvados.for.sabi.co.uk
<ul></ul><p>Oops, I forgot this important detail: the dry-run report of Data Manager just after the garbage collection:</p>
<pre>
2016/03/24 17:45:21 Read and processed 417 collections
2016/03/24 17:45:22 Blocks In Collections: 514668,
Blocks In Keep: 514668.
2016/03/24 17:45:22 Replication Block Counts:
Missing From Keep: 0,
Under Replicated: 0,
Over Replicated: 1650,
Replicated Just Right: 513018,
Not In Any Collection: 0.
Replication Collection Counts:
Missing From Keep: 0,
Under Replicated: 0,
Over Replicated: 11,
Replicated Just Right: 406.
2016/03/24 17:45:22 Blocks Histogram:
2016/03/24 17:45:22 {Requested:1 Actual:1}: 513018
2016/03/24 17:45:22 {Requested:1 Actual:2}: 1647
2016/03/24 17:45:22 {Requested:1 Actual:3}: 3
</pre>
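<p>As a cross-check, the histogram lines of the report should add up to the "Blocks In Keep" figure, and the entries where Actual exceeds Requested should add up to "Over Replicated". A minimal sketch of that check, assuming the report text is available as a string; the function name <code>check_report</code> is made up for illustration and is not part of Data Manager:</p>

```python
import re

def check_report(text):
    """Return (blocks_in_keep, histogram_total, over_replicated_from_histogram)."""
    in_keep = int(re.search(r"Blocks In Keep: (\d+)", text).group(1))
    # Histogram lines look like "{Requested:1 Actual:2}: 1647".
    hist = {
        (int(req), int(act)): int(count)
        for req, act, count in re.findall(
            r"\{Requested:(\d+) Actual:(\d+)\}: (\d+)", text)
    }
    total = sum(hist.values())
    over = sum(n for (req, act), n in hist.items() if act > req)
    return in_keep, total, over

# Relevant lines copied from the report above.
report = """\
2016/03/24 17:45:22 Blocks In Collections: 514668,
Blocks In Keep: 514668.
Over Replicated: 1650,
2016/03/24 17:45:22 {Requested:1 Actual:1}: 513018
2016/03/24 17:45:22 {Requested:1 Actual:2}: 1647
2016/03/24 17:45:22 {Requested:1 Actual:3}: 3
"""

print(check_report(report))  # (514668, 514668, 1650)
```

<p>Both totals agree with the summary lines, so this report is internally consistent.</p>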
<p>Between then and now we have (almost) only uploaded 6x ~4TiB data collections, two of them twice (so 8x uploads) and then deleted the earlier uploads of the two that were twice uploaded.<br />IIRC.</p> Arvados - Bug #8878: Keep: sudden appearance of "missing" blockshttps://dev.arvados.org/issues/8878?journal_id=373792016-04-05T08:23:06ZPeter Grandipg@arvados.for.sabi.co.uk
<ul></ul><p>So sequence of (almost all) events:</p>
<ul>
<li>We have a 30TiB Keep with around 25TiB of data, we add another 30TiB, we upload an 8TiB collection.</li>
<li>We then upload a 24TiB collection, and that registers as 100% complete but registering the collection manifest fails because of out-of-RAM.</li>
<li>At this point the 60TiB total is almost full.</li>
<li>We try again to upload the 24TiB data set as 6x 4TiB collections, counting on the hash blocks being reused. But that does not happen much, so we get to 100% space used, with around 24TiB in unreferenced blocks as expected. Of the 6 collections, 4 report failure to upload because of out-of-space, and 2 (collections 2 and 3) report a successful upload, which is strange, as we did not quite have 8TiB free.</li>
<li>Run the Data Manager in dry-run mode, and some local scripts, and roughly 1.2TiB of blocks are reported as unnecessary duplicates. I write a script to delete the relevant files and the Data Manager in dry-run mode reports no problems. The 1.2TiB free allows us to continue using Keep for some test jobs.</li>
<li>After consultation it turns out that hashes are computed on a bytestream, not per-file, by <code>arv-put</code>, so the re-upload in 6 subsets resulted in different hashes (except for the first I guess).</li>
<li>So I run the Data Manager in garbage collection mode and that seems successful. It seems to me to free up a bit too much space, but then we had deleted some collections over the past few months.</li>
<li>So I run the Data Manager in dry-run mode and the report looks good.</li>
<li>We re-upload subsets 1, 4, 5, 6 of the 6x and all report successfully completed and registered.</li>
<li>We re-upload subsets 2 and 3 to slightly differently named collections. Curiously, even though the list of files passed to <code>arv-put</code> is exactly the same as that for the previous upload, this consumes some space.</li>
<li>We delete the collections that were the original uploads of subsets 2 and 3. Our server space graphs don't change.</li>
<li>The Data Manager in dry-run mode reports 4TiB of missing blocks and 2.5TiB of duplicate blocks.</li>
</ul>
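<p>To illustrate why hashing a bytestream rather than individual files defeats deduplication when the file list changes, here is a toy model. It assumes blocks are cut at fixed offsets of the concatenated stream and identified by an MD5 digest; the tiny block size and the helper name <code>block_hashes</code> are made up for the example, and this is not <code>arv-put</code>'s actual code:</p>

```python
import hashlib

def block_hashes(files, block_size):
    """Concatenate file contents into one stream and hash fixed-size blocks of it."""
    stream = b"".join(files)
    return [
        hashlib.md5(stream[i:i + block_size]).hexdigest()
        for i in range(0, len(stream), block_size)
    ]

# Three stand-in "files" of 10 bytes each, and an 8-byte block size
# (the real block size is 64MiB; small numbers keep the example readable).
a, b, c = b"A" * 10, b"B" * 10, b"C" * 10
BLOCK = 8

whole = block_hashes([a, b, c], BLOCK)   # original single upload
first = block_hashes([a, b], BLOCK)      # first subset: starts at the same offset
later = block_hashes([b, c], BLOCK)      # later subset: block boundaries are shifted

print(len(set(whole) & set(first)))  # 2
print(len(set(whole) & set(later)))  # 0
```

<p>In this model only the subset that starts at the same stream offset shares blocks with the original upload, which is consistent with the guess above that only the first subset would have been deduplicated.</p>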
<p>Related issues: <a class="external" href="https://dev.arvados.org/issues/8769">https://dev.arvados.org/issues/8769</a> <a class="external" href="https://dev.arvados.org/issues/8867">https://dev.arvados.org/issues/8867</a></p>
<p>I am about to upload a free-space graph for the relevant server volumes that shows the various phases.</p>
<p>Note: some trivial formatting mistakes fixed 2016-04-08.</p> Arvados - Bug #8878: Keep: sudden appearance of "missing" blockshttps://dev.arvados.org/issues/8878?journal_id=373802016-04-05T08:24:24ZPeter Grandipg@arvados.for.sabi.co.uk
<ul><li><strong>File</strong> <a href="/attachments/1115">160329_arvDiskFree.png</a> <a class="icon-only icon-download" title="Download" href="/attachments/download/1115/160329_arvDiskFree.png">160329_arvDiskFree.png</a> added</li></ul><p>Free space graph 2016-03-29 (a week ago, just after Easter).</p> Arvados - Bug #8878: Keep: sudden appearance of "missing" blockshttps://dev.arvados.org/issues/8878?journal_id=375692016-04-07T19:33:55ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>Hi Peter,</p>
<blockquote>
<p>collections 2 and 3 report a successful upload, which is strange, as we did not quite have 8TiB free.</p>
</blockquote>
<p>This sounds like collections 2 and 3 have more deduplication than the other batches. (<a class="issue tracker-6 status-7 priority-4 priority-default closed" title="Idea: [SDK] CollectionWriter file packing causes sub-optimal deduplication (Duplicate)" href="https://dev.arvados.org/issues/8791">#8791</a> is the story to improve the deduplication behavior generally).</p>
<blockquote>
<p>We re-upload subsets 2 and 3 to slightly differently named collections.</p>
</blockquote>
<p>Do you mean just "arv-put --name" is different?</p>
<blockquote>
<p>We delete the collections that were the original uploads of subsets 2 and 3. Our server space graphs don't change.</p>
</blockquote>
<p>When you write "delete the collections" do you mean using the "trash" button in Workbench, using "arv collection delete", or are you directly deleting the collection record in the Postgres console?</p>
<p>Does "delete the collections" include running Data Manager garbage collection and/or deleting blocks using your own script?</p>
<p>I'm trying to understand the sequence of events better:</p>
<ol>
<li>You "delete the collections" (assuming this does not include running data manager GC or any other scripts that would delete actual data)</li>
<li>"server space graphs don't change" which also suggests that nothing changed on disk</li>
<li>running data manager immediately afterwards is reporting "Missing From Keep" blocks when you would actually expect "Not In Any Collection"?</li>
</ol>
<p>Are you suggesting that files are reported missing when nothing was physically deleted? That would be extremely puzzling. If that is the case, the most likely reason would be if data manager failed to contact one of the keep servers to get a block index, but instead of failing with an error it just went ahead. However, that wouldn't explain why the over replicated block count would go up as well.</p>
<p>Does data manager return the same results over multiple runs?</p>
<blockquote>
<p>A question I have is whether there is a tool that can tell me which collections and files within them have missing hashes. I think that I can easily modify some of my scripts to that purpose, so I would like to know if there is a tool that I can use as a double check.</p>
</blockquote>
<p>Data Manager has most of that information, but doesn't currently have the feature to directly report missing blocks and tell you exactly which collections and files are affected.</p>
<blockquote>
<p>The other question is whether I can run with Data Manager further consistency checks, for example as to verifying the hashes of the data blocks.</p>
</blockquote>
<p>This is being worked on right now: <a class="issue tracker-6 status-3 priority-4 priority-default closed parent" title="Idea: [Keep] Block validation script (Resolved)" href="https://dev.arvados.org/issues/8724">#8724</a></p> Arvados - Bug #8878: Keep: sudden appearance of "missing" blockshttps://dev.arvados.org/issues/8878?journal_id=375822016-04-07T23:24:10ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>Also, this whole mess could have been avoided if arv-put saved the manifest text from failed collection create API call to facilitate forensic recovery. I've added a story for this at <a class="issue tracker-1 status-3 priority-4 priority-default closed" title="Bug: [SDK] arv-put should save manifest text on API error (Resolved)" href="https://dev.arvados.org/issues/8910">#8910</a></p> Arvados - Bug #8878: Keep: sudden appearance of "missing" blockshttps://dev.arvados.org/issues/8878?journal_id=375902016-04-08T08:34:00ZPeter Grandipg@arvados.for.sabi.co.uk
<ul></ul><blockquote><blockquote>
<p>collections 2 and 3 report a successful upload, which is strange, as we did not quite have 8TiB free.</p>
</blockquote>
<p>This sounds like collections 2 and 3 have more deduplication than the other batches.</p>
</blockquote>
<p>I don't know what the chances of that happening are, because the 25TB dataset is "DDD", which is 3,000 ".bam" files plus their ".bai" files, and <code>file</code> tells me that they are gzipped. Also, since they are uploaded as "bytestreams" even the same data at the same offset in the file might have different hashes because of different offsets in the "bytestream". Then there is the chance that two compressed-data blocks of 64MiB have by coincidence the same hash.</p>
<blockquote><blockquote>
<p>We re-upload subsets 2 and 3 to slightly differently named collections.</p>
</blockquote>
<p>Do you mean just "arv-put --name" is different?</p>
</blockquote>
<p>Yes, identical command lines from the same shell history with "_2b" and "_3b" instead of "_2" and "_3" as name suffix.</p>
<blockquote><blockquote>
<p>We delete the collections that were the original uploads of subsets 2 and 3. Our server space graphs don't change.</p>
</blockquote>
<p>When you write "delete the collections" do you mean using the "trash" button in Workbench</p>
</blockquote>
<p>Workbench trash.</p>
<blockquote>
<p>Does "delete the collections" include running Data Manager garbage collection and/or deleting blocks using your own script?</p>
</blockquote>
<p>No, the space graph should show that.</p>
<blockquote>
<p>I'm trying to understand the sequence of events better:</p>
</blockquote>
<blockquote>
<ol>
<li>You "delete the collections" (assuming this does not include running data manager GC or any other scripts that would delete actual data)</li>
</ol>
</blockquote>
<p>Yes, just Workbench trash. Unfortunately, not expecting anything strange, I did not run Data Manager in dry-run mode just before that. I ran it just after that to check whether there were any unreferenced block hashes, which I did not expect.</p>
<blockquote>
<ol>
<li>"server space graphs don't change" which also suggests that nothing changed on disk</li>
</ol>
</blockquote>
<p>That's what I suppose. But the space taken by the collections on disk is lower than what I would expect; the attached space graph should show that.</p>
<blockquote>
<ol>
<li>running data manager immediately afterwards is reporting "Missing From Keep" blocks when you would actually expect "Not In Any Collection"?</li>
</ol>
</blockquote>
<p>Reports ~4TB (65,760 blocks) "Missing from Keep", and ~2.5TB (41,177 blocks) of duplicates. I can imagine the duplicates happen because of placement issues during upload. It is the 4TB of missing blocks that perplex me, as between the garbage collection on the 24th and now we have only done uploads and no garbage collections. The attached space graph shows that.</p>
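<p>The sizes quoted for these block counts can be sanity-checked with a little arithmetic. This sketch assumes every block is a full 64MiB, so it gives an upper bound (the last block of a stream is usually smaller); the helper name is made up:</p>

```python
BLOCK_MIB = 64  # block size mentioned earlier in this thread

def blocks_to_tib(n_blocks, block_mib=BLOCK_MIB):
    """Upper-bound size in TiB of n_blocks full-size blocks."""
    return n_blocks * block_mib / (1024 * 1024)

print(round(blocks_to_tib(65760), 2))  # 4.01  (missing blocks)
print(round(blocks_to_tib(41177), 2))  # 2.51  (over-replicated blocks)
```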
<blockquote>
<p>Are you suggesting that files are reported missing when nothing was physically deleted?</p>
</blockquote>
<p>Yes.</p>
<blockquote>
<p>That would be extremely puzzling. If that is the case, the most likely reason would be if data manager failed to contact one of the keep servers to get a block index, but instead of failing with an error it just went ahead. However, that wouldn't explain why the over replicated block count would go up as well.</p>
</blockquote>
<p>That is one of the possibilities that occurred to me too. So my plan, which has not happened yet, was to match the manifests in the SQL database with the list of hash files in the keepstores, which I gathered using 'find' on each of them.</p>
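<p>A rough sketch of that matching, under stated assumptions: manifest lines consist of a stream name followed by block locators of the form <code>hash+size</code> and then file tokens, and each keepstore block is a file whose name is the 32-character hash. The function names and the sample path are hypothetical:</p>

```python
import os
import re

LOCATOR = re.compile(r"^([0-9a-f]{32})\+(\d+)")
HASH_NAME = re.compile(r"[0-9a-f]{32}")

def manifest_hashes(manifest_text):
    """Collect the block hashes referenced by a manifest."""
    hashes = set()
    for line in manifest_text.splitlines():
        for token in line.split()[1:]:  # the first token is the stream name
            m = LOCATOR.match(token)
            if m:  # file tokens such as "0:123:name" do not match
                hashes.add(m.group(1))
    return hashes

def stored_hashes(find_lines):
    """Collect block hashes from 'find' output, one path per line."""
    hashes = set()
    for path in find_lines:
        name = os.path.basename(path.strip())
        if HASH_NAME.fullmatch(name):
            hashes.add(name)
    return hashes

def missing_blocks(manifest_text, find_lines):
    """Hashes referenced by the manifest but absent from every keepstore listing."""
    return manifest_hashes(manifest_text) - stored_hashes(find_lines)

# Illustrative data only; the path below is invented.
manifest = (". c9804a232cd8802a4315a879d701c6f2+3307667 "
            "d41d8cd98f00b204e9800998ecf8427e+0 0:3307667:example.bam\n")
listing = ["/data/keep0/c98/c9804a232cd8802a4315a879d701c6f2\n"]

print(missing_blocks(manifest, listing))
```

<p>The listings from all keepstores would be concatenated before the comparison, since a block only counts as missing if no server holds it.</p>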
<p>My current best "hunch" is that the problem happened during the upload of the original "_2" and "_3" collections, when the Keepstore filetrees were 100% full. Perhaps the <code>keepstore</code> daemon does not handle that too well.</p>
<blockquote>
<p>Does data manager return the same results over multiple runs?</p>
</blockquote>
<p>Yes.</p>
<blockquote><blockquote>
<p>A question I have is whether there is a tool that can tell me which collections and files within them have missing hashes. I think that I can easily modify some of my scripts to that purpose, so I would like to know if there is a tool that I can use as a double check.</p>
</blockquote></blockquote>
<blockquote>
<p>Data Manager has most of that information, but doesn't currently have the feature to directly report missing blocks and tell you exactly which collections and files are affected.</p>
</blockquote>
<p>A "-verbose" flag would be welcome.</p>
<blockquote><blockquote>
<p>The other question is whether I can run with Data Manager further consistency checks, for example as to verifying the hashes of the data blocks.</p>
</blockquote></blockquote>
<blockquote>
<p>This is being worked on right now: <a class="issue tracker-6 status-3 priority-4 priority-default closed parent" title="Idea: [Keep] Block validation script (Resolved)" href="https://dev.arvados.org/issues/8724">#8724</a></p>
</blockquote>
<p>As to that, the fact that block hashes are computed on a "bytestream" rather than on a per-file basis in <code>arv-put</code> by default (which cannot be changed) is a bit disappointing, because it would be nice to be able to compute similar block hashes outside Keep and double check them.</p> Arvados - Bug #8878: Keep: sudden appearance of "missing" blockshttps://dev.arvados.org/issues/8878?journal_id=376662016-04-10T20:34:13ZPeter Amstutzpeter.amstutz@curii.com
<ul><li><strong>File</strong> <a href="/attachments/1124">datamanager</a> <a class="icon-only icon-download" title="Download" href="/attachments/download/1124/datamanager">datamanager</a> added</li><li><strong>File</strong> <a href="/attachments/1123">keep_block_to_file.py</a> <a class="icon-only icon-download" title="Download" href="/attachments/download/1123/keep_block_to_file.py">keep_block_to_file.py</a> added</li></ul><p>To help you recover while we continue to try and diagnose the underlying bug, I've added some additional reporting to datamanager, along with an auxiliary python script. These are in the <a href="https://dev.arvados.org/projects/arvados/repository?rev=8912-missing-blocks-report" class="external">8912-missing-blocks-report branch</a> and for your convenience I've attached a binary of <code>datamanager</code> and a copy of the python script <code>keep_block_to_file.py</code> to this ticket.</p>
<p>Running <code>datamanager -dry-run -extra-reports</code> will produce some timestamped files named <code>timestamp_uuid_index.txt</code> and <code>timestamp_uuid_missing.txt</code>. The former contain the indexes returned by each keepstore to datamanager; the latter list the collections with missing blocks.</p>
<p>You then use <code>keep_block_to_file.py *_missing.txt</code> to get the list of specific files within each collection which have missing blocks.</p> Arvados - Bug #8878: Keep: sudden appearance of "missing" blockshttps://dev.arvados.org/issues/8878?journal_id=376702016-04-11T10:11:49ZPeter Grandipg@arvados.for.sabi.co.uk
<ul><li><strong>File</strong> <a href="/attachments/1125">160404_arvDiskFreeNotes.png</a> <a class="icon-only icon-download" title="Download" href="/attachments/download/1125/160404_arvDiskFreeNotes.png">160404_arvDiskFreeNotes.png</a> added</li></ul><p>Attached a copy of the freespace graph with notes.</p> Arvados - Bug #8878: Keep: sudden appearance of "missing" blockshttps://dev.arvados.org/issues/8878?journal_id=376712016-04-11T13:41:01ZPeter Grandipg@arvados.for.sabi.co.uk
<ul></ul><p>Just ran the extended <code>datamanager</code>:</p>
<pre>
manager@hebrides:~$ ls -ld *index.txt *missing.txt
-rw-rw-r-- 1 manager manager 815976 Apr 11 13:34 2016-04-11T13:34:34Z_gcam1-4zz18-2ob6abc4jcp8bzy_missing.txt
-rw-rw-r-- 1 manager manager 1131438 Apr 11 13:34 2016-04-11T13:34:34Z_gcam1-4zz18-idu3dl9xvky1rke_missing.txt
-rw-rw-r-- 1 manager manager 814506 Apr 11 13:34 2016-04-11T13:34:34Z_gcam1-4zz18-u88ijgfjz8y5cjh_missing.txt
-rw-rw-r-- 1 manager manager 2363775 Apr 11 13:34 2016-04-11T13:34:34Z_gcam1-bi6l4-1j5ilfcaic0ude8_index.txt
-rw-rw-r-- 1 manager manager 4140699 Apr 11 13:34 2016-04-11T13:34:34Z_gcam1-bi6l4-3bbqgnonzaux8k0_index.txt
-rw-rw-r-- 1 manager manager 4179572 Apr 11 13:34 2016-04-11T13:34:34Z_gcam1-bi6l4-cga8vx6ihr6rq8a_index.txt
-rw-rw-r-- 1 manager manager 3781747 Apr 11 13:34 2016-04-11T13:34:34Z_gcam1-bi6l4-h8s8tzzsldt1kqh_index.txt
-rw-rw-r-- 1 manager manager 3784458 Apr 11 13:34 2016-04-11T13:34:34Z_gcam1-bi6l4-l3i1sdlfr38jjtn_index.txt
-rw-rw-r-- 1 manager manager 2354311 Apr 11 13:34 2016-04-11T13:34:34Z_gcam1-bi6l4-pkp2i499ekxrrtk_index.txt
-rw-rw-r-- 1 manager manager 2371703 Apr 11 13:34 2016-04-11T13:34:34Z_gcam1-bi6l4-sztrsz43hs9w28d_index.txt
-rw-rw-r-- 1 manager manager 2367525 Apr 11 13:34 2016-04-11T13:34:34Z_gcam1-bi6l4-tkk9ygrftp95d8r_index.txt
-rw-rw-r-- 1 manager manager 2354036 Apr 11 13:34 2016-04-11T13:34:34Z_gcam1-bi6l4-vr8fxkbqu84w2um_index.txt
-rw-rw-r-- 1 manager manager 4133209 Apr 11 13:34 2016-04-11T13:34:34Z_gcam1-bi6l4-z7kt05yz1ng7f5w_index.txt
</pre><br /><pre>
manager@hebrides:~$ wc -l *missing.txt
19428 2016-04-11T13:34:34Z_gcam1-4zz18-2ob6abc4jcp8bzy_missing.txt
26939 2016-04-11T13:34:34Z_gcam1-4zz18-idu3dl9xvky1rke_missing.txt
19393 2016-04-11T13:34:34Z_gcam1-4zz18-u88ijgfjz8y5cjh_missing.txt
65760 total
</pre> Arvados - Bug #8878: Keep: sudden appearance of "missing" blockshttps://dev.arvados.org/issues/8878?journal_id=376722016-04-11T13:48:16ZPeter Grandipg@arvados.for.sabi.co.uk
<ul></ul><p>I had expected "_2b" and "_3b" (renamed to "-103001-104929" and "-104930-106319") but it is instead "_1" (renamed to "-100004-102995") and "_4" and "_5":</p>
<pre> uuid | name
-----------------------------+---------------------------------------
<a href="https://arvadosapi.com/gcam1-4zz18-u88ijgfjz8y5cjh">gcam1-4zz18-u88ijgfjz8y5cjh</a> | DDD_WGS_EGAD00001001114_4
<a href="https://arvadosapi.com/gcam1-4zz18-2ob6abc4jcp8bzy">gcam1-4zz18-2ob6abc4jcp8bzy</a> | DDD_WGS_EGAD00001001114_5
<a href="https://arvadosapi.com/gcam1-4zz18-idu3dl9xvky1rke">gcam1-4zz18-idu3dl9xvky1rke</a> | DDD_WGS_EGAD00001001114-100004-102995
(3 rows)</pre>
<pre>arvados=# select uuid,name from collections where name like 'DDD%' order by name;
uuid | name
-----------------------------+---------------------------------------
<a href="https://arvadosapi.com/gcam1-4zz18-6m5xpktrmjsl9p1">gcam1-4zz18-6m5xpktrmjsl9p1</a> | DDD_VARINFO_2015-08-14
<a href="https://arvadosapi.com/gcam1-4zz18-idu3dl9xvky1rke">gcam1-4zz18-idu3dl9xvky1rke</a> | DDD_WGS_EGAD00001001114-100004-102995
<a href="https://arvadosapi.com/gcam1-4zz18-4lrd2smmw9u6yni">gcam1-4zz18-4lrd2smmw9u6yni</a> | DDD_WGS_EGAD00001001114-103001-104929
<a href="https://arvadosapi.com/gcam1-4zz18-gbf33ha80xmcxk6">gcam1-4zz18-gbf33ha80xmcxk6</a> | DDD_WGS_EGAD00001001114-104930-106319
<a href="https://arvadosapi.com/gcam1-4zz18-oytr46viwy7ikxj">gcam1-4zz18-oytr46viwy7ikxj</a> | DDD_WGS_EGAD00001001114_2
<a href="https://arvadosapi.com/gcam1-4zz18-0yjqcuawjvntzaq">gcam1-4zz18-0yjqcuawjvntzaq</a> | DDD_WGS_EGAD00001001114_3
<a href="https://arvadosapi.com/gcam1-4zz18-u88ijgfjz8y5cjh">gcam1-4zz18-u88ijgfjz8y5cjh</a> | DDD_WGS_EGAD00001001114_4
<a href="https://arvadosapi.com/gcam1-4zz18-2ob6abc4jcp8bzy">gcam1-4zz18-2ob6abc4jcp8bzy</a> | DDD_WGS_EGAD00001001114_5
<a href="https://arvadosapi.com/gcam1-4zz18-pxfnfhjpx218suk">gcam1-4zz18-pxfnfhjpx218suk</a> | DDD_WGS_EGAD00001001114_6
(9 rows)</pre> Arvados - Bug #8878: Keep: sudden appearance of "missing" blockshttps://dev.arvados.org/issues/8878?journal_id=376812016-04-11T14:20:08ZPeter Grandipg@arvados.for.sabi.co.uk
<ul></ul><p>On 2016-03-29 <code>arv-put</code> reported success with:</p>
<pre>librarian@biscay$ time arv-put --replication 1 --resume --project-uuid <a href="https://arvadosapi.com/gcam1-j7d0g-k25rlhe6ig8p9na">gcam1-j7d0g-k25rlhe6ig8p9na</a> --name DDD_WGS_EGAD00001001114_1 $(< ~/l1)
arv-put: Resuming previous upload from last checkpoint.
Use the --no-resume option to start over.
4282537M / 4282637M 100.0%
Collection saved as 'DDD_WGS_EGAD00001001114_1'
<a href="https://arvadosapi.com/gcam1-4zz18-idu3dl9xvky1rke">gcam1-4zz18-idu3dl9xvky1rke</a>
real 3439m58.024s
user 687m21.092s
sys 330m3.305s</pre>
<pre>librarian@sole$ time arv-put --replication 1 --resume --project-uuid <a href="https://arvadosapi.com/gcam1-j7d0g-k25rlhe6ig8p9na">gcam1-j7d0g-k25rlhe6ig8p9na</a> --name DDD_WGS_EGAD00001001114_4 $(< ~/l4)
arv-put: Resuming previous upload from last checkpoint.
Use the --no-resume option to start over.
4228156M / 4228192M 100.0%
Collection saved as 'DDD_WGS_EGAD00001001114_4'
<a href="https://arvadosapi.com/gcam1-4zz18-u88ijgfjz8y5cjh">gcam1-4zz18-u88ijgfjz8y5cjh</a>
real 3979m16.365s
user 864m45.066s
sys 382m57.657s</pre>
<pre>
2016-03-27 14:42:04 arvados.keep[13012] DEBUG: Request: PUT http://keep8.gcam1.camdc.genomicsplc.com:25107/c9804a232cd8802a4315a879d701c6f2
2016-03-27 14:42:04 arvados.keep[13012] INFO: PUT 200: 3307667 bytes in 23.0910778046 msec (136.608 MiB/sec)
2016-03-27 14:42:04 arvados.keep[13012] DEBUG: KeepWriterThread <KeepWriterThread(Thread-479423, started 139758258013952)> succeeded c9804a232cd8802a4315a879d701c6f2+3307667 http://keep8.gcam1.camdc.genomicsplc.com:25107/
4311611M / 4311683M 100.0%
Collection saved as 'DDD_WGS_EGAD00001001114_5'
<a href="https://arvadosapi.com/gcam1-4zz18-2ob6abc4jcp8bzy">gcam1-4zz18-2ob6abc4jcp8bzy</a>
real 4137m27.994s
user 867m54.495s
sys 364m37.592s</pre>
<pre>librarian@sole$ du -sm DDD*_[1-6]
4282638 DDD_WGS_EGAD00001001114_1
4286404 DDD_WGS_EGAD00001001114_2
4525616 DDD_WGS_EGAD00001001114_3
4228194 DDD_WGS_EGAD00001001114_4
4311684 DDD_WGS_EGAD00001001114_5
4321623 DDD_WGS_EGAD00001001114_6</pre> Arvados - Bug #8878: Keep: sudden appearance of "missing" blockshttps://dev.arvados.org/issues/8878?journal_id=376822016-04-11T14:41:32ZPeter Grandipg@arvados.for.sabi.co.uk
<ul></ul><p>The ~4,000 minutes reported for the <code>arv-put</code> runs above are nearly 3 days, and the uploads were started March 24th, so they finished during the weekend as the free space graph shows.</p> Arvados - Bug #8878: Keep: sudden appearance of "missing" blockshttps://dev.arvados.org/issues/8878?journal_id=376842016-04-11T14:52:41ZPeter Grandipg@arvados.for.sabi.co.uk
<ul><li><strong>File</strong> <a href="/attachments/1127">2016-04-11T13_34_34Z_filescount.txt</a> <a class="icon-only icon-download" title="Download" href="/attachments/download/1127/2016-04-11T13_34_34Z_filescount.txt">2016-04-11T13_34_34Z_filescount.txt</a> added</li></ul><pre>manager@hebrides:~$ python2 160410_keep_block_to_file.py *_missing.txt >| 2016-04-11T13:34:34Z_files.txt
manager@hebrides:~$ cut -d, -f 2 2016-04-11T13:34:34Z_files.txt | sort | uniq -c >| 2016-04-11T13:34:34Z_filescount.txt
manager@hebrides:~$ wc -l 2016-04-11T13:34:34Z_filescount.txt
1101 2016-04-11T13:34:34Z_filescount.txt
manager@hebrides:~$ wc -l 2016-04-11T13:34:34Z_files.txt
66858 2016-04-11T13:34:34Z_files.txt
</pre> Arvados - Bug #8878: Keep: sudden appearance of "missing" blockshttps://dev.arvados.org/issues/8878?journal_id=378802016-04-14T12:14:31ZPeter Grandipg@arvados.for.sabi.co.uk
<ul></ul><p>My impressions from a long IRC discussion:</p>
<ul>
<li><code>arv-put</code> by default uses option <code>--resume</code>, which means that it keeps a history of blocks that it has already uploaded in previous partial uploads.</li>
<li>The blocks in that list are deemed, when <code>--resume</code> is used, to be present on the Keep servers without any further checks.</li>
<li>If a garbage collection intervenes and those blocks are not shared with already registered manifests, they are going to be deleted, but <code>arv-put</code>, since it does not check, will not know.</li>
<li>There is a partial protection against mishaps in that each block in the list kept by <code>arv-put</code> is tagged by the garbage-collector TTL/delayed-delete time, so <code>arv-put</code> won't consider it present on Keep after that time.</li>
<li>The garbage-collector TTL/delayed-delete time is sort of the same as that described under permissions in: <a class="external" href="https://dev.arvados.org/projects/arvados/wiki/Keep_server#Permission">https://dev.arvados.org/projects/arvados/wiki/Keep_server#Permission</a>, that is <code>arv-put</code> will assume that a block is going to be available for as long as its permissions token lasts.</li>
<li>But if the garbage-collector delayed-delete time is reduced in-between a failed upload and its resumption, <code>arv-put</code> will believe the delayed-delete time associated with the permissions token at upload time, not the current delayed-delete time.</li>
</ul>
<p>While each aspect of the previous story makes sense on its own, my impression is that it means overall that Keep is no longer stateless/idempotent, but there is state outside it that might become stale and yet its currency is not verified. It is a classic case of treating a hint value as if it were a cached value.</p>
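<p>The difference between trusting the resume list and treating it as a hint can be sketched as follows. This is an illustration only, not <code>arv-put</code>'s code: the cache layout (locator mapped to an expiry time), the <code>block_exists</code> callback and the function name are all assumptions made for the example.</p>

```python
import time

def blocks_to_upload(all_blocks, resume_cache, block_exists, now=None):
    """Return the locators that still need uploading.

    resume_cache maps locator -> expiry timestamp recorded at upload time.
    block_exists asks the store whether the block is present right now, so a
    cache entry is only a hint: it lets us skip an upload only if confirmed.
    """
    now = time.time() if now is None else now
    todo = []
    for loc in all_blocks:
        expiry = resume_cache.get(loc)
        if expiry is not None and expiry > now and block_exists(loc):
            continue  # hint confirmed by the store, safe to skip
        todo.append(loc)
    return todo

# Toy data: "bbb" is in the resume cache but was garbage-collected meanwhile.
store = {"aaa", "ccc"}
cache = {"aaa": 200.0, "bbb": 200.0}

print(blocks_to_upload(["aaa", "bbb", "ccc"], cache, store.__contains__, now=100.0))
# ['bbb', 'ccc']
```

<p>Without the <code>block_exists</code> check, "bbb" would be skipped and the resulting manifest would reference a block that is no longer in the store, which is the failure described in this ticket.</p>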
<p>Overall I think that this is the result of attempting to optimize <code>arv-put</code> which is both a very critical tool, and has some subtle state and logic inside. If there was a way to treat the <code>--resume</code> list as a hint to be verified it would be safer.</p> Arvados - Bug #8878: Keep: sudden appearance of "missing" blockshttps://dev.arvados.org/issues/8878?journal_id=378812016-04-14T12:22:24ZPeter Grandipg@arvados.for.sabi.co.uk
<ul></ul><p>Anyhow, the story above seems to validate my impression that the missing data had not been lost but was never uploaded: the free space graphs showed unlikely amounts of deduplication, and the Data Manager showed a good state just after garbage collection. Eventually the list of files with missing blocks showed that they were all at the beginning of the list of files for each sub-collection.</p>
<p>So a re-re-re-upload with <code>arv-put --no-resume</code> (to be sure) and exactly the same file list as previously used (because of <a class="issue tracker-1 status-3 priority-4 priority-default closed" title="Bug: re-upload seems to consume a lot of space (Resolved)" href="https://dev.arvados.org/issues/8769">#8769</a>) should have resulted in the missing blocks being uploaded and all other blocks being found as present already.</p>
<p>Currently the re-re-re-upload of subcollections 1, 4, 5 has reached nearly 60% and indeed the Data Manager in <code>-dry-run</code> mode reports a wholesome situation:</p>
<pre>
2016/04/14 10:38:27 Returned 10 keep disks
2016/04/14 10:38:27 Replication level distribution: map[1:741370 2:41177 3:3]
2016/04/14 10:38:29 Blocks In Collections: 782550,
Blocks In Keep: 782550.
2016/04/14 10:38:29 Replication Block Counts:
Missing From Keep: 0,
Under Replicated: 0,
Over Replicated: 41180,
Replicated Just Right: 741370,
Not In Any Collection: 0.
Replication Collection Counts:
Missing From Keep: 0,
Under Replicated: 0,
Over Replicated: 13,
Replicated Just Right: 416.
2016/04/14 10:38:29 Blocks Histogram:
2016/04/14 10:38:29 {Requested:1 Actual:1}: 741370
2016/04/14 10:38:29 {Requested:1 Actual:2}: 41177
2016/04/14 10:38:29 {Requested:1 Actual:3}: 3
</pre> Arvados - Bug #8878: Keep: sudden appearance of "missing" blockshttps://dev.arvados.org/issues/8878?journal_id=378862016-04-14T13:45:40ZPeter Grandipg@arvados.for.sabi.co.uk
<ul></ul><p>I have written a feature request for 3 instead of 2 modes of "resumption" here, to include a mode where block presence is checked at the moment of upload, instead of assumed from a previous upload:</p>
<p><a class="external" href="https://dev.arvados.org/issues/8993">https://dev.arvados.org/issues/8993</a></p> Arvados - Bug #8878: Keep: sudden appearance of "missing" blockshttps://dev.arvados.org/issues/8878?journal_id=411122016-07-15T21:23:03ZTom Morristfmorris@veritasgenetics.com
<ul></ul><p>Do the new feature request stories capture everything from here? I.e., can this issue be safely closed?</p> Arvados - Bug #8878: Keep: sudden appearance of "missing" blockshttps://dev.arvados.org/issues/8878?journal_id=494822017-03-15T15:12:56ZTom Cleggtom@curii.com
<ul><li><strong>Priority</strong> changed from <i>Urgent</i> to <i>Normal</i></li></ul> Arvados - Bug #8878: Keep: sudden appearance of "missing" blockshttps://dev.arvados.org/issues/8878?journal_id=750102019-05-30T17:54:34ZTom Morristfmorris@veritasgenetics.com
<ul><li><strong>Related to</strong> <i><a class="issue tracker-2 status-5 priority-4 priority-default closed" href="/issues/8993">Feature #8993</a>: arv-put: options for 3 modes of "resumption"</i> added</li></ul> Arvados - Bug #8878: Keep: sudden appearance of "missing" blockshttps://dev.arvados.org/issues/8878?journal_id=810072020-01-18T01:10:34ZPeter Amstutzpeter.amstutz@curii.com
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>Closed</i></li></ul>