Arvados: Issues (https://dev.arvados.org/)
Arvados - Bug #9304 (Closed): WB: slight improvement to copy-and-paste env token setup
https://dev.arvados.org/issues/9304 - 2016-05-26 - Peter Grandi (pg@arvados.for.sabi.co.uk)
<p>When <code>HISTIGNORE</code> is not set this is slightly inappropriate: it leaves a stray leading ':' in the value, and in scripts run with option '-u' to BASH the unset expansion is an error:</p>
<pre>
HISTIGNORE=$HISTIGNORE:'export ARVADOS_API_TOKEN=*'
</pre>
<p>This could be slightly more appropriate:</p>
<pre>
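# "${HISTIGNORE+$HISTIGNORE:}" expands to "$HISTIGNORE:" only when HISTIGNORE is
# already set, so an unset variable neither trips BASH's '-u' (nounset) option
# nor leaves a stray leading ':' in the value: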
HISTIGNORE="${HISTIGNORE+$HISTIGNORE:}"'export ARVADOS_API_TOKEN=*'
</pre>

Arvados - Bug #8998 (Resolved): [API] Memory overflow when dumping a 25TB collection as JSON
https://dev.arvados.org/issues/8998 - 2016-04-15 - Peter Grandi (pg@arvados.for.sabi.co.uk)
<p>When uploading a collection of nearly 25TiB in a bit over 3,000 files with <code>arv-put</code> the outcome was:</p>
<pre>
librarian@rockall$ arv-put --replication 1 --no-resume --project-uuid <a href="https://arvadosapi.com/gcam1-j7d0g-k25rlhe6ig8p9na">gcam1-j7d0g-k25rlhe6ig8p9na</a> --name DDD_WGS_EGAD00001001114 DDDP*
25956153M / 25956153M 100.0%
arv-put: Error creating Collection on project: <HttpError 422 when requesting https://gcam1.example.com/arvados/v1/collections?ensure_unique_name=true&alt=json returned "#<NoMemoryError: failed to allocate memory>">.
Traceback (most recent call last):
File "/usr/local/bin/arv-put", line 4, in <module>
main()
File "/usr/local/lib/python2.7/dist-packages/arvados/commands/put.py", line 533, in main
stdout.write(output)
UnboundLocalError: local variable 'output' referenced before assignment
</pre>
<p>From logs:</p>
<pre>
Oh... fiddlesticks.
An error occurred when Workbench sent a request to the Arvados API server. Try reloading this page. If the problem is temporary, your request might go through next time. If that doesn't work, the information below can help system administrators track down the problem.
API request URL
https://gcam1.camdc.genomicsplc.com/arvados/v1/collections/gcam1-4zz18-i4nlpovriwdxu6j
API response
{
":errors":[
"#<NoMemoryError: failed to allocate memory>"
],
":error_token":"1457687649+b12feaf3"
}
</pre>
<p>and I have attached the longer backtrace from a previous log.</p>
<p>A 25TB upload should result in a roughly 15MB manifest; that is large, but it should not overflow an API server that has 4GiB of memory.</p>
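<p>For a rough sense of the arithmetic, here is a back-of-the-envelope sketch (the 64MiB block size is Keep's default; the per-locator and per-file character counts are assumptions, and signed locators would be longer):</p>
<pre>
# Back-of-the-envelope estimate of manifest size; not an Arvados API.
TIB = 1024 ** 4
MIB = 1024 ** 2

def manifest_size_estimate(total_bytes, num_files,
                           block_size=64 * MIB,   # Keep's default block size
                           locator_chars=45,      # assumed length of an unsigned block locator
                           file_entry_chars=80):  # assumed per-file manifest entry
    blocks = -(-total_bytes // block_size)        # ceiling division
    return blocks * locator_chars + num_files * file_entry_chars

# ~25 TiB in ~3,000 files: 409,600 blocks, i.e. roughly 18-19 MB of manifest text.
print(manifest_size_estimate(25 * TIB, 3000))
</pre>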
<p>Anyhow, we can allocate more memory, but it would be nice to have a guideline for how much is needed in relation to the largest collection size.</p>
<p>Perhaps 25TB collections are too large, especially considering the resulting manifest size, and given my understanding that any access to a file in a collection incurs the latency of downloading the full manifest.</p>
<p>But I have been told that we have a requirement for arbitrary naming conventions, where it is not acceptable to split large sets of data (many small files or fewer large files) into separate collections, like "data-subset-1", "data-subset-2", "data-subset-3", ... solely because of storage system limitations.</p>

Arvados - Idea #8997 (Closed): Keep: rethink role of "signature tokens"
https://dev.arvados.org/issues/8997 - 2016-04-15 - Peter Grandi (pg@arvados.for.sabi.co.uk)
<p>Last night, around 4AM, I woke up and suddenly understood (or think I did) the role of "signature tokens" described in:</p>
<p><a class="external" href="https://dev.arvados.org/projects/arvados/wiki/Keep_server#Permission">https://dev.arvados.org/projects/arvados/wiki/Keep_server#Permission</a></p>
<p>especially in relationship to block lifetimes as per issues:</p>
<p><a class="external" href="https://dev.arvados.org/issues/8993">https://dev.arvados.org/issues/8993</a><br /><a class="external" href="https://dev.arvados.org/issues/8878">https://dev.arvados.org/issues/8878</a><br /><a class="external" href="https://dev.arvados.org/issues/8867">https://dev.arvados.org/issues/8867</a></p>
<p>So my current understanding is...</p>
<p>In Keep, block liveness is reachability from a server-side manifest or from a client-side signature token. This is similar to UNIX directory entries plus inodes (for manifests) or open file descriptors (for files), where "authorization" in the form of holding a file descriptor implies liveness.</p>
<p>But permission tokens are client-side, persistent (even if time-limited) capabilities, unlike file descriptors, which are server-side capabilities that disappear on reboot (a raw form of garbage collection).</p>
<p>Since the Data Manager can only trace manifests, not signature tokens (which may be anywhere), it must make worst-case assumptions about them, both as to their existence and their expiry times; this implies that the block signature TTL must be monotonically increasing, which is hard to ensure.</p>
<p>One could have Keep record server-side which signature tokens have been issued (block and lifetime), and have the same signature token cover multiple blocks too, but then they become essentially temporary collections ("partial manifests" IIRC).</p>
<p>Also it is pointless for Keep to issue read permissions tokens to 'arv-put' when it uploads a block, as it does not need to read them.</p>
<p>All that 'arv-put' needs to know is that when it registers a manifest all block hashes mentioned in it are live if the registration succeeded.</p>
<p>So what should actually happen is that the API server on registering a manifest verifies it has all the blocks for the hashes in the manifest, and otherwise returns a list of the blocks it does not have (and signature tokens should just be about permissions, not necessarily imply liveness, because they should not have the same dual role file descriptors have).</p>
<p>That's because it then becomes a compare-and-swap server-side sequence of atomic transactions. In a distributed setup the best that can be hoped for is eventual or even potential convergence.</p>
<p>The verification can be based on first checking whether the hashes in the new manifest are already present in other manifests (and "locking" them for the duration), and then asking all the Keep servers to check; a non-persistent TTL guarantee for the result of that check may be given at that point.</p>
<p>Everything else is an optimization, for example:</p>
<ul>
<li>Having 'arv-put' effectively do that check during the upload, e.g. by registering temporary partial collection manifests, issuing at the end the final full manifest and after that deleting the temporary partial ones.</li>
<li>Maybe every Keepstore server keeping a persistent hint-list of blocks it has, and perhaps the API server keeping a persistent hint-list of recently known to be live blocks and on which servers.</li>
</ul>
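<p>A minimal sketch of the proposed registration handshake, as hypothetical client-side pseudocode (<code>register_manifest()</code> and <code>upload_block()</code> are placeholders, not existing SDK or API calls):</p>
<pre>
# Hypothetical sketch of "register, then upload only what the server
# reports missing"; not the current arv-put or API server behaviour.
def put_collection(manifest_text, blocks_by_hash, register_manifest, upload_block):
    while True:
        missing = register_manifest(manifest_text)   # server returns the hashes it lacks, [] on success
        if not missing:
            return True                              # all referenced blocks are live, manifest recorded
        for block_hash in missing:
            upload_block(block_hash, blocks_by_hash[block_hash])
</pre>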
<p>PS Computery-sciency stuff that may be related: Dijkstra's parallel garbage collector with "white", "black", or "grey" (being uploaded) states. Also P. Bishop's distributed parallel garbage collector, MIT TR-178 (and successors).</p>

Arvados - Feature #8993 (Closed): arv-put: options for 3 modes of "resumption"
https://dev.arvados.org/issues/8993 - 2016-04-14 - Peter Grandi (pg@arvados.for.sabi.co.uk)
<p>Because of <a class="external" href="https://dev.arvados.org/issues/8878">https://dev.arvados.org/issues/8878</a> it seems that <code>arv-put</code> has 2 modes of operation when uploading a hash+block:</p>
<ul>
<li>Without <code>--no-resume</code>, and if the hash+block is listed in the resume list, and the permissions token for it is not expired, neither hash nor block are uploaded: they are presumed to be present in Keep, and simply added to the upload manifest.</li>
<li>Otherwise, both hash and block are uploaded and it is Keep's job to avoid unnecessarily duplicating them.</li>
</ul>
<p>The problem with the first case is that the resume list is assumed to be a valid cache (until permission token expiry), while instead <strong>arguably</strong> it should be treated as a hint and verified before use.</p>
<p>The problem with the second case is that the whole block is uploaded, consuming resources, even if Keep then determines it is already present.</p>
<p>This request is for a 3rd case and a different default, for example in the following form, with 3 values for a new option <code>--upload-again</code>:</p>
<ul>
<li><code>yes</code> with the same meaning as current <code>--no-resume</code>, that is unconditionally upload all hashes and their blocks.</li>
<li><code>no</code> with a similar (or even identical) meaning as current <code>--resume</code>, that is upload hashes and blocks only if they are not mentioned in whichever already-uploaded hash list is available.</li>
<li><code>check</code> with a new meaning, to send to all Keepstore daemons the list of hashes to be uploaded (possibly in subsets), which then return a list of those that are found present, with an <em>absolute</em> expiry time, and then to upload all other hashes and blocks, and at the end upload the hashes and blocks in the returned list only if their lifetime has expired, and then write the manifest. Or some obvious variant.</li>
</ul>
<p>The default would be <code>--upload-again=yes</code> for safety, with <code>check</code> recommended and <code>no</code> suggested only for "optimistic" cases.</p>
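<p>A sketch of how the three values might drive the per-block decision (<code>in_resume_list()</code> and <code>keep_has_block()</code> are hypothetical helpers, not existing <code>arv-put</code> code):</p>
<pre>
import time

# Hypothetical decision logic for the proposed --upload-again option.
def should_upload(block_hash, mode, in_resume_list, keep_has_block):
    if mode == "yes":       # like --no-resume: always upload hash and block
        return True
    if mode == "no":        # like --resume: trust the resume list
        return not in_resume_list(block_hash)
    if mode == "check":     # new: ask the Keepstores, trust only unexpired answers
        present, expires_at = keep_has_block(block_hash)   # absolute expiry time
        return not (present and expires_at > time.time())
    raise ValueError("unknown --upload-again mode: %r" % mode)
</pre>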
<p>The main flaw of the current options is that they hold, outside Keep, block state that is persistent and never verified, even though the permission expiry time indicates block lifetime only advisorily.</p>

Arvados - Feature #8992 (Closed): arv-put: option to create a "stream" per-file (default) or per-...
https://dev.arvados.org/issues/8992 - 2016-04-14 - Peter Grandi (pg@arvados.for.sabi.co.uk)
<p>As in <a class="external" href="https://dev.arvados.org/issues/8769">https://dev.arvados.org/issues/8769</a> I was dismayed to learn that <code>arv-put</code> by default computes block hashes on a per-stream (usually per-collection) basis, which means that in the general case the same file will have a different hash list depending on how it has been uploaded, that is, on its position in a "stream" containing the contents of all files in a collection, or on its being the only member of a "stream".</p>
<p>This has the surprising effect that the same file uploaded twice to the same or different Keep instances by <code>arv-put</code> may, and usually will, have different hash-lists associated with it. Apparently <code>arv-mount</code> does not do that by default.</p>
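<p>A toy illustration of why packing files into one "stream" changes their hash lists (this is not the <code>arv-put</code> implementation; the block size is shrunk to keep the example small):</p>
<pre>
import hashlib

def block_hashes(data, block_size):
    # Split the byte stream into fixed-size blocks and hash each one,
    # the way Keep identifies its (normally 64MiB) blocks by MD5.
    return [hashlib.md5(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]

file_a, file_b = b"aaaaaa", b"bbbbbb"
BS = 4   # toy block size standing in for 64MiB

per_file   = block_hashes(file_a, BS) + block_hashes(file_b, BS)
per_stream = block_hashes(file_a + file_b, BS)   # file_b no longer starts on a block boundary
print(per_file == per_stream)                    # False: same bytes, different hash lists
</pre>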
<p>Since concatenating the content of several files into one "stream" is an optimization for small files, similar to OS/MVS "partitioned datasets", or UNIX "ar" archives, it would be better if it were optional, and not the default either.</p>
<p>In part because, given the huge latencies (which include manifest download time), Keep is not well suited to storing small files; in part because most Keep storage backends have granularities well below the 64MiB size of a Keep block; in part because the behaviour is not fully documented; and in part because reproducibility of hashes across time and Keep instances would be nice.</p>
<p>It would also be nice to have a direct option similar to <code>--md5sum</code> in <code>arv-get</code> that computes the list of hashes for a local file without uploading it at all, and to add a corresponding option to <code>arv-get</code> that prints that list in exactly the same format.</p>

Arvados - Bug #8878 (Closed): Keep: sudden appearance of "missing" blocks
https://dev.arvados.org/issues/8878 - 2016-04-04 - Peter Grandi (pg@arvados.for.sabi.co.uk)
<p>I had done a "garbage collection" before Easter as follows:</p>
<pre>
2016/03/24 17:06:10 Read and processed 417 collections
2016/03/24 17:06:13 Blocks In Collections: 514668,
Blocks In Keep: 961866.
2016/03/24 17:06:13 Replication Block Counts:
Missing From Keep: 0,
Under Replicated: 0,
Over Replicated: 1650,
Replicated Just Right: 513018,
Not In Any Collection: 447198.
Replication Collection Counts:
Missing From Keep: 0,
Under Replicated: 0,
Over Replicated: 11,
Replicated Just Right: 406.
2016/03/24 17:06:13 Blocks Histogram:
2016/03/24 17:06:13 {Requested:0 Actual:1}: 444455
2016/03/24 17:06:13 {Requested:0 Actual:2}: 2743
2016/03/24 17:06:13 {Requested:1 Actual:1}: 513018
2016/03/24 17:06:13 {Requested:1 Actual:2}: 1647
2016/03/24 17:06:13 {Requested:1 Actual:3}: 3
2016/03/24 17:06:15 Sending trash list to http://keep9.gcam1.example.com:25107
2016/03/24 17:06:15 Sending trash list to http://keep3.gcam1.example.com:25107
2016/03/24 17:06:15 Sending trash list to http://keep6.gcam1.example.com:25107
2016/03/24 17:06:15 Sending trash list to http://keep5.gcam1.example.com:25107
2016/03/24 17:06:15 Sending trash list to http://keep0.gcam1.example.com:25107
2016/03/24 17:06:15 Sending trash list to http://keep4.gcam1.example.com:25107
2016/03/24 17:06:15 Sending trash list to http://keep7.gcam1.example.com:25107
2016/03/24 17:06:15 Sending trash list to http://keep8.gcam1.example.com:25107
2016/03/24 17:06:15 Sending trash list to http://keep1.gcam1.example.com:25107
2016/03/24 17:06:15 Sent trash list to http://keep1.gcam1.example.com:25107: response was HTTP 200 OK
2016/03/24 17:06:15 Sent trash list to http://keep0.gcam1.example.com:25107: response was HTTP 200 OK
2016/03/24 17:06:16 Sent trash list to http://keep4.gcam1.example.com:25107: response was HTTP 200 OK
2016/03/24 17:06:16 Sent trash list to http://keep9.gcam1.example.com:25107: response was HTTP 200 OK
2016/03/24 17:06:16 Sent trash list to http://keep3.gcam1.example.com:25107: response was HTTP 200 OK
2016/03/24 17:06:16 Sent trash list to http://keep5.gcam1.example.com:25107: response was HTTP 200 OK
2016/03/24 17:06:16 Sent trash list to http://keep8.gcam1.example.com:25107: response was HTTP 200 OK
2016/03/24 17:06:16 Sent trash list to http://keep7.gcam1.example.com:25107: response was HTTP 200 OK
2016/03/24 17:06:16 Sent trash list to http://keep6.gcam1.example.com:25107: response was HTTP 200 OK
</pre>
<p>Then, after uploading two 4TB collections over the past week, we deleted the two 4TB collections that they were meant to replace, and then I ran the Data Manager again in dry-run mode; the outcome is:</p>
<pre>
2016/04/04 12:51:17 Read and processed 421 collections
2016/04/04 12:51:19 Blocks In Collections: 782548,
Blocks In Keep: 716788.
2016/04/04 12:51:19 Replication Block Counts:
Missing From Keep: 65760,
Under Replicated: 0,
Over Replicated: 41180,
Replicated Just Right: 675608,
Not In Any Collection: 0.
Replication Collection Counts:
Missing From Keep: 3,
Under Replicated: 0,
Over Replicated: 13,
Replicated Just Right: 405.
2016/04/04 12:51:19 Blocks Histogram:
2016/04/04 12:51:19 {Requested:1 Actual:0}: 65760
2016/04/04 12:51:19 {Requested:1 Actual:1}: 675608
2016/04/04 12:51:19 {Requested:1 Actual:2}: 41177
2016/04/04 12:51:19 {Requested:1 Actual:3}: 3
</pre>
<p>It is disconcerting to see <code>{Requested:1 Actual:0}: 65760</code> (around 4TiB) but also <code>{Requested:1 Actual:2}: 41177</code> (around 2.5TiB).</p>
<p>The two collections that were uploaded to replace the two that were deleted should have been exactly identical byte for byte, as the re-uploads were from the same files using identically the same file list.</p>
<p>A question I have is whether there is a tool that can tell me which collections and files within them have missing hashes. I think that I can easily modify some of my scripts to that purpose, so I would like to know if there is a tool that I can use as a double check.</p>
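<p>For reference, a rough sketch of such a cross-check with the Python SDK (assuming a file of missing block hashes, one per line; paging and filters would need adjusting for a real run):</p>
<pre>
import re
import arvados

# Hashes reported as "missing from Keep", one 32-hex-digit MD5 per line.
missing = set(line.strip() for line in open("missing_hashes.txt") if line.strip())

api = arvados.api("v1")
for coll in api.collections().list(limit=1000).execute()["items"]:
    # Block locators in a manifest look like "<md5>+<size>..."; keep only the md5 part.
    hashes = set(m.group(1) for m in
                 re.finditer(r"\b([0-9a-f]{32})\+\d+", coll.get("manifest_text", "")))
    lost = hashes & missing
    if lost:
        print(coll["uuid"], coll.get("name", ""), len(lost), "missing blocks")
</pre>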
<p>The other question is whether I can run further consistency checks with the Data Manager, for example verifying the hashes of the data blocks.</p>

Arvados - Bug #8867 (Resolved): 'arv-put' reports 100% upload but files only partially uploaded
https://dev.arvados.org/issues/8867 - 2016-04-01 - Peter Grandi (pg@arvados.for.sabi.co.uk)
<p>Curious situation: someone uploads around 4TiB to a mostly-full Keep with <code>arv-put</code>, the upload is reported as 100% complete, and the size of the file below looks right at around 16GB:</p>
<pre>librarian@sole$ ls -ld DDD_WGS_EGAD00001001114_3/DDDP106312.20131218.bam /local/library2/ddd-data/EGAD00001001114/DDDP106312.20131218.bam/DDDP106312.20131218.bam
-r-xr-xr-x 1 librarian librarian 16614852138 Mar 14 06:34 DDD_WGS_EGAD00001001114_3/DDDP106312.20131218.bam
-rw-r--r-- 1 librarian 9993 16614852138 Jun 6 2015 /local/library2/ddd-data/EGAD00001001114/DDDP106312.20131218.bam/DDDP106312.20131218.bam
</pre>
<p>But only the first 4GB of those 16GB are actually there:</p>
<pre>librarian@sole$ cmp -l DDD_WGS_EGAD00001001114_3/DDDP106312.20131218.bam /local/library2/ddd-data/EGAD00001001114/DDDP106312.20131218.bam/DDDP106312.20131218.bam | head -50
cmp: EOF on DDD_WGS_EGAD00001001114_3/DDDP106312.20131218.bam
</pre>
<pre>librarian@sole$ dd bs=1M if=DDD_WGS_EGAD00001001114_3/DDDP106312.20131218.bam of=/dev/null
3765+2 records in
3765+2 records out
3948298240 bytes (3.9 GB) copied, 82.117 s, 48.1 MB/s
librarian@sole$ expr 16614852138 - 3948298240
12666553898
</pre>

Arvados - Bug #8769 (Resolved): re-upload seems to consume a lot of space
https://dev.arvados.org/issues/8769 - 2016-03-22 - Peter Grandi (pg@arvados.for.sabi.co.uk)
<p>We had a 30TiB Keep setup (5x Keepstore nodes each with 6x 1TiB Keepstore volumes) and added another 30TiB (same setup).</p>
<p>Then we uploaded a 25TiB collection. This failed with:</p>
<pre>
librarian@rockall$ arv-put --replication 1 --no-resume --project-uuid <a href="https://arvadosapi.com/gcam1-j7d0g-k25rlhe6ig8p9na">gcam1-j7d0g-k25rlhe6ig8p9na</a> --name DDD_WGS_EGAD00001001114 DDDP*
25956153M / 25956153M 100.0%
arv-put: Error creating Collection on project: <HttpError 422 when requesting https://gcam1.example.com/arvados/v1/collections?ensure_unique_name=true&alt=json returned "#<NoMemoryError: failed to allocate memory>">.
Traceback (most recent call last):
File "/usr/local/bin/arv-put", line 4, in <module>
main()
File "/usr/local/lib/python2.7/dist-packages/arvados/commands/put.py", line 533, in main
stdout.write(output)
UnboundLocalError: local variable 'output' referenced before assignment
</pre>
<p>We then started to re-upload the 25TiB collection as 6x subsets, 3x at a time, and all of the first 3 re-uploads failed because of running out of space, as in:</p>
<pre>
librarian@sole$ time arv-put --replication 1 --no-resume --project-uuid <a href="https://arvadosapi.com/gcam1-j7d0g-k25rlhe6ig8p9na">gcam1-j7d0g-k25rlhe6ig8p9na</a> --name DDD_WGS_EGAD00001001114_4 $(< ~/l4)
1241152M / 4228192M 29.4% Traceback (most recent call last):
File "/usr/local/bin/arv-put", line 4, in <module>
main()
File "/usr/local/lib/python2.7/dist-packages/arvados/commands/put.py", line 484, in main
path, max_manifest_depth=args.max_manifest_depth)
File "/usr/local/lib/python2.7/dist-packages/arvados/commands/put.py", line 334, in write_directory_tree
path, stream_name, max_manifest_depth)
File "/usr/local/lib/python2.7/dist-packages/arvados/collection.py", line 216, in write_directory_tree
self.do_queued_work()
File "/usr/local/lib/python2.7/dist-packages/arvados/collection.py", line 144, in do_queued_work
self._work_file()
File "/usr/local/lib/python2.7/dist-packages/arvados/collection.py", line 157, in _work_file
self.write(buf)
File "/usr/local/lib/python2.7/dist-packages/arvados/collection.py", line 471, in write
return super(ResumableCollectionWriter, self).write(data)
File "/usr/local/lib/python2.7/dist-packages/arvados/collection.py", line 227, in write
self.flush_data()
File "/usr/local/lib/python2.7/dist-packages/arvados/commands/put.py", line 310, in flush_data
super(ArvPutCollectionWriter, self).flush_data()
File "/usr/local/lib/python2.7/dist-packages/arvados/collection.py", line 264, in flush_data
copies=self.replication))
File "/usr/local/lib/python2.7/dist-packages/arvados/retry.py", line 153, in num_retries_setter
return orig_func(self, *args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/arvados/keep.py", line 1065, in put
data_hash, copies, thread_limiter.done()), service_errors, label="service")
arvados.errors.KeepWriteError: failed to write 041e9f3b83a075608ee1227acc757b0c (wanted 1 copies but wrote 0): service http://keep9.example.com:25107/ responded with 503 HTTP/1.1 100 Continue
HTTP/1.1 503 Service Unavailable; service http://keep0.gcam1.example.com:25107/ responded with 503 HTTP/1.1 100 Continue
HTTP/1.1 503 Service Unavailable; service http://keep2.gcam1.example.com:25107/ responded with 503 HTTP/1.1 100 Continue
HTTP/1.1 503 Service Unavailable; service http://keep4.gcam1.example.com:25107/ responded with 503 HTTP/1.1 100 Continue
HTTP/1.1 503 Service Unavailable; service http://keep5.gcam1.example.com:25107/ responded with 503 HTTP/1.1 100 Continue
HTTP/1.1 503 Service Unavailable; service http://keep7.gcam1.example.com:25107/ responded with 503 HTTP/1.1 100 Continue
HTTP/1.1 503 Service Unavailable; service http://keep8.gcam1.example.com:25107/ responded with 503 HTTP/1.1 100 Continue
HTTP/1.1 503 Service Unavailable; service http://keep1.gcam1.example.com:25107/ responded with 503 HTTP/1.1 100 Continue
HTTP/1.1 503 Service Unavailable; service http://keep6.gcam1.example.com:25107/ responded with 503 HTTP/1.1 100 Continue
HTTP/1.1 503 Service Unavailable; service http://keep3.gcam1.example.com:25107/ responded with 503 HTTP/1.1 100 Continue
HTTP/1.1 503 Service Unavailable
real 2226m47.733s
user 135m7.266s
sys 116m52.827s
</pre>
<p>The 'arv-put' command is from the Debian package dated 160311.</p>
<p>What perplexed me in the above is that there was still quite a bit of free space. In the attached free space report the inflection point around "Friday" is when the re-upload was started. I was surprised to see fast decreasing space for uploads of content that had already been allegedly 100% uploaded.</p>
<p>I have enumerated all the blocks on all 10 Keepstore servers and there are around 950k, with around 24k duplicates (and 6 triplicates); that is, there are only about 1.5TB of duplicates. Also those duplicates are entirely on two Keepstores that were part of the first set of 5, which had filled up before the re-upload (bottom yellow and orange in the graph). There is perhaps a chance that on the original upload the "25956153M / 25956153M 100.0%" report might have been optimistic.</p>
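<p>For reference, a minimal sketch of this kind of duplicate count, assuming one file of block names per Keepstore server (e.g. collected with <code>find /var/lib/keepstore -type f -printf '%f\n'</code>):</p>
<pre>
import sys
from collections import Counter

# One argument per Keepstore server: a file listing the block names found there.
counts = Counter()
for path in sys.argv[1:]:
    counts.update(set(line.strip() for line in open(path) if line.strip()))

dupes = {h: n for h, n in counts.items() if n > 1}
print(len(counts), "distinct blocks;", len(dupes), "stored on more than one server")
</pre>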
<p>What worries me is the possibility that different hashes may be assigned to the same content. Suggestions and comments would be interesting.</p>

Arvados - Bug #7573 (Resolved): Keepstore: very uneven distribution of blob between 2 Keepstore s...
https://dev.arvados.org/issues/7573 - 2015-10-15 - Peter Grandi (pg@arvados.for.sabi.co.uk)
<p>Testing Keepstore. Had <code>keep0</code> with a 100GB volume. Created 2 new filesystems, 03 and 04, copied the contents of the existing one into 03, deleted the existing one, and restarted the daemon. Created <code>keep1</code> with filesystems 01 and 02. All 4 filesystems are 1TiB. Registered <code>keep1</code> with the API server.</p>
<p>When uploading with <code>arv-put</code> a small number of files each of a few GB plus 1 file of 60GB the 64MiB blobs get distributed as follows:</p>
<p><code>keep0</code>:<br /><pre>$ find /var/lib/keepstore/gcam1-keep-04 -type f | wc -l
2264
$ find /var/lib/keepstore/gcam1-keep-03 -type f | wc -l
3268</pre></p>
<p><code>keep1</code>:<br /><pre>$ find /var/lib/keepstore/gcam1-keep-02 -type f | wc -l
3
$ find /var/lib/keepstore/gcam1-keep-01 -type f | wc -l
2</pre></p>
<p>That seems very strange to me. I can understand that filesystem 03 has more blobs than 04 because it has the "old" blobs.</p>
<p>Looking at a couple of the 5 blobs on <code>keep1</code> in 01 and 02 they seem to belong to files stored almost entirely on <code>keep0</code>. What seems strange to me is both that:</p>
<ul>
<li>Files are not evenly distributed between <code>keep0</code> and <code>keep1</code>.</li>
<li>They are evenly distributed between 03 and 04 on <code>keep0</code> but some stray blobs end up (apparently evenly distributed) on <code>keep1</code>.</li>
</ul>

Arvados - Bug #7572 (Closed): [SDKs] arv-put crashes with Broken Pipe socket.error after uploadin...
https://dev.arvados.org/issues/7572 - 2015-10-15 - Peter Grandi (pg@arvados.for.sabi.co.uk)
<p>Testing Keepstore. Uploading midsize files of up to a few GB with 'arv-put' works.</p>
<p>First attempt to upload 60GB fails with 'arv-put' crashing at the 100% mark. Attempts to upload the same 60GB file also crash at the same point. Second and further attempts as expected show only read activity on the Keepstores: all blobs have been uploaded.</p>
<p>Context: recently installed, freshly updated setup. Not using SSO but direct token.</p>
<p>From API server log:</p>
<pre>Started PUT "/arvados/v1/users/gcam1-tpzed-42l58gq9xqdzxkb" for 127.0.0.1 at 2015-10-13 14:52:32 +0000
Processing by Arvados::V1::UsersController#update as */*
Parameters: {"api_token"=>"60l1om9jukg1y7qpu1a6uqeevd29zc9rruqe0yc3anbg3k6b7f", "reader_tokens"=>"[false]", "user"=>"{\"first_name\":\"Librarian\",\"prefs\":{\"getting_started_shown\":\"2015-09-09T12:05:04.170+00:00\"}}", "id"=>"<a href="https://arvadosapi.com/gcam1-tpzed-42l58gq9xqdzxkb">gcam1-tpzed-42l58gq9xqdzxkb</a>"}
WARNING: Can't verify CSRF token authenticity
Rendered text template (0.0ms)
Completed 200 OK in 65.3ms (Views: 0.6ms | ActiveRecord: 9.3ms)</pre>
<p>The crash report itself:</p>
<pre>$ arv-put --replication 1 --no-resume --project-uuid <a href="https://arvadosapi.com/gcam1-j7d0g-soxcyrmt3m87u2z">gcam1-j7d0g-soxcyrmt3m87u2z</a> --name dbNSFP2.5 dbNSFP2.5/
60492M / 60492M 100.0%
Traceback (most recent call last):
File "/usr/local/bin/arv-put", line 4, in <module> main()
File "/usr/local/lib/python2.7/dist-packages/arvados/commands/put.py", line 517, in main
).execute(num_retries=args.retries)
File "/usr/local/lib/python2.7/dist-packages/oauth2client/util.py", line 142, in positional_wrapper
return wrapped(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/googleapiclient/http.py", line 722, in execute
body=self.body, headers=self.headers)
File "/usr/local/lib/python2.7/dist-packages/arvados/api.py", line 54, in _intercept_http_request
return self.orig_http_request(uri, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/httplib2/__init__.py", line 1609, in request
(response, content) = self._request(conn, authority, uri, request_uri, method, body, headers, redirections, cachekey)
File "/usr/local/lib/python2.7/dist-packages/httplib2/__init__.py", line 1351, in _request
(response, content) = self._conn_request(conn, request_uri, method, body, headers)
File "/usr/local/lib/python2.7/dist-packages/httplib2/__init__.py", line 1273, in _conn_request
conn.request(method, request_uri, body, headers)
File "/usr/lib/python2.7/httplib.py", line 979, in request
self._send_request(method, url, body, headers)
File "/usr/lib/python2.7/httplib.py", line 1013, in _send_request
self.endheaders(body)
File "/usr/lib/python2.7/httplib.py", line 975, in endheaders
self._send_output(message_body)
File "/usr/lib/python2.7/httplib.py", line 835, in _send_output
self.send(msg)
File "/usr/lib/python2.7/httplib.py", line 811, in send
self.sock.sendall(data)
File "/usr/lib/python2.7/ssl.py", line 329, in sendall
v = self.send(data[count:])
File "/usr/lib/python2.7/ssl.py", line 298, in send
v = self._sslobj.write(data)
socket.error: [Errno 32] Broken pipe</pre>
<p>My current wild guess is some kind of timeout.</p>