Bug #8867 (closed): 'arv-put' reports 100% upload but files only partially uploaded
Description
Curious situation: someone uploads around 4 TiB to a mostly-full Keep with arv-put; the upload is reported as 100% complete, and the file size looks right at around 16 GB:
librarian@sole$ ls -ld DDD_WGS_EGAD00001001114_3/DDDP106312.20131218.bam /local/library2/ddd-data/EGAD00001001114/DDDP106312.20131218.bam/DDDP106312.20131218.bam
-r-xr-xr-x 1 librarian librarian 16614852138 Mar 14 06:34 DDD_WGS_EGAD00001001114_3/DDDP106312.20131218.bam
-rw-r--r-- 1 librarian 9993      16614852138 Jun  6  2015 /local/library2/ddd-data/EGAD00001001114/DDDP106312.20131218.bam/DDDP106312.20131218.bam
But only the first 4GB of those 16GB are actually there:
librarian@sole$ cmp -l DDD_WGS_EGAD00001001114_3/DDDP106312.20131218.bam /local/library2/ddd-data/EGAD00001001114/DDDP106312.20131218.bam/DDDP106312.20131218.bam | head -50
cmp: EOF on DDD_WGS_EGAD00001001114_3/DDDP106312.20131218.bam
librarian@sole$ dd bs=1M if=DDD_WGS_EGAD00001001114_3/DDDP106312.20131218.bam of=/dev/null
3765+2 records in
3765+2 records out
3948298240 bytes (3.9 GB) copied, 82.117 s, 48.1 MB/s
librarian@sole$ expr 16614852138 - 3948298240
12666553898
Updated by Peter Grandi over 8 years ago
On IRC I think PeterA suggested double-checking with arv-get, and I was perplexed: I do get an error after some GB, but it is an authorization error. This seems load-related, because if I redo the command immediately afterwards the number of GB read before the authorization error increases, presumably because the first lot of GB have been cached in the Keepstore servers' RAM.
manager@sole:~$ arv-get 'e75de44a842eabc482d37a17536fb03a+3024643/DDDP106312.20131218.bam' | dd bs=1M of=/dev/null
3637 MiB / 15845 MiB 23.0%
Traceback (most recent call last):
  File "/usr/local/bin/arv-get", line 207, in <module>
    for data in f.readall():
  File "/usr/local/lib/python2.7/dist-packages/arvados/arvfile.py", line 100, in readall
    data = self.read(size, num_retries=num_retries)
  File "/usr/local/lib/python2.7/dist-packages/arvados/arvfile.py", line 51, in before_close_wrapper
    return orig_func(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/arvados/retry.py", line 153, in num_retries_setter
    return orig_func(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/arvados/arvfile.py", line 209, in read
    num_retries=num_retries)
  File "/usr/local/lib/python2.7/dist-packages/arvados/retry.py", line 153, in num_retries_setter
    return orig_func(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/arvados/stream.py", line 85, in readfrom
    data.append(self._keepget(lr.locator, num_retries=num_retries)[lr.segment_offset:lr.segment_offset+lr.segment_size])
  File "/usr/local/lib/python2.7/dist-packages/arvados/retry.py", line 153, in num_retries_setter
    return orig_func(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/arvados/stream.py", line 74, in _keepget
    return self._keep.get(locator, num_retries=num_retries)
  File "/usr/local/lib/python2.7/dist-packages/arvados/retry.py", line 153, in num_retries_setter
    return orig_func(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/arvados/keep.py", line 980, in get
    "failed to read {}".format(loc_s), service_errors, label="service")
arvados.errors.KeepReadError: failed to read 9535cb3d5ccfe7bc32ee45b2f0883add+67108864+A0dff717e2fad38e4d8603620b2aa0dcc4fdd2751@5703d80a: service http://keep2.gcam1.exampleplc.com:25107/ responded with 401 HTTP/1.1 401 Unauthorized ; service http://keep9.gcam1.exampleplc.com:25107/ responded with 401 HTTP/1.1 401 Unauthorized ; service http://keep8.gcam1.exampleplc.com:25107/ responded with 401 HTTP/1.1 401 Unauthorized ; service http://keep0.gcam1.exampleplc.com:25107/ responded with 401 HTTP/1.1 401 Unauthorized ; service http://keep6.gcam1.exampleplc.com:25107/ responded with 401 HTTP/1.1 401 Unauthorized ; service http://keep7.gcam1.exampleplc.com:25107/ responded with 401 HTTP/1.1 401 Unauthorized ; service http://keep3.gcam1.exampleplc.com:25107/ responded with 401 HTTP/1.1 401 Unauthorized ; service http://keep4.gcam1.exampleplc.com:25107/ responded with 401 HTTP/1.1 401 Unauthorized ; service http://keep5.gcam1.exampleplc.com:25107/ responded with 401 HTTP/1.1 401 Unauthorized ; service http://keep1.gcam1.exampleplc.com:25107/ responded with 401 HTTP/1.1 401 Unauthorized
3+7268 records in
3+7268 records out
3813670912 bytes (3.8 GB) copied, 107.575 s, 35.5 MB/s
and an immediate retry goes a bit further:
manager@sole:~$ arv-get 'e75de44a842eabc482d37a17536fb03a+3024643/DDDP106312.20131218.bam' | dd bs=1M of=/dev/null
5173 MiB / 15845 MiB 32.6%
Traceback (most recent call last):
  File "/usr/local/bin/arv-get", line 207, in <module>
    for data in f.readall():
  File "/usr/local/lib/python2.7/dist-packages/arvados/arvfile.py", line 100, in readall
    data = self.read(size, num_retries=num_retries)
  File "/usr/local/lib/python2.7/dist-packages/arvados/arvfile.py", line 51, in before_close_wrapper
    return orig_func(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/arvados/retry.py", line 153, in num_retries_setter
    return orig_func(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/arvados/arvfile.py", line 209, in read
    num_retries=num_retries)
  File "/usr/local/lib/python2.7/dist-packages/arvados/retry.py", line 153, in num_retries_setter
    return orig_func(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/arvados/stream.py", line 85, in readfrom
    data.append(self._keepget(lr.locator, num_retries=num_retries)[lr.segment_offset:lr.segment_offset+lr.segment_size])
  File "/usr/local/lib/python2.7/dist-packages/arvados/retry.py", line 153, in num_retries_setter
    return orig_func(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/arvados/stream.py", line 74, in _keepget
    return self._keep.get(locator, num_retries=num_retries)
  File "/usr/local/lib/python2.7/dist-packages/arvados/retry.py", line 153, in num_retries_setter
    return orig_func(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/arvados/keep.py", line 980, in get
    "failed to read {}".format(loc_s), service_errors, label="service")
arvados.errors.KeepReadError: failed to read c2696e69993a8866df299178c61ba674+67108864+A21da046022e0a02cd3d418f3101fea89df255168@5703d8cc: service http://keep7.gcam1.exampleplc.com:25107/ responded with 401 HTTP/1.1 401 Unauthorized ; service http://keep5.gcam1.exampleplc.com:25107/ responded with 401 HTTP/1.1 401 Unauthorized ; service http://keep1.gcam1.exampleplc.com:25107/ responded with 401 HTTP/1.1 401 Unauthorized ; service http://keep8.gcam1.exampleplc.com:25107/ responded with 401 HTTP/1.1 401 Unauthorized ; service http://keep9.gcam1.exampleplc.com:25107/ responded with 401 HTTP/1.1 401 Unauthorized ; service http://keep6.gcam1.exampleplc.com:25107/ responded with 401 HTTP/1.1 401 Unauthorized ; service http://keep4.gcam1.exampleplc.com:25107/ responded with 401 HTTP/1.1 401 Unauthorized ; service http://keep0.gcam1.exampleplc.com:25107/ responded with 401 HTTP/1.1 401 Unauthorized ; service http://keep3.gcam1.exampleplc.com:25107/ responded with 401 HTTP/1.1 401 Unauthorized ; service http://keep2.gcam1.exampleplc.com:25107/ responded with 401 HTTP/1.1 401 Unauthorized
3+10340 records in
3+10340 records out
5424283648 bytes (5.4 GB) copied, 105.053 s, 51.6 MB/s
What is worrying is not only that this happens, but that arv-mount seems to return an EOF at that point and not an IO error.
Updated by Peter Grandi over 8 years ago
The reason seems to be a typo:
https://github.com/curoverse/arvados/blob/master/services/keepstore/keepstore.go
flag.IntVar(
    &permissionTTLSec,
    "blob-signature-ttl",
    int(time.Duration(2*7*24*time.Hour).Seconds()),
    "Lifetime of blob permission signatures. "+
        "See services/api/config/application.default.yml.")
blobSignatureTTL = time.Duration(permissionTTLSec) * time.Second
The difficulty is that time.Second is defined as "the second offset within the minute specified by t, in the range [0, 59]", so I suspect that should be changed to:
blobSignatureTTL = time.Duration(permissionTTLSec).Seconds()
BTW, the immediate fix that I applied was to remove the "-blob-signature-ttl" option from the startup script for the keepstore daemon, as the hardcoded default of 2 weeks is fine if the option is left to default.
Updated by Peter Grandi over 8 years ago
BTW the code could use better names, because in the following:
// blobSignatureTTL is the time duration for which new permission
// signatures (returned by PUT requests) will be valid.
// Initialized by the -permission-ttl flag.
var blobSignatureTTL time.Duration

permissionTTLSec int

flag.IntVar(
    &permissionTTLSec,
    "permission-ttl",
    0,
    "Synonym for -blob-signature-ttl.")
flag.IntVar(
    &permissionTTLSec,
    "blob-signature-ttl",
    int(time.Duration(2*7*24*time.Hour).Seconds()),
    "Lifetime of blob permission signatures. "+
        "See services/api/config/application.default.yml.")
blobSignatureTTL = time.Duration(permissionTTLSec) * time.Second
The variable permissionTTLSec is specified in seconds, while blobSignatureTTL is a time.Duration (I guess milliseconds), but both the options "permission-ttl" and "blob-signature-ttl" are specified in seconds, and indeed both are backed by the variable permissionTTLSec. Perhaps the variable blobSignatureTTL could be renamed to permissionTTLDur.
Updated by Peter Grandi over 8 years ago
BTW, only the likely fix for the sub-issue that the blob signing key lifetime was set too small is described above.
The other issue is that under arv-mount the expiry of the blob signatures results in a mere EOF and not an error.
A colleague has written a Python SDK script and that seems to return a proper IO error rather than a mere EOF.
Updated by Tom Clegg over 8 years ago
I think the code in keepstore.go is correct, albeit less clear than it could be. We're using (time.Duration).Seconds() here, not (*time.Time).Second():
https://play.golang.org/p/4R3ChKj0qL
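Roughly what the linked playground example demonstrates, as a sketch of my own (the names permissionTTLSec and blobSignatureTTL are borrowed from keepstore.go; everything else here is illustrative):

package main

import (
    "fmt"
    "time"
)

func main() {
    // (time.Duration).Seconds() returns the duration as a float64 number of
    // seconds; keepstore uses it to compute the two-week default flag value.
    defaultTTLSec := int(time.Duration(2 * 7 * 24 * time.Hour).Seconds())
    fmt.Println(defaultTTLSec) // 1209600

    // Multiplying the configured integer by time.Second converts it back
    // into a time.Duration (a Duration counts nanoseconds internally).
    permissionTTLSec := defaultTTLSec
    blobSignatureTTL := time.Duration(permissionTTLSec) * time.Second
    fmt.Println(blobSignatureTTL) // 336h0m0s

    // (time.Time).Second(), by contrast, is the seconds offset within the
    // minute, in the range [0, 59]; it plays no part in the keepstore code.
    fmt.Println(time.Date(2016, 4, 12, 8, 50, 7, 0, time.UTC).Second()) // 7
}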
If keepstore's signature TTL is too low, arv-put and arv-mount will fail when trying to save a collection manifest (after writing all of the data).
If API server's signature TTL is too low, arv-get and arv-mount can fail (401 from keepstore) when trying to read data blocks. This looks like what's happening here, but is also bug #6833, which is supposed to be fixed by 1034616dcc47472072a5c6e2d7b92c9b95c544c9. If your arv-mount and python SDK are newer than 2016-01-21, perhaps that bug isn't truly fixed?
Also, please check blob_signature_ttl in your API server configuration (/etc/arvados/api/application.yml).
Returning EOF instead of an error could be a separate bug. We should make sure that's fixed. Can you post more info about your colleague's patch?
Updated by Peter Amstutz over 8 years ago
Let's see if I'm following everything:
- After increasing the token TTL, you can download the full 16 GiB? So the title of the bug report is inaccurate: the 'arv-put' was actually successful; the problem is with accessing the data afterwards.
- I think it's possible for 'arv-put' to fail for a similar reason; if the TTL is too short the signing tokens of the first blocks to be sent could expire before the collection record is created at the end. I filed #8910 which is related to this.
- When the TTL is too low:
  - accessing the file on arv-mount results in a truncated read & EOF instead of an IO error
  - arv-get fails with an unhandled exception (which means it at least exits nonzero?)
- arv-mount is supposed to refresh signatures, but that isn't working in this case. Glancing at the code, it looks like the problem is that the token refresh TTL is checked on directory operations but not file operations.
Updated by Peter Grandi over 8 years ago
> I think the code in keepstore.go is correct, albeit less clear than it could be. We're using (time.Duration).Seconds() here, not (*time.Time).Second():
blobSignatureTTL = time.Duration(permissionTTLSec) * time.Second
Updated by Tom Clegg over 8 years ago
- We use (time.Duration).Seconds() to come up with a default value for permissionTTLSec ...and this code looks correct to me
- We use time.Second to convert the configured permissionTTLSec to a time.Duration ...and this code looks correct to me
- We don't use (*time.Time).Second()
Anyway, this keepstore flag doesn't affect arv-get. I suspect the reason your arv-get doesn't finish is that you have a short blob_signature_ttl in your API server configuration (/etc/arvados/api/application.yml).
If you increase that to the default 1209600 and restart API server I expect arv-get will work fine.
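(For what it's worth, 1209600 is simply the keepstore default quoted above expressed in seconds: 2 * 7 * 24 * 3600 = 1209600, i.e. two weeks.)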
My comment earlier, about #6833 fixing how we handle this case, wasn't right. That bugfix only affected arv-mount, not arv-get -- and even there it's still not smart enough to spend more than blob_signature_ttl seconds reading a single file: it only notices that it needs new signatures when doing directory accesses.
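The missing behaviour is easy to sketch. The following is purely illustrative Go (arv-mount itself is Python, and none of these names exist in Arvados): keep track of when the signed locators were fetched, and re-fetch the collection once the signatures are more than halfway to expiry, even in the middle of reading a single large file.

package main

import (
    "fmt"
    "time"
)

// signedSource is a hypothetical holder for one collection's signed block
// locators, with bookkeeping for when those signatures were obtained.
type signedSource struct {
    refresh   func() error  // hypothetical: re-fetch the collection from the API server
    fetchedAt time.Time     // when the current signatures were obtained
    ttl       time.Duration // blob_signature_ttl advertised by the API server
}

// maybeRefresh re-fetches signatures once they are more than halfway to
// expiry, so that a read taking longer than the TTL never presents a stale
// signature to keepstore.
func (s *signedSource) maybeRefresh() error {
    if time.Since(s.fetchedAt) > s.ttl/2 {
        if err := s.refresh(); err != nil {
            return err
        }
        s.fetchedAt = time.Now()
    }
    return nil
}

func main() {
    src := &signedSource{
        refresh:   func() error { fmt.Println("re-fetching signed locators"); return nil },
        fetchedAt: time.Now().Add(-90 * time.Second), // pretend the signatures are 90s old
        ttl:       100 * time.Second,                 // e.g. the short TTL used in the tests below
    }
    // The point is to call this before every block read, not only on
    // directory operations.
    if err := src.maybeRefresh(); err != nil {
        fmt.Println("refresh failed:", err)
    }
}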
Updated by Peter Grandi over 8 years ago
Oops, I did indeed get confused between the two kinds of Second.
The problem is that I think I had first set a TTL of 100 for the sake of garbage collection, then set one of 144000 (on March 24th IIRC, and pretty surely on March 30th), which did not seem to improve things, and then on April 7th I removed -blob-signature-ttl=144000 entirely, relying on the default. The default hardcoded in the source seems to work.
There is a chance that the keepstore daemons were running with a TTL of 100 up to April 7th, but I think they were running with 144000, and yet reading seemed to time out after 80 seconds or more.
Updated by Peter Grandi over 8 years ago
I have brought the keepstore daemons back to a 100 s signature TTL and some simple tests succeeded:
root@keepstore08:/var/log/upstart# date; ps aux | grep keep
Tue Apr 12 08:50:07 UTC 2016
keepsto+  9856  1.0  2.2 268052 83776 ?   Ssl  08:35   0:09 keepstore -enforce-permissions -never-delete=false -max-buffers=35 -serialize=false -blob-signature-ttl=100 -blob-signing-key-file=/etc/arvados/keepstore/blob-signing-key -data-manager-token-file=/etc/arvados/keepstore/data-manager-token
root     10298  0.0  0.0  11744   928 pts/1   S+   08:50   0:00 grep --color=auto keep
librarian@sole$ date; time arv-put --replication 1 --resume --project-uuid gcam1-j7d0g-k25rlhe6ig8p9na --name DDD_WGS_EGAD00001001114_test1 DDDP106320.20131218.bam
Tue Apr 12 08:50:14 UTC 2016
6974M / 6974M 100.0%
Collection saved as 'DDD_WGS_EGAD00001001114_test1 (5)'
gcam1-4zz18-trnb9ilfl7v58jv

real    3m8.645s
user    0m36.562s
sys     0m32.197s

That's about 37 MB/s, back to normal.
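(For reference: 6974 MiB in 3m8.6s is 6974 / 188.6 ≈ 37 MiB/s.)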
librarian@sole$ date; time env arv-get -n --md5sum gcam1-4zz18-ctq6sk9diw2bs4m/
Tue Apr 12 09:01:32 UTC 2016
8ceac11844aeafabe49cf2088cf4e1be  ./DDDP106320.20131218.bam

real    1m29.747s
user    0m50.575s
sys     0m14.948s

That's about 77 MB/s.
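(Likewise, 6974 MiB in 1m29.7s is 6974 / 89.7 ≈ 77.7 MiB/s, consistent with the figure above.)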
librarian@sole$ date; time arv-get 'e75de44a842eabc482d37a17536fb03a+3024643/DDDP106312.20131218.bam' | dd iflag=fullblock bs=1M of=/dev/null
Tue Apr 12 09:30:04 UTC 2016
15845 MiB / 15845 MiB 100.0%
15845+1 records in
15845+1 records out
16614852138 bytes (17 GB) copied, 316.513 s, 52.5 MB/s

real    5m16.517s
user    1m27.102s
sys     0m44.404s
root@keepstore08:/var/log/upstart# date; ps aux | grep keep
Tue Apr 12 09:36:15 UTC 2016
keepsto+  9856  0.9  2.0 268052 76568 ?   Ssl  08:35   0:34 keepstore -enforce-permissions -never-delete=false -max-buffers=35 -serialize=false -blob-signature-ttl=100 -blob-signing-key-file=/etc/arvados/keepstore/blob-signing-key -data-manager-token-file=/etc/arvados/keepstore/data-manager-token
root     11609  0.0  0.0  11744   924 pts/1   S+   09:36   0:00 grep --color=auto keep
Then there is a perplexing thing overall about which I am quite uncertain: IIRC arv-put does not authenticate to the keepstore daemons, only at the end to the API daemon to register the manifest (unless -raw is used).
So it is not clear to me which "authorization" expired here:
arvados.errors.KeepReadError: failed to read 6d53405ab63f0b7c9ba87f751e39e4fa+35318337+Afcda546e5b4fd44a7d91385442044bebfc8de1b6@57063c45: service http://keep5.gcam1.camdc.genomicsplc.com:25107/ responded with 401 HTTP/1.1 401 Unauthorized
It is also not clear to me what the purpose of the blob signature TTL is, as it does not seem to be about authorization either.
Updated by Tom Clegg over 8 years ago
The "Permission" section of the Keep server wiki page has some info about how the signatures work.
Updated by Tom Clegg over 8 years ago
> arvados.errors.KeepReadError: failed to read 6d53405ab63f0b7c9ba87f751e39e4fa+35318337+Afcda546e5b4fd44a7d91385442044bebfc8de1b6@57063c45: service http://keep5.gcam1.camdc.genomicsplc.com:25107/ responded with 401 HTTP/1.1 401 Unauthorized
Did this error come from arv-get, or arv-put?
This is what I'd expect from arv-get if your API server's blob_signature_ttl is too low (i.e., less than the time arv-get takes to fetch all data). Normally blob_signature_ttl is 2 weeks, and arv-get doesn't take that long, so this doesn't come up. In principle arv-get should be smarter so it can re-fetch the collection and get new signatures even if the download takes longer than blob_signature_ttl, but it doesn't yet have that capability.
Updated by Brett Smith over 8 years ago
We believe this problem was caused by arv-put resuming an upload after the original batch of blocks it uploaded had been deleted by Data Manager. The enhancement in #8937 should prevent that from recurring.
Updated by Peter Amstutz almost 5 years ago
- Status changed from New to Resolved