Bug #8867

'arv-put' reports 100% upload but files only partially uploaded

Added by Peter Grandi over 4 years ago. Updated 8 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
-
Category:
SDKs
Target version:
-
Start date:
04/01/2016
Due date:
% Done:

0%

Estimated time:
Story points:
-

Description

Curious situation where someone uploads around 4TiB to a mostly-full Keep with arv-put, it is reported as 100% complete fine, and the file size looks right at around 16GB:

librarian@sole$ ls -ld DDD_WGS_EGAD00001001114_3/DDDP106312.20131218.bam /local/library2/ddd-data/EGAD00001001114/DDDP106312.20131218.bam/DDDP106312.20131218.bam
-r-xr-xr-x 1 librarian librarian 16614852138 Mar 14 06:34 DDD_WGS_EGAD00001001114_3/DDDP106312.20131218.bam
-rw-r--r-- 1 librarian      9993 16614852138 Jun  6  2015 /local/library2/ddd-data/EGAD00001001114/DDDP106312.20131218.bam/DDDP106312.20131218.bam

But only the first 4GB of those 16GB are actually there:

librarian@sole$ cmp -l DDD_WGS_EGAD00001001114_3/DDDP106312.20131218.bam /local/library2/ddd-data/EGAD00001001114/DDDP106312.20131218.bam/DDDP106312.20131218.bam | head -50
cmp: EOF on DDD_WGS_EGAD00001001114_3/DDDP106312.20131218.bam
librarian@sole$ dd bs=1M if=DDD_WGS_EGAD00001001114_3/DDDP106312.20131218.bam of=/dev/null
3765+2 records in
3765+2 records out
3948298240 bytes (3.9 GB) copied, 82.117 s, 48.1 MB/s
librarian@sole$ expr 16614852138 - 3948298240
12666553898

Related issues

Related to Arvados - Story #8937: [SDKs] Write integration test for when arv-put resumes from a cache with expired access tokensResolved04/13/2016

History

#1 Updated by Peter Grandi over 4 years ago

On IRC I think PeterA suggested double checking with arv-get and I was perplexed because I get an error after some GB but it is an authorization error. This seems load related because if I redo it immediately afterwards the number of GB before the authorization error increases, presumably because the first lot of GB have been cached in the Keepstore server RAM.

manager@sole:~$ arv-get 'e75de44a842eabc482d37a17536fb03a+3024643/DDDP106312.20131218.bam' | dd bs=1M of=/dev/null
3637 MiB / 15845 MiB 23.0%Traceback (most recent call last):
  File "/usr/local/bin/arv-get", line 207, in <module>
    for data in f.readall():
  File "/usr/local/lib/python2.7/dist-packages/arvados/arvfile.py", line 100, in readall
    data = self.read(size, num_retries=num_retries)
  File "/usr/local/lib/python2.7/dist-packages/arvados/arvfile.py", line 51, in before_close_wrapper
    return orig_func(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/arvados/retry.py", line 153, in num_retries_setter
    return orig_func(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/arvados/arvfile.py", line 209, in read
    num_retries=num_retries)
  File "/usr/local/lib/python2.7/dist-packages/arvados/retry.py", line 153, in num_retries_setter
    return orig_func(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/arvados/stream.py", line 85, in readfrom
    data.append(self._keepget(lr.locator, num_retries=num_retries)[lr.segment_offset:lr.segment_offset+lr.segment_size])
  File "/usr/local/lib/python2.7/dist-packages/arvados/retry.py", line 153, in num_retries_setter
    return orig_func(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/arvados/stream.py", line 74, in _keepget
    return self._keep.get(locator, num_retries=num_retries)
  File "/usr/local/lib/python2.7/dist-packages/arvados/retry.py", line 153, in num_retries_setter
    return orig_func(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/arvados/keep.py", line 980, in get
    "failed to read {}".format(loc_s), service_errors, label="service")
arvados.errors.KeepReadError: failed to read 9535cb3d5ccfe7bc32ee45b2f0883add+67108864+A0dff717e2fad38e4d8603620b2aa0dcc4fdd2751@5703d80a: service http://keep2.gcam1.exampleplc.com:25107/ responded with 401 HTTP/1.1 401 Unauthorized
; service http://keep9.gcam1.exampleplc.com:25107/ responded with 401 HTTP/1.1 401 Unauthorized
; service http://keep8.gcam1.exampleplc.com:25107/ responded with 401 HTTP/1.1 401 Unauthorized
; service http://keep0.gcam1.exampleplc.com:25107/ responded with 401 HTTP/1.1 401 Unauthorized
; service http://keep6.gcam1.exampleplc.com:25107/ responded with 401 HTTP/1.1 401 Unauthorized
; service http://keep7.gcam1.exampleplc.com:25107/ responded with 401 HTTP/1.1 401 Unauthorized
; service http://keep3.gcam1.exampleplc.com:25107/ responded with 401 HTTP/1.1 401 Unauthorized
; service http://keep4.gcam1.exampleplc.com:25107/ responded with 401 HTTP/1.1 401 Unauthorized
; service http://keep5.gcam1.exampleplc.com:25107/ responded with 401 HTTP/1.1 401 Unauthorized
; service http://keep1.gcam1.exampleplc.com:25107/ responded with 401 HTTP/1.1 401 Unauthorized

3+7268 records in
3+7268 records out
3813670912 bytes (3.8 GB) copied, 107.575 s, 35.5 MB/s

and then goes a bit further:

manager@sole:~$ arv-get 'e75de44a842eabc482d37a17536fb03a+3024643/DDDP106312.20131218.bam' | dd bs=1M of=/dev/null
5173 MiB / 15845 MiB 32.6%Traceback (most recent call last):
  File "/usr/local/bin/arv-get", line 207, in <module>
    for data in f.readall():
  File "/usr/local/lib/python2.7/dist-packages/arvados/arvfile.py", line 100, in readall
    data = self.read(size, num_retries=num_retries)
  File "/usr/local/lib/python2.7/dist-packages/arvados/arvfile.py", line 51, in before_close_wrapper
    return orig_func(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/arvados/retry.py", line 153, in num_retries_setter
    return orig_func(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/arvados/arvfile.py", line 209, in read
    num_retries=num_retries)
  File "/usr/local/lib/python2.7/dist-packages/arvados/retry.py", line 153, in num_retries_setter
    return orig_func(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/arvados/stream.py", line 85, in readfrom
    data.append(self._keepget(lr.locator, num_retries=num_retries)[lr.segment_offset:lr.segment_offset+lr.segment_size])
  File "/usr/local/lib/python2.7/dist-packages/arvados/retry.py", line 153, in num_retries_setter
    return orig_func(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/arvados/stream.py", line 74, in _keepget
    return self._keep.get(locator, num_retries=num_retries)
  File "/usr/local/lib/python2.7/dist-packages/arvados/retry.py", line 153, in num_retries_setter
    return orig_func(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/arvados/keep.py", line 980, in get
    "failed to read {}".format(loc_s), service_errors, label="service")
arvados.errors.KeepReadError: failed to read c2696e69993a8866df299178c61ba674+67108864+A21da046022e0a02cd3d418f3101fea89df255168@5703d8cc: service http://keep7.gcam1.exampleplc.com:25107/ responded with 401 HTTP/1.1 401 Unauthorized
; service http://keep5.gcam1.exampleplc.com:25107/ responded with 401 HTTP/1.1 401 Unauthorized
; service http://keep1.gcam1.exampleplc.com:25107/ responded with 401 HTTP/1.1 401 Unauthorized
; service http://keep8.gcam1.exampleplc.com:25107/ responded with 401 HTTP/1.1 401 Unauthorized
; service http://keep9.gcam1.exampleplc.com:25107/ responded with 401 HTTP/1.1 401 Unauthorized
; service http://keep6.gcam1.exampleplc.com:25107/ responded with 401 HTTP/1.1 401 Unauthorized
; service http://keep4.gcam1.exampleplc.com:25107/ responded with 401 HTTP/1.1 401 Unauthorized
; service http://keep0.gcam1.exampleplc.com:25107/ responded with 401 HTTP/1.1 401 Unauthorized
; service http://keep3.gcam1.exampleplc.com:25107/ responded with 401 HTTP/1.1 401 Unauthorized
; service http://keep2.gcam1.exampleplc.com:25107/ responded with 401 HTTP/1.1 401 Unauthorized

3+10340 records in
3+10340 records out
5424283648 bytes (5.4 GB) copied, 105.053 s, 51.6 MB/s

What is worrying is not only that this happens, but that arv-mount seems to return an EOF at that point and not an IO error.

#2 Updated by Peter Grandi over 4 years ago

The reason seeems to be a typo:

https://github.com/curoverse/arvados/blob/master/services/keepstore/keepstore.go

    flag.IntVar(
        &permissionTTLSec,
        "blob-signature-ttl",
        int(time.Duration(2*7*24*time.Hour).Seconds()),
        "Lifetime of blob permission signatures. "+
            "See services/api/config/application.default.yml.")

blobSignatureTTL = time.Duration(permissionTTLSec) * time.Second

The difficulty is that time.Second is defined as the second offset within the minute specified by t, in the range [0, 59]., so I suspect that should be changed to:

blobSignatureTTL = time.Duration(permissionTTLSec).Seconds()

BTW the immediate fix that I applied was to remove the "-blob-signature-ttl" option from the startup script for the keepstore daemon, as the default of 2 weeks is good if it is let to default.

#3 Updated by Peter Grandi over 4 years ago

BTW the code could use better names, because in the following:

// blobSignatureTTL is the time duration for which new permission
// signatures (returned by PUT requests) will be valid.
// Initialized by the -permission-ttl flag.
var blobSignatureTTL time.Duration

permissionTTLSec int

    flag.IntVar(
        &permissionTTLSec,
        "permission-ttl",
        0,
        "Synonym for -blob-signature-ttl.")
    flag.IntVar(
        &permissionTTLSec,
        "blob-signature-ttl",
        int(time.Duration(2*7*24*time.Hour).Seconds()),
        "Lifetime of blob permission signatures. "+
            "See services/api/config/application.default.yml.")

blobSignatureTTL = time.Duration(permissionTTLSec) * time.Second

The variables permissionTTLSec is specified in seconds, while blobSignatureTTL is specified in Duration (I guess milliseconds), but both the options "permission-ttl" and "blob-signature-ttl" are specified in seconds, and indeed the variable for both is permissionTTLSec. Perhaps the variable blobSignatureTTL could be renamed as permissionTTLDur.

#4 Updated by Peter Grandi over 4 years ago

BTW only the likely solution to the subissue that the blob signing key lifetime gets set too small is described above.

The other issue is that under arv-mount the expiry of the blob signing key results in a mere EOF and not an error.

A colleague has written a Python SDK script and that seems to return a proper IO error and not a mere EOF.

#5 Updated by Tom Clegg over 4 years ago

I think the code in keepstore.go is correct, albeit less clear than it could be. We're using (time.Duration)Seconds() here, not (*time.Time)Second():

https://play.golang.org/p/4R3ChKj0qL

If keepstore's signature TTL is too low, arv-put and arv-mount will fail when trying to save a collection manifest (after writing all of the data).

If API server's signature TTL is too low, arv-get and arv-mount can fail (401 from keepstore) when trying to read data blocks. This looks like what's happening here, but is also bug #6833, which is supposed to be fixed by 1034616dcc47472072a5c6e2d7b92c9b95c544c9. If your arv-mount and python SDK are newer than 2016-01-21, perhaps that bug isn't truly fixed?

Also, please check blob_signature_ttl in your API server configuration (/etc/arvados/api/application.yml).

Returning EOF instead of an error could be a separate bug. We should make sure that's fixed. Can you post more info about your colleague's patch?

#6 Updated by Peter Amstutz over 4 years ago

Let's see if I'm following everything:

  • After increasing the token TTL, you can download the full 16 GiB? So the title of the bug report is inaccurate, because the 'arv-put' was successful, the problem is with accessing the data.
  • I think it's possible for 'arv-put' to fail for a similar reason; if the TTL is too short the signing tokens of the first blocks to be sent could expire before the collection record is created at the end. I filed #8910 which is related to this.
  • When the TTL is too low:
    • accessing the file on arv-mount results in a truncated read & EOF instead of IO error.
    • arv-get fails with an unhandled exception (which means it at least exits nonzero?)
  • arv-mount is supposed to refresh signatures, but that isn't working in this case. Glancing at the code, it looks like the problem is that the token refresh TTL is checked on directory operations but not file operations.

#7 Updated by Peter Grandi over 4 years ago

I think the code in keepstore.go is correct, albeit less clear than it could be. We're using (time.Duration)Seconds() here, not (*time.Time)Second():

blobSignatureTTL = time.Duration(permissionTTLSec) * time.Second

#8 Updated by Tom Clegg over 4 years ago

  • We use (time.Duration)Seconds() to come up with a default value for permissionTTLSec ...and this code looks correct to me
  • We use time.Second to convert the configured permissionTTLSec to a time.Duration ...and this code looks correct to me
  • We don't use (*time.Time)Second()

Anyway, this keepstore flag doesn't affect arv-get. I suspect the reason your arv-get doesn't finish is that you have a short blob_signature_ttl in your API server configuration (/etc/arvados/api/application.yml).

If you increase that to the default 1209600 and restart API server I expect arv-get will work fine.

My comment earlier, about #6833 fixing how we handle this case, wasn't right. That bugfix only affected arv-mount, not arv-get -- and even there it's still not smart enough to spend more than blob_signature_ttl seconds reading a single file: it only notices that it needs new signatures when doing directory accesses.

#9 Updated by Peter Grandi over 4 years ago

Ooops, I got confused between the two Second indeed.

The problem is that I think I had set first a TTL of 100 for the sake of garbage collection, then I set one of 144000 (on March 24th IIRC, and pretty surely on March 30th), and that seems to not have improved things, and then I removed the -blob-signature-ttl=144000 on April 7th, relying on the default. The default hardcoded in the source seems to work.

There is a chance that the keepstore daemons were running with a TTL of 100 up to April 7th, but I think they were running with 144000, and yet reading seemed to timeout after 80 seconds or more.

#10 Updated by Peter Grandi over 4 years ago

I have brought back the keepstore daemons to 100s signature TTL and some simple tests succeeded:

root@keepstore08:/var/log/upstart# date; ps aux | grep keep                
Tue Apr 12 08:50:07 UTC 2016                                                    
keepsto+  9856  1.0  2.2 268052 83776 ?        Ssl  08:35   0:09 keepstore      
-enforce-permissions -never-delete=false -max-buffers=35 -serialize=false       
-blob-signature-ttl=100                                                         
-blob-signing-key-file=/etc/arvados/keepstore/blob-signing-key                  
-data-manager-token-file=/etc/arvados/keepstore/data-manager-token              
root     10298  0.0  0.0  11744   928 pts/1    S+   08:50   0:00 grep           
--color=auto keep
librarian@sole$ date; time arv-put --replication 1 --resume                
--project-uuid gcam1-j7d0g-k25rlhe6ig8p9na --name                               
DDD_WGS_EGAD00001001114_test1 DDDP106320.20131218.bam                           
Tue Apr 12 08:50:14 UTC 2016                                                    
6974M / 6974M 100.0%                                                            
Collection saved as 'DDD_WGS_EGAD00001001114_test1 (5)'                         
gcam1-4zz18-trnb9ilfl7v58jv                                                     

real    3m8.645s                                                                
user    0m36.562s                                                               
sys     0m32.197s
That's about 37MB/s, back to normal
librarian@sole$ date; time env arv-get -n --md5sum                         
gcam1-4zz18-ctq6sk9diw2bs4m/                                                    
Tue Apr 12 09:01:32 UTC 2016                                                    
8ceac11844aeafabe49cf2088cf4e1be  ./DDDP106320.20131218.bam                     

real    1m29.747s                                                               
user    0m50.575s                                                               
sys     0m14.948s
That's about 77MB/s.
librarian@sole$ date; time arv-get                                         
'e75de44a842eabc482d37a17536fb03a+3024643/DDDP106312.20131218.bam' | dd         
iflag=fullblock bs=1M of=/dev/null                                              
Tue Apr 12 09:30:04 UTC 2016                                                    
15845 MiB / 15845 MiB 100.0%                                                    
15845+1 records in                                                              
15845+1 records out                                                             
16614852138 bytes (17 GB) copied, 316.513 s, 52.5 MB/s                          

real    5m16.517s                                                               
user    1m27.102s                                                               
sys     0m44.404s
root@keepstore08:/var/log/upstart# date; ps aux | grep keep                
Tue Apr 12 09:36:15 UTC 2016                                                    
keepsto+  9856  0.9  2.0 268052 76568 ?        Ssl  08:35   0:34 keepstore      
-enforce-permissions -never-delete=false -max-buffers=35 -serialize=false       
-blob-signature-ttl=100                                                         
-blob-signing-key-file=/etc/arvados/keepstore/blob-signing-key                  
-data-manager-token-file=/etc/arvados/keepstore/data-manager-token              
root     11609  0.0  0.0  11744   924 pts/1    S+   09:36   0:00 grep           
--color=auto keep

Then there is a perplexing thing overall about which I am quite uncertain: IIRC arv-put does not authenticate to the keepstore daemons, only at the end to the API daemon to register the manifest (unless -raw).

So it is not clear to me which "authorization" expired here:

arvados.errors.KeepReadError: failed to read                               
6d53405ab63f0b7c9ba87f751e39e4fa+35318337+Afcda546e5b4fd44a7d91385442044bebfc8d\
e1b6@57063c45: service http://keep5.gcam1.camdc.genomicsplc.com:25107/          
responded with 401 HTTP/1.1 401 Unauthorized

Also not clear to me what is the purpose of the blob signature TTL, as it does not seem to be about authorization either.

#11 Updated by Tom Clegg over 4 years ago

The "Permission" section of the Keep server wiki page has some info about how the signatures work.

#12 Updated by Tom Clegg over 4 years ago

arvados.errors.KeepReadError: failed to read 6d53405ab63f0b7c9ba87f751e39e4fa+35318337+Afcda546e5b4fd44a7d91385442044bebfc8de1b6@57063c45: service http://keep5.gcam1.camdc.genomicsplc.com:25107/ responded with 401 HTTP/1.1 401 Unauthorized

Did this error come from arv-get, or arv-put?

This is what I'd expect from arv-get if your API server's blob_signature_ttl is too low (i.e., less than the time arv-get takes to fetch all data). Normally blob_signature_ttl is 2 weeks, and arv-get doesn't take that long, so this doesn't come up. In principle arv-get should be smarter so it can re-fetch the collection and get new signatures even if the download takes longer than blob_signature_ttl, but it doesn't yet have that capability.

#13 Updated by Brett Smith over 4 years ago

We believe this problem was caused by arv-put resuming an upload after the original batch of blocks it uploaded had been deleted by Data Manager. The enhancement in #8937 should prevent that from recurring.

#15 Updated by Peter Amstutz 8 months ago

  • Status changed from New to Resolved

Also available in: Atom PDF