Bug #8769

re-upload seems to consume a lot of space

Added by Peter Grandi about 5 years ago. Updated over 4 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
-
Category:
Keep
Target version:
-
Start date:
03/22/2016
Due date:
% Done:

0%

Estimated time:
Story points:
-

Description

We had a 30TiB Keep setup (5x Keepstore nodes each with 6x 1TiB Keepstore volumes) and added another 30TiB (same setup).

Then we uploaded a 25TiB collection. This failed with:


librarian@rockall$ arv-put --replication 1 --no-resume --project-uuid gcam1-j7d0g-k25rlhe6ig8p9na --name DDD_WGS_EGAD00001001114 DDDP*
25956153M / 25956153M 100.0%
arv-put: Error creating Collection on project: <HttpError 422 when requesting https://gcam1.example.com/arvados/v1/collections?ensure_unique_name=true&alt=json returned "#<NoMemoryError: failed to allocate memory>">.
Traceback (most recent call last):
 File "/usr/local/bin/arv-put", line 4, in <module>
   main()
 File "/usr/local/lib/python2.7/dist-packages/arvados/commands/put.py", line 533, in main
   stdout.write(output)
UnboundLocalError: local variable 'output' referenced before assignment

We then started to re-upload the 25TiB collection as 6x subsets, 3x at a time, and all 3 of the first re-uploads failed because of running out of space, as in:

librarian@sole$ time arv-put --replication 1 --no-resume --project-uuid gcam1-j7d0g-k25rlhe6ig8p9na --name DDD_WGS_EGAD00001001114_4 $(< ~/l4)
1241152M / 4228192M 29.4% Traceback (most recent call last):
  File "/usr/local/bin/arv-put", line 4, in <module>
    main()
  File "/usr/local/lib/python2.7/dist-packages/arvados/commands/put.py", line 484, in main
    path, max_manifest_depth=args.max_manifest_depth)
  File "/usr/local/lib/python2.7/dist-packages/arvados/commands/put.py", line 334, in write_directory_tree
    path, stream_name, max_manifest_depth)
  File "/usr/local/lib/python2.7/dist-packages/arvados/collection.py", line 216, in write_directory_tree
    self.do_queued_work()
  File "/usr/local/lib/python2.7/dist-packages/arvados/collection.py", line 144, in do_queued_work
    self._work_file()
  File "/usr/local/lib/python2.7/dist-packages/arvados/collection.py", line 157, in _work_file
    self.write(buf)
  File "/usr/local/lib/python2.7/dist-packages/arvados/collection.py", line 471, in write
    return super(ResumableCollectionWriter, self).write(data)
  File "/usr/local/lib/python2.7/dist-packages/arvados/collection.py", line 227, in write
    self.flush_data()
  File "/usr/local/lib/python2.7/dist-packages/arvados/commands/put.py", line 310, in flush_data
    super(ArvPutCollectionWriter, self).flush_data()
  File "/usr/local/lib/python2.7/dist-packages/arvados/collection.py", line 264, in flush_data
    copies=self.replication))
  File "/usr/local/lib/python2.7/dist-packages/arvados/retry.py", line 153, in num_retries_setter
    return orig_func(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/arvados/keep.py", line 1065, in put
    data_hash, copies, thread_limiter.done()), service_errors, label="service")
arvados.errors.KeepWriteError: failed to write 041e9f3b83a075608ee1227acc757b0c (wanted 1 copies but wrote 0): service http://keep9.example.com:25107/ responded with 503 HTTP/1.1 100 Continue
  HTTP/1.1 503 Service Unavailable; service http://keep0.gcam1.example.com:25107/ responded with 503 HTTP/1.1 100 Continue
  HTTP/1.1 503 Service Unavailable; service http://keep2.gcam1.example.com:25107/ responded with 503 HTTP/1.1 100 Continue
  HTTP/1.1 503 Service Unavailable; service http://keep4.gcam1.example.com:25107/ responded with 503 HTTP/1.1 100 Continue
  HTTP/1.1 503 Service Unavailable; service http://keep5.gcam1.example.com:25107/ responded with 503 HTTP/1.1 100 Continue
  HTTP/1.1 503 Service Unavailable; service http://keep7.gcam1.example.com:25107/ responded with 503 HTTP/1.1 100 Continue
  HTTP/1.1 503 Service Unavailable; service http://keep8.gcam1.example.com:25107/ responded with 503 HTTP/1.1 100 Continue
  HTTP/1.1 503 Service Unavailable; service http://keep1.gcam1.example.com:25107/ responded with 503 HTTP/1.1 100 Continue
  HTTP/1.1 503 Service Unavailable; service http://keep6.gcam1.example.com:25107/ responded with 503 HTTP/1.1 100 Continue
  HTTP/1.1 503 Service Unavailable; service http://keep3.gcam1.example.com:25107/ responded with 503 HTTP/1.1 100 Continue
  HTTP/1.1 503 Service Unavailable

real    2226m47.733s
user    135m7.266s
sys     116m52.827s

The 'arv-put' command is from the Debian package dated 160311.

What perplexed me in the above is that there was still quite a bit of free space. In the attached free space report the inflection point around "Friday" is when the re-upload was started. I was surprised to see free space decreasing rapidly for uploads of content that had allegedly already been 100% uploaded.

I have enumerated all the blocks on all 10 Keepstore servers and there are around 950k, with around 24k duplicates (and 6 triplicates); that is, there are only about 1.5TB of duplicates. Also, those duplicates are entirely on two Keepstores that were part of the first set of 5, which had filled up before the re-upload (bottom yellow and orange in the graph). There is perhaps a chance that on the original upload the "25956153M / 25956153M 100.0%" report was optimistic.

What worries me is the possibility that different hashes may be assigned to the same content. Suggestions and comments would be interesting.
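For reference, Keep locators are content-addressed: a locator is the MD5 digest of the block bytes (plus a size suffix), so the same bytes always produce the same locator no matter where or how often they are uploaded. A minimal sketch using plain hashlib (not the Arvados SDK):

```python
import hashlib

def block_locator(data: bytes) -> str:
    """Compute a Keep-style locator: MD5 of the block bytes plus the size."""
    return "%s+%d" % (hashlib.md5(data).hexdigest(), len(data))

# Identical content always hashes to the same locator; duplicates on
# disk can only come from the same block being PUT to different stores.
print(block_locator(b"abc"))  # 900150983cd24fb0d6963f7d28e17f72+3
```

So different hashes for the same content would imply corruption in transit rather than anything in the addressing scheme itself.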

keepstoresGbFree.png (29.6 KB) — Peter Grandi, 03/22/2016 03:49 PM

Related issues

Related to Arvados - Story #8791: [SDK] CollectionWriter file packing causes sub-optimal deduplication (status: Duplicate)

History

#1 Updated by Peter Grandi about 5 years ago

From #Arvados I got a pointer to a potential duplication issue in older arv-put: https://dev.arvados.org/issues/6358

#2 Updated by Peter Grandi about 5 years ago

Report from a typical Keepstore server for 3 typical Keepstore filetrees:

manager@keepstore07:~$ sudo du -sm /var/lib/keepstore/gcam1-keep-4[345]
1047961 /var/lib/keepstore/gcam1-keep-43
1047976 /var/lib/keepstore/gcam1-keep-44
1047960 /var/lib/keepstore/gcam1-keep-45

manager@keepstore07:~$ df -T -BG /var/lib/keepstore/gcam1-keep-43 /var/lib/keepstore/gcam1-keep-44 /var/lib/keepstore/gcam1-keep-45
Filesystem     Type 1G-blocks  Used Available Use% Mounted on
/dev/vdc1      xfs      1024G 1024G        1G 100% /var/lib/keepstore/gcam1-keep-43
/dev/vdd1      xfs      1024G 1024G        1G 100% /var/lib/keepstore/gcam1-keep-44
/dev/vde1      xfs      1024G 1024G        1G 100% /var/lib/keepstore/gcam1-keep-45

manager@keepstore07:~$ df -i /var/lib/keepstore/gcam1-keep-43 /var/lib/keepstore/gcam1-keep-44 /var/lib/keepstore/gcam1-keep-45
Filesystem     Inodes IUsed IFree IUse% Mounted on
/dev/vdc1       35982 20475 15507   57% /var/lib/keepstore/gcam1-keep-43
/dev/vdd1       28438 20469  7969   72% /var/lib/keepstore/gcam1-keep-44
/dev/vde1       36414 20475 15939   57% /var/lib/keepstore/gcam1-keep-45

#3 Updated by Peter Grandi about 5 years ago

Data Manager report appended. The 447178 unattached blocks it reports match fairly well the size of the 25TiB collection that was reported as 100% uploaded. What is perplexing is that re-uploading the very same files gets as far as 30%, as per the above reports, and then results in a no-space message.

2016/03/23 17:00:35 Returned 10 keep disks
2016/03/23 17:00:35 Replication level distribution: map[1:936463 2:25380 3:3]
2016/03/23 17:00:38 Blocks In Collections: 514668, 
Blocks In Keep: 961846.
2016/03/23 17:00:38 Replication Block Counts:
 Missing From Keep: 0, 
 Under Replicated: 0, 
 Over Replicated: 22639, 
 Replicated Just Right: 492029, 
 Not In Any Collection: 447178. 
Replication Collection Counts:
 Missing From Keep: 0, 
 Under Replicated: 0, 
 Over Replicated: 22, 
 Replicated Just Right: 395.
2016/03/23 17:00:38 Blocks Histogram:
2016/03/23 17:00:38 {Requested:0 Actual:1}:     444434
2016/03/23 17:00:38 {Requested:0 Actual:2}:       2744
2016/03/23 17:00:38 {Requested:1 Actual:1}:     492029
2016/03/23 17:00:38 {Requested:1 Actual:2}:      22636
2016/03/23 17:00:38 {Requested:1 Actual:3}:          3
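Assuming the default 64 MiB block size, the "Not In Any Collection" count does line up with the failed 25TiB upload (a back-of-envelope check only; short trailing blocks make the real figure somewhat lower):

```python
# Sanity check: do 447178 unattached blocks account for the ~25TiB
# collection that was reported as 100% uploaded?
BLOCK_MIB = 64                    # default Keep block size
unattached = 447178               # "Not In Any Collection" above
tib = unattached * BLOCK_MIB / (1024.0 * 1024.0)
print("%.1f TiB" % tib)           # about 27 TiB, in the right ballpark
```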

#4 Updated by Peter Grandi about 5 years ago

So thanks to #6358 I discovered (or rediscovered) ARVADOS_DEBUG=2, and applying it to a re-upload showed that:

  • arv-put picks "some" Keepstore (probably according to the sorting order mentioned in #6358).
  • It tries to PUT the block to it, regardless of whether the block may already be on it, and if the PUT fails, it goes on to the next Keepstore.
  • Eventually it finds a Keepstore where the PUT does not fail; if by chance that is the one where the block is already stored, the block won't be stored twice, otherwise it will be.
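The loop described above can be modelled like this (a toy model, not the actual SDK code; the `Store` class is a hypothetical stand-in for a Keepstore):

```python
class Store:
    """Toy Keepstore: a dict of blocks plus a capacity in block count."""
    def __init__(self, name, capacity):
        self.name, self.capacity, self.blocks = name, capacity, {}
    def is_full(self):
        return len(self.blocks) >= self.capacity

def put_block(stores, locator, data):
    """Try each Keepstore in probe order; store the block on the first one
    whose PUT succeeds.  No store is asked beforehand whether it already
    holds the block, so an existing copy elsewhere goes unnoticed."""
    for store in stores:
        if store.is_full() and locator not in store.blocks:
            continue                      # PUT would get a 503; try the next
        store.blocks[locator] = data      # may duplicate a copy elsewhere
        return store
    raise IOError("wanted 1 copies but wrote 0")

# A block already on keep1 gets duplicated onto keep0 simply because
# keep0 is probed first and has room.
keep0, keep1 = Store("keep0", 10), Store("keep1", 10)
keep1.blocks["loc"] = b"data"
put_block([keep0, keep1], "loc", b"data")
print("loc" in keep0.blocks and "loc" in keep1.blocks)  # True
```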

This is rather disappointing, as it means that failed uploads usually create unwanted duplicates; that is, uploads are not idempotent as to "pragmatics".

Also there is the terrible problem that if an upload of, say, 25TB lasts many days and perchance the Data Manager runs before the collection's manifest is registered in the API server, there might be some big disappointment. IIRC this is an aspect that is being worked on, I guess with a black-grey-white state system.

For arv-put these might be possible improvements (several hopefully non-critical details omitted, like replication):

  • The Data Manager or the Keepstores maintain in the API server database a periodically updated list of all blocks present but not registered in any collection manifest.
  • Then arv-put optionally checks existing manifests and that list. If there is no list, or "some" blocks are not present in it, they get uploaded to "some" Keepstore.
  • If there is a list and "some" blocks are present in it:
    • arv-put sends a PUT request listing those hashes to the '/reserve' endpoint of all Keepstores.
    • Keepstores reply with a status per hash: 0="have-and-reserved", 1="dont-have-and-waiting", 2="dont-know".
    • Hashes for which all statuses are "dont-have-and-waiting" or "dont-know" get PUT, hash and block, to some Keepstore's '/upload' endpoint.
    • At the end, when a hash is registered in a manifest sent to the API server, a PUT is sent to the '/registered' endpoint of the relevant Keepstore.
    • A Keepstore will refuse to delete a block between its hash being PUT to the '/reserve' endpoint and its being listed in a PUT to the '/registered' endpoint.
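A sketch of that proposed flow, with a toy in-memory stand-in for the hypothetical '/reserve', '/upload' and '/registered' endpoints (none of these exist in Keep today; the status codes follow the list above):

```python
HAVE_AND_RESERVED, DONT_HAVE_AND_WAITING, DONT_KNOW = 0, 1, 2

class ReservingStore:
    """Toy Keepstore exposing the three hypothetical endpoints."""
    def __init__(self):
        self.blocks, self.reserved = {}, set()
    def reserve(self, locators):
        # A reserved hash must not be deleted until /registered releases it.
        self.reserved.update(locators)
        return {loc: HAVE_AND_RESERVED if loc in self.blocks
                else DONT_HAVE_AND_WAITING for loc in locators}
    def upload(self, loc, data):
        self.blocks[loc] = data
    def registered(self, locators):
        self.reserved -= set(locators)

def upload_with_reservation(stores, blocks):
    """Reserve every hash on every store, upload only the hashes no store
    already holds, then release the reservations once registered."""
    locators = list(blocks)
    statuses = [s.reserve(locators) for s in stores]
    for loc in locators:
        if not any(st[loc] == HAVE_AND_RESERVED for st in statuses):
            stores[0].upload(loc, blocks[loc])   # pick "some" store
    for s in stores:                             # manifest now registered
        s.registered(locators)
```

Under this scheme a re-run of an interrupted upload would skip every block some store answered "have-and-reserved" for, which is the idempotency asked for above.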

IIRC in a previous discussion someone mentioned a more persistent mechanism for "grey" status (uploaded but not yet registered), like uploading or hard-linking the block in a directory like 'incoming' on the Keepstore volume.

To discuss more on #Arvados I guess.

#5 Updated by Peter Grandi about 5 years ago

Put another way, the current algorithm for selecting a Keepstore as the destination for a block results in no duplication among Keepstores only if the number of Keepstores never changes and (maybe) they all have the same amount of free space.
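The sensitivity to the number of Keepstores can be illustrated with a rendezvous-style probe order (Keep ranks the candidate services per block by hashing the block locator together with the service identity; this sketch uses that general idea, not Keep's exact formula):

```python
import hashlib

def first_choice(locator, services):
    """The top-ranked service for a block: hash locator+service together
    and take the service with the smallest digest."""
    return min(services,
               key=lambda s: hashlib.md5((locator + s).encode()).hexdigest())

old = ["keep%d" % i for i in range(5)]            # original 5 nodes
new = old + ["keep%d" % i for i in range(5, 10)]  # after doubling to 10

moved = sum(first_choice("%032x" % b, old) != first_choice("%032x" % b, new)
            for b in range(1000))
# Roughly half the blocks now prefer one of the new services, so a
# re-upload writes fresh copies of those blocks instead of finding
# the copies already stored on the original five.
print(moved)
```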

#6 Updated by Peter Grandi about 5 years ago

Just noted on IRC that at one point we had all Keepstores 100% full, and 100% of the blocks being uploaded had already been uploaded. In that case I would have expected all re-uploads to succeed, but they all failed at about 30% in.

#7 Updated by Peter Amstutz about 5 years ago

We covered this on IRC, but to summarize:

Currently, for each directory arv-put concatenates all the files into a single "stream", and it is the "stream" that is chunked into 64 MiB blocks rather than individual files. This means files are not guaranteed to fall on block boundaries and can start in the middle of a block immediately following the end of a previous file. As a result, if files are uploaded in a different order, this results in a different "stream" which is likely to yield different blocks.
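The order dependence is easy to demonstrate with a toy version of the stream chunker (tiny blocks instead of 64 MiB, plain hashlib instead of the SDK):

```python
import hashlib

def stream_blocks(files, block_size=8):
    """Concatenate file contents into one stream, then chunk the stream
    into fixed-size blocks and return the per-block hashes."""
    stream = b"".join(files)
    return [hashlib.md5(stream[i:i + block_size]).hexdigest()
            for i in range(0, len(stream), block_size)]

files = [b"aaaaaaaaaa", b"bbbbb", b"ccccccc"]    # sizes not block-aligned
shared = set(stream_blocks(files)) & set(stream_blocks(files[::-1]))
# Reordering the files shifts every byte's offset in the stream, so the
# two "uploads" share no blocks even though the content is identical.
print(len(shared))  # 0
```

With per-file block streams, each file's blocks depend only on that file's own bytes, so re-uploads deduplicate regardless of directory traversal order.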

This is not an inherent property of Keep. Data chunking decisions are made at the client level during upload, so we can change the data chunking policy without changing any of the Keep infrastructure. For example, creating collections with writable arv-mount ("arv-mount --read-write") creates a separate block stream for each file.

However, this obviously makes deduplication less effective, so I've filed #8791 to change this behavior.

#8 Updated by Tom Morris over 4 years ago

  • Status changed from New to Resolved

The new chunking behavior in #8791 should fix this.
