Bug #8769 (closed): re-upload seems to consume a lot of space

Added by Peter Grandi about 8 years ago. Updated over 7 years ago.

Status: Resolved
Priority: Normal
Assigned To: -
Category: Keep
Target version: -
Story points: -

Description

We had a 30TiB Keep setup (5x Keepstore nodes each with 6x 1TiB Keepstore volumes) and added another 30TiB (same setup).

Then we uploaded a 25TiB collection. This failed with:


librarian@rockall$ arv-put --replication 1 --no-resume --project-uuid gcam1-j7d0g-k25rlhe6ig8p9na --name DDD_WGS_EGAD00001001114 DDDP*
25956153M / 25956153M 100.0%
arv-put: Error creating Collection on project: <HttpError 422 when requesting https://gcam1.example.com/arvados/v1/collections?ensure_unique_name=true&alt=json returned "#<NoMemoryError: failed to allocate memory>">.
Traceback (most recent call last):
 File "/usr/local/bin/arv-put", line 4, in <module>
   main()
 File "/usr/local/lib/python2.7/dist-packages/arvados/commands/put.py", line 533, in main
   stdout.write(output)
UnboundLocalError: local variable 'output' referenced before assignment


We then started to re-upload the 25TiB collection as 6 subsets, 3 at a time, and all 3 of the first batch of re-uploads failed by running out of space, as in:

librarian@sole$ time arv-put --replication 1 --no-resume --project-uuid gcam1-j7d0g-k25rlhe6ig8p9na --name DDD_WGS_EGAD00001001114_4 $(< ~/l4)
1241152M / 4228192M 29.4% Traceback (most recent call last):
  File "/usr/local/bin/arv-put", line 4, in <module>
    main()
  File "/usr/local/lib/python2.7/dist-packages/arvados/commands/put.py", line 484, in main
    path, max_manifest_depth=args.max_manifest_depth)
  File "/usr/local/lib/python2.7/dist-packages/arvados/commands/put.py", line 334, in write_directory_tree
    path, stream_name, max_manifest_depth)
  File "/usr/local/lib/python2.7/dist-packages/arvados/collection.py", line 216, in write_directory_tree
    self.do_queued_work()
  File "/usr/local/lib/python2.7/dist-packages/arvados/collection.py", line 144, in do_queued_work
    self._work_file()
  File "/usr/local/lib/python2.7/dist-packages/arvados/collection.py", line 157, in _work_file
    self.write(buf)
  File "/usr/local/lib/python2.7/dist-packages/arvados/collection.py", line 471, in write
    return super(ResumableCollectionWriter, self).write(data)
  File "/usr/local/lib/python2.7/dist-packages/arvados/collection.py", line 227, in write
    self.flush_data()
  File "/usr/local/lib/python2.7/dist-packages/arvados/commands/put.py", line 310, in flush_data
    super(ArvPutCollectionWriter, self).flush_data()
  File "/usr/local/lib/python2.7/dist-packages/arvados/collection.py", line 264, in flush_data
    copies=self.replication))
  File "/usr/local/lib/python2.7/dist-packages/arvados/retry.py", line 153, in num_retries_setter
    return orig_func(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/arvados/keep.py", line 1065, in put
    data_hash, copies, thread_limiter.done()), service_errors, label="service")
arvados.errors.KeepWriteError: failed to write 041e9f3b83a075608ee1227acc757b0c (wanted 1 copies but wrote 0): service http://keep9.example.com:25107/ responded with 503 HTTP/1.1 100 Continue
  HTTP/1.1 503 Service Unavailable; service http://keep0.gcam1.example.com:25107/ responded with 503 HTTP/1.1 100 Continue
  HTTP/1.1 503 Service Unavailable; service http://keep2.gcam1.example.com:25107/ responded with 503 HTTP/1.1 100 Continue
  HTTP/1.1 503 Service Unavailable; service http://keep4.gcam1.example.com:25107/ responded with 503 HTTP/1.1 100 Continue
  HTTP/1.1 503 Service Unavailable; service http://keep5.gcam1.example.com:25107/ responded with 503 HTTP/1.1 100 Continue
  HTTP/1.1 503 Service Unavailable; service http://keep7.gcam1.example.com:25107/ responded with 503 HTTP/1.1 100 Continue
  HTTP/1.1 503 Service Unavailable; service http://keep8.gcam1.example.com:25107/ responded with 503 HTTP/1.1 100 Continue
  HTTP/1.1 503 Service Unavailable; service http://keep1.gcam1.example.com:25107/ responded with 503 HTTP/1.1 100 Continue
  HTTP/1.1 503 Service Unavailable; service http://keep6.gcam1.example.com:25107/ responded with 503 HTTP/1.1 100 Continue
  HTTP/1.1 503 Service Unavailable; service http://keep3.gcam1.example.com:25107/ responded with 503 HTTP/1.1 100 Continue
  HTTP/1.1 503 Service Unavailable

real    2226m47.733s
user    135m7.266s
sys     116m52.827s
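
For completeness, here is a rough sketch of one way the 6 subset lists (like ~/l4 used above) could be produced; the ~/l1..~/l6 filenames and the size-balancing approach are illustrative only, not necessarily what was actually done:

#!/usr/bin/env python
# Sketch: split the DDDP* files into 6 list files of roughly equal total size,
# so each subset can be fed to a separate "arv-put ... $(< ~/lN)" run.
# The ~/l1 .. ~/l6 names are assumed; only ~/l4 appears in the log above.
import glob
import os

files = sorted(glob.glob('DDDP*'), key=os.path.getsize, reverse=True)
subsets = [[] for _ in range(6)]
totals = [0] * 6

for path in files:
    i = totals.index(min(totals))        # greedy: add to the lightest subset
    subsets[i].append(path)
    totals[i] += os.path.getsize(path)

for i, subset in enumerate(subsets, 1):
    with open(os.path.expanduser('~/l%d' % i), 'w') as f:
        f.write('\n'.join(subset) + '\n')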

The 'arv-put' command is from the Debian package dated 160311.

What perplexed me in the above is that there was still quite a bit of free space. In the attached free-space report the inflection point around "Friday" is when the re-upload was started. I was surprised to see free space decreasing rapidly for uploads of content that had allegedly already been 100% uploaded.

I have enumerated all the blocks on all 10 Keepstore servers: there are around 950k of them, with around 24k duplicates (and 6 triplicates), that is, only about 1.5TB of duplicated data. Also, those duplicates are entirely on two Keepstores that were part of the first set of 5, which had filled up before the re-upload (bottom yellow and orange in the graph). There is perhaps a chance that on the original upload the "25956153M / 25956153M 100.0%" report was optimistic.
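
For reference, the kind of per-server block enumeration described above can be sketched as follows. The keep*.blocks listing filenames are hypothetical; this assumes each Keepstore block is stored as a file named after its MD5 hash, and that a "<hash> <bytes>" listing has been collected from each node (e.g. with find ... -printf '%f %s\n'):

#!/usr/bin/env python
# Rough sketch: count duplicate Keep blocks across several Keepstore servers.
# Assumes one listing file per server ("keepN.blocks"), each line "<md5hash> <bytes>".
import collections
import glob

seen = collections.defaultdict(list)   # block hash -> sizes of every copy seen
for listing in glob.glob('keep*.blocks'):
    with open(listing) as f:
        for line in f:
            block_hash, size = line.split()
            seen[block_hash].append(int(size))

total_copies = sum(len(copies) for copies in seen.values())
surplus = sum(len(copies) - 1 for copies in seen.values() if len(copies) > 1)
surplus_bytes = sum(sum(copies[1:]) for copies in seen.values() if len(copies) > 1)

print('%d block copies, %d distinct hashes' % (total_copies, len(seen)))
print('%d surplus copies, about %.1f GiB duplicated' % (surplus, surplus_bytes / 2.0 ** 30))

At around 24k surplus copies of at most 64MiB each, a count like this is consistent with the roughly 1.5TB of duplicates mentioned above.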

What worries me is the possibility that different hashes may be assigned to the same content. Suggestions and comments would be welcome.
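
As a sanity check for that concern, a small sketch using the Python SDK (assuming the usual ARVADOS_API_HOST / ARVADOS_API_TOKEN environment; the test data and replication level are arbitrary) can confirm that the locator Keep returns is the MD5 of the block contents, so identical bytes should always get identical locators:

#!/usr/bin/env python
# Sketch: confirm that identical block content always gets the identical locator.
import hashlib
import arvados

# Assumes the usual ARVADOS_API_HOST / ARVADOS_API_TOKEN environment.
kc = arvados.KeepClient()

data = 'x' * (64 * 1024)                 # arbitrary test content
expected = hashlib.md5(data).hexdigest()

loc1 = kc.put(data, copies=1)
loc2 = kc.put(data, copies=1)

print('expected md5: %s' % expected)
print('locator 1:    %s' % loc1)
print('locator 2:    %s' % loc2)

# The hash part of a locator is everything before the first '+'.
assert loc1.split('+')[0] == loc2.split('+')[0] == expected

If that always holds, then any extra space consumed by a re-upload would have to come from arv-put packing the same file data into different blocks on different runs (compare the related issue #8791 below), rather than from the hashing itself.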


Files

keepstoresGbFree.png (29.6 KB) - Peter Grandi, 03/22/2016 03:49 PM

Related issues

Related to Arvados - Idea #8791: [SDK] CollectionWriter file packing causes sub-optimal deduplication (Duplicate)
