Bug #8998
[API] Memory overflow when dumping a 25TB collection as JSON
Status: closed
Description
When uploading a collection of nearly 25TiB, spread over a bit more than 3,000 files, with arv-put,
the outcome was:
librarian@rockall$ arv-put --replication 1 --no-resume --project-uuid gcam1-j7d0g-k25rlhe6ig8p9na --name DDD_WGS_EGAD00001001114 DDDP*
25956153M / 25956153M 100.0%
arv-put: Error creating Collection on project: <HttpError 422 when requesting https://gcam1.example.com/arvados/v1/collections?ensure_unique_name=true&alt=json returned "#<NoMemoryError: failed to allocate memory>">.
Traceback (most recent call last):
  File "/usr/local/bin/arv-put", line 4, in <module>
    main()
  File "/usr/local/lib/python2.7/dist-packages/arvados/commands/put.py", line 533, in main
    stdout.write(output)
UnboundLocalError: local variable 'output' referenced before assignment
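The secondary UnboundLocalError is consistent with a pattern where `output` is only bound when collection creation succeeds, but is written out unconditionally afterwards. A minimal sketch of that failure mode and a defensive fix (hypothetical function names, not the actual put.py code):

```python
def main_sketch(create_collection):
    # Hypothetical reconstruction of the failure mode: 'output' is bound
    # only on success, but the final print runs either way, so a failed
    # API call (e.g. HTTP 422) triggers UnboundLocalError.
    try:
        output = create_collection()  # raises on API error
    except Exception as exc:
        print("arv-put: Error creating Collection on project: %s" % exc)
    print(output)  # UnboundLocalError when create_collection raised


def main_fixed(create_collection):
    # Bind a default so the later reference is always valid, and only
    # write the result when creation actually produced one.
    output = None
    try:
        output = create_collection()
    except Exception as exc:
        print("arv-put: Error creating Collection on project: %s" % exc)
    if output is not None:
        print(output)
```

The fix does not address the server-side NoMemoryError; it only replaces the misleading secondary traceback with the real error message.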
From logs:
Oh... fiddlesticks. An error occurred when Workbench sent a request to the Arvados API server. Try reloading this page. If the problem is temporary, your request might go through next time. If that doesn't work, the information below can help system administrators track down the problem.

API request URL
https://gcam1.camdc.genomicsplc.com/arvados/v1/collections/gcam1-4zz18-i4nlpovriwdxu6j

API response
{
  ":errors": [
    "#<NoMemoryError: failed to allocate memory>"
  ],
  ":error_token": "1457687649+b12feaf3"
}
and I have attached the longer backtrace from a previous log.
A 25TB upload should produce a manifest of roughly 15MB: large, but not large enough to overflow an API server with 4GiB of memory.
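The 15MB figure can be sanity-checked with back-of-envelope arithmetic. A sketch assuming 64 MiB Keep blocks and rough per-token sizes (the byte constants below are illustrative assumptions, not measured Arvados values):

```python
# Rough manifest-size estimate for a 25 TiB, ~3,000-file collection,
# assuming one locator per 64 MiB block. The per-locator and per-file
# byte counts are assumptions for illustration only.
TIB = 1024 ** 4
MIB = 1024 ** 2

collection_bytes = 25 * TIB
block_size = 64 * MIB
bytes_per_locator = 50      # assumed: md5 hex digest + "+<size>" + hints + separator
bytes_per_file_entry = 60   # assumed: position:size:filename token

n_blocks = collection_bytes // block_size   # 409,600 blocks
n_files = 3000

manifest_bytes = n_blocks * bytes_per_locator + n_files * bytes_per_file_entry
print("~%.1f MB" % (manifest_bytes / 1e6))   # → ~20.7 MB
```

The estimate lands on the same order of magnitude as the reported 15MB manifest, which supports the point that the manifest itself should fit comfortably in 4GiB; the overflow presumably comes from serialization overhead on the server, not the raw manifest size.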
We can of course allocate more memory, but it would be nice to have a guideline for how much is needed as a function of the largest collection size.
Perhaps 25TB collections are simply too large, given the resulting manifest size and my understanding that any access to a file in a collection incurs the latency of downloading the full manifest.
However, I have been told we have a requirement to support arbitrary naming conventions: it is not acceptable to split large data sets (many small files, or fewer large files) into separate collections like "data-subset-1", "data-subset-2", "data-subset-3", ... solely to work around storage system limitations.
Files