Feature #17948
Closed
create some large collections for testing (on ce8i5, tordo, 9tee4)
Added by Ward Vandewege over 3 years ago. Updated about 3 years ago.
Description
Characteristics:
We can repeat the same block over and over so that the collection does not actually take up an enormous amount of space.
- create a garbage block of a few K
- files should be random sizes from that block
- files should have random names of different lengths
- many files in 1 directory (50k+)
- multiple directory levels
- huge manifest (close to our default 128MiB limit)
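A minimal sketch of the block-reuse idea above, assuming the usual Arvados manifest layout of one stream per line (stream name, block locator, then `position:size:filename` tokens) and a locator of the form `<md5>+<size>`. The helper names and the 4 KiB block size are hypothetical, and the block would still have to be written to Keep before a collection could reference it:

```python
import hashlib
import random

BLOCK_SIZE = 4096  # hypothetical "few K" garbage block


def build_stream(stream_name, locator, files):
    """Return one manifest line: stream name, block locator, file tokens."""
    tokens = [f"{pos}:{size}:{name}" for name, pos, size in files]
    return " ".join([stream_name, locator] + tokens) + "\n"


def make_manifest(block, n_files, rng):
    # Assumed locator format: MD5 hex digest of the data plus its size.
    locator = f"{hashlib.md5(block).hexdigest()}+{len(block)}"
    files = []
    for i in range(n_files):
        # Every file is a random-sized slice of the same block, so the
        # collection grows in file count but not in stored data.
        size = rng.randint(1, len(block))
        pos = rng.randint(0, len(block) - size)
        files.append((f"file_{i}.txt", pos, size))
    return build_stream(".", locator, files)


rng = random.Random(0)
block = bytes(rng.getrandbits(8) for _ in range(BLOCK_SIZE))
manifest = make_manifest(block, 5, rng)
```

Because only the file tokens grow, the manifest text is the part that approaches the size limit, not the stored data.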
Updated by Daniel Kutyła over 3 years ago
File names should not repeat, as that makes the dataset unrealistic.
Updated by Peter Amstutz over 3 years ago
- Target version changed from 2021-08-04 sprint to 2021-08-18 sprint
Updated by Lucas Di Pentima over 3 years ago
- Status changed from New to In Progress
Updated by Lucas Di Pentima over 3 years ago
Updates at 3fc0f9610 - branch 17948-test-collection-tool
Test run: developer-run-tests: #2623
- Adds a migration to RailsAPI that drops the collection's FTS index, as it isn't used and it prevented the creation of collections with many files.
- Adds the script tools/test-collection-create/test-collection-create.py, which allows the creation of big collections for testing purposes:
  - Every collection reuses the same 1 MiB block; files go from 1 KiB up to 1 MiB of printable chars.
  - File/dir names are made up of <adjective>_<noun>_<number>[.txt] for easy reading.
  - By default the tool will create a collection with 30k files, and print its UUID once done.
  - The user can specify the min/max number of files per directory and the min/max depth of the tree structure.
  - When depth>0, every directory will get a random number of subdirs between 1 and 10; this isn't customizable, but we can add it as an option.
  - The tool will do its best to accomplish what the user requested but will cap the manifest size to 128 MiB or a little less.
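A rough sketch of the naming scheme and the size cap described above. This is not the script's actual code: the word lists, the function names, and the per-entry overhead parameter are all placeholders chosen for illustration.

```python
import random

ADJECTIVES = ["quick", "lazy", "bright"]  # placeholder word lists
NOUNS = ["fox", "dog", "river"]
MAX_MANIFEST_BYTES = 128 * 1024 * 1024    # 128 MiB cap


def random_name(rng, used, is_file=True):
    """Return an unused <adjective>_<noun>_<number>[.txt] name."""
    while True:
        name = f"{rng.choice(ADJECTIVES)}_{rng.choice(NOUNS)}_{rng.randint(0, 99999)}"
        if is_file:
            name += ".txt"
        if name not in used:
            used.add(name)
            return name


def names_within_budget(rng, requested, per_entry_overhead,
                        budget=MAX_MANIFEST_BYTES):
    """Generate up to `requested` names, stopping before the budget is exceeded."""
    used, total, names = set(), 0, []
    for _ in range(requested):
        name = random_name(rng, used)
        # Hypothetical cost model: name length plus a fixed token overhead.
        cost = len(name) + per_entry_overhead
        if total + cost > budget:
            break
        total += cost
        names.append(name)
    return names
```

Tracking the running byte count while generating entries is one way to stop "a little less" than the limit rather than discovering the overflow only when the API call fails.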
Pending issue:
For some reason, I'm getting errors like:
Traceback (most recent call last):
  File "tools/test-collection-create/test-collection-create.py", line 165, in <module>
    sys.exit(main())
  File "tools/test-collection-create/test-collection-create.py", line 155, in main
    "manifest_text": manifest
  File "/home/lucas/venv-arvados/lib/python3.7/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/home/lucas/venv-arvados/lib/python3.7/site-packages/googleapiclient/http.py", line 835, in execute
    method=str(self.method), body=self.body, headers=self.headers)
  File "/home/lucas/venv-arvados/lib/python3.7/site-packages/googleapiclient/http.py", line 179, in _retry_request
    raise exception
  File "/home/lucas/venv-arvados/lib/python3.7/site-packages/googleapiclient/http.py", line 162, in _retry_request
    resp, content = http.request(uri, method, *args, **kwargs)
  File "/home/lucas/venv-arvados/lib/python3.7/site-packages/arvados/api.py", line 111, in _intercept_http_request
    return self.orig_http_request(uri, method, headers=headers, **kwargs)
  File "/home/lucas/venv-arvados/lib/python3.7/site-packages/httplib2/__init__.py", line 1709, in request
    conn, authority, uri, request_uri, method, body, headers, redirections, cachekey,
  File "/home/lucas/venv-arvados/lib/python3.7/site-packages/httplib2/__init__.py", line 1424, in _request
    (response, content) = self._conn_request(conn, request_uri, method, body, headers)
  File "/home/lucas/venv-arvados/lib/python3.7/site-packages/httplib2/__init__.py", line 1376, in _conn_request
    response = conn.getresponse()
  File "/usr/lib/python3.7/http/client.py", line 1352, in getresponse
    response.begin()
  File "/usr/lib/python3.7/http/client.py", line 310, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.7/http/client.py", line 271, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib/python3.7/ssl.py", line 1052, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.7/ssl.py", line 911, in read
    return self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out
...after around 12 seconds of waiting for the collection create call.
I've tried to overcome this issue by doing:
- Passed timeout=300 to arvados.api().
- Used socket.settimeout(300) to set the default at the lowest level.
- Patched socket.getdefaulttimeout() to always return 300.0 so that httplib2 would get what I want.
- Tried setting the timeouts to 1 second.
...nothing changed the behavior: it times out after ~12 seconds (but the collection creation call succeeds).
Any additional ideas will be welcome :)
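For reference, a minimal sketch of the process-wide timeout approach, assuming the standard library's `socket.setdefaulttimeout()` is the call intended by the second item in the list. It only shows that newly created sockets inherit the default; it does not reproduce or explain the ~12 second timeout.

```python
import socket

# Set the default timeout applied to sockets created from now on.
socket.setdefaulttimeout(300)

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    # A new socket picks up the process-wide default at creation time.
    per_socket = s.gettimeout()
finally:
    s.close()
```

A default set this way is overridden whenever a library calls `settimeout()` on its own socket, which may be one reason a global setting appears to have no effect.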
Updated by Peter Amstutz over 3 years ago
Reviewing 17948-test-collection-tool @ e3f75342a771d8c4a2216efd98e9ea60286e4280
- The 13000 character lines for 'adjectives' and 'nouns' make my text editor a little bit unhappy, maybe we could stash them in another python file which is imported
- I'd like to be able to set the min/max number of subdirectories per directory, in case the performance characteristics of a collection with a very large number of directories are different from those of a collection with a very large number of files
- The socket timeouts are frustrating and need to be investigated, but so long as we are able to create some large test collections, that will be good enough to unblock testing.
Updated by Peter Amstutz over 3 years ago
Even better, keep it inline and reformat the 'adjectives' and 'nouns' lines to include multiple words but have line breaks so each line is less than 100 chars long.
Which might still be 300 lines of inline data but text editors will handle it much better.
Updated by Peter Amstutz over 3 years ago
Also, the second part of this story is to actually create a few huge collections on the dev/test clusters; you can record the UUIDs here once you've done that.
Updated by Lucas Di Pentima over 3 years ago
Updates at e7f550693
- Splits the huge list into multiple lines
- Adds --min-subdirs and --max-subdirs with defaults (min=1, max=10)
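A minimal sketch of how the two options could be declared with `argparse`, using the defaults listed above. The function name, the help strings, and the min/max validation are assumptions for illustration, not taken from the script:

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(
        description="Create a large test collection (illustrative only)")
    parser.add_argument("--min-subdirs", type=int, default=1,
                        help="minimum number of subdirs per directory")
    parser.add_argument("--max-subdirs", type=int, default=10,
                        help="maximum number of subdirs per directory")
    args = parser.parse_args(argv)
    # Assumed sanity check: reject an inverted range early.
    if args.min_subdirs > args.max_subdirs:
        parser.error("--min-subdirs cannot be greater than --max-subdirs")
    return args
```

Calling `parse_args([])` yields the defaults, while `parse_args(["--min-subdirs", "2", "--max-subdirs", "5"])` narrows the range.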
To be able to create the test collections on the clusters, we need to drop the FTS index. Should I merge the entire branch first, or just cherry-pick the commit with the migration into another branch?
Updated by Lucas Di Pentima over 3 years ago
- Blocked by Idea #15430: [API] Remove the @@ list method filter added
Updated by Lucas Di Pentima over 3 years ago
- Target version changed from 2021-08-18 sprint to 2021-09-01 sprint
Updated by Peter Amstutz over 3 years ago
17948-test-collection-tool @ 261e856e7f2fe2b65b7a83f2ac70b3e35e852f3e
This LGTM, thanks!
Updated by Lucas Di Pentima over 3 years ago
- % Done changed from 0 to 100
- Status changed from In Progress to Resolved
Applied in changeset arvados|c340eecc7a03dd066792e5f046f087b8b3dfced6.