Create some large collections for testing (on ce8i5, tordo, 9tee4)
We can repeat the same block over and over so that the collection does not actually take up an enormous amount of space.
- create a garbage block of a few K
- files should be random sizes from that block
- files should have random names of different lengths
- many files in 1 directory (50k+)
- multiple directory levels
- huge manifest (close to our default 128MiB limit)
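The naming and sizing rules above can be sketched roughly as follows. This is a minimal sketch, not the tool's actual code: the miniature word lists, the `random_name`/`random_size` helper names, and the digit range are placeholders of mine.

```python
import random

# Hypothetical miniature word lists; the real tool ships much larger ones.
ADJECTIVES = ["brave", "calm", "eager", "fuzzy", "quiet"]
NOUNS = ["falcon", "harbor", "maple", "otter", "ridge"]

def random_name(is_file=True):
    """Build a <adjective>_<noun>_<number>[.txt] style name."""
    name = "{}_{}_{}".format(
        random.choice(ADJECTIVES), random.choice(NOUNS),
        random.randint(0, 99999))
    return name + ".txt" if is_file else name

def random_size(min_size=1024, max_size=1024 * 1024):
    """Pick a file size between 1 KiB and 1 MiB."""
    return random.randint(min_size, max_size)

print(random_name(), random_size())
```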
Updated by Lucas Di Pentima about 1 year ago
- Adds a migration to RailsAPI dropping the collections FTS index, as it isn't used and prevented the creation of collections with many files.
- Adds the script tools/test-collection-create/test-collection-create.py, which allows the creation of big collections for testing purposes:
- Every collection reuses the same 1 MiB block, files go from 1 KiB up to 1 MiB of printable chars.
- File/dir names are made up of <adjective>_<noun>_<number>[.txt] for easy reading.
- By default the tool will create a collection with 30k files, and print its UUID once done.
- The user can specify min/max nr of files per directory and min/max depth of tree structure.
- When depth > 0, every directory gets a random number of subdirectories between 1 and 10; this isn't customizable yet, but we can add it as an option.
- The tool will do its best to accomplish what the user requested, but will cap the manifest size at 128 MiB or a little less.
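The single-reused-block trick from the description can be sketched like this. It follows the standard Arvados manifest format (stream name, block locator, then `position:length:filename` tokens); the `build_manifest` helper and its file-naming scheme are mine, not the tool's.

```python
import hashlib

BLOCK_SIZE = 1 << 20     # one 1 MiB data block, reused by every file
MANIFEST_CAP = 128 << 20  # stay at or below the 128 MiB manifest limit

def build_manifest(file_sizes, block_data=b"x" * BLOCK_SIZE):
    """Build a one-stream Arvados manifest where every file is a byte
    range into the same data block, so Keep stores the block once."""
    locator = "{}+{}".format(hashlib.md5(block_data).hexdigest(),
                             len(block_data))
    tokens = [".", locator]
    for i, size in enumerate(file_sizes):
        # position:length:filename -- every file starts at offset 0
        tokens.append("0:{}:file_{}.txt".format(min(size, len(block_data)), i))
    manifest = " ".join(tokens) + "\n"
    if len(manifest) > MANIFEST_CAP:
        raise ValueError("manifest exceeds the 128 MiB cap")
    return manifest

print(build_manifest([1024, 4096]))
```

Capping is done on the manifest text itself, since that (not the file data) is what hits the 128 MiB limit.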
For some reason, I'm getting errors like:
Traceback (most recent call last):
  File "tools/test-collection-create/test-collection-create.py", line 165, in <module>
    sys.exit(main())
  File "tools/test-collection-create/test-collection-create.py", line 155, in main
    "manifest_text": manifest
  File "/home/lucas/venv-arvados/lib/python3.7/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/home/lucas/venv-arvados/lib/python3.7/site-packages/googleapiclient/http.py", line 835, in execute
    method=str(self.method), body=self.body, headers=self.headers)
  File "/home/lucas/venv-arvados/lib/python3.7/site-packages/googleapiclient/http.py", line 179, in _retry_request
    raise exception
  File "/home/lucas/venv-arvados/lib/python3.7/site-packages/googleapiclient/http.py", line 162, in _retry_request
    resp, content = http.request(uri, method, *args, **kwargs)
  File "/home/lucas/venv-arvados/lib/python3.7/site-packages/arvados/api.py", line 111, in _intercept_http_request
    return self.orig_http_request(uri, method, headers=headers, **kwargs)
  File "/home/lucas/venv-arvados/lib/python3.7/site-packages/httplib2/__init__.py", line 1709, in request
    conn, authority, uri, request_uri, method, body, headers, redirections, cachekey,
  File "/home/lucas/venv-arvados/lib/python3.7/site-packages/httplib2/__init__.py", line 1424, in _request
    (response, content) = self._conn_request(conn, request_uri, method, body, headers)
  File "/home/lucas/venv-arvados/lib/python3.7/site-packages/httplib2/__init__.py", line 1376, in _conn_request
    response = conn.getresponse()
  File "/usr/lib/python3.7/http/client.py", line 1352, in getresponse
    response.begin()
  File "/usr/lib/python3.7/http/client.py", line 310, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.7/http/client.py", line 271, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib/python3.7/ssl.py", line 1052, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.7/ssl.py", line 911, in read
    return self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out
...after around 12 seconds of waiting for the collection create call.
I've tried to overcome this issue by doing:
- Calling socket.settimeout(300) to set the default at the lowest level.
- Patching socket.getdefaulttimeout() to always return a fixed value, so that httplib2 would get what I want.
- Setting the timeouts to 1 second instead.
...nothing changed the behavior: it times out after ~12 seconds (but the collection creation call itself succeeds).
Any additional ideas will be welcome :)
Updated by Peter Amstutz 12 months ago
Reviewing 17948-test-collection-tool @ e3f75342a771d8c4a2216efd98e9ea60286e4280
- The 13000-character lines for 'adjectives' and 'nouns' make my text editor a little unhappy; maybe we could stash them in another Python file which gets imported.
- I'd like to be able to set the min/max number of subdirectories per directory, in case the performance characteristics of a collection with a very large number of directories differ from those of one with a very large number of files.
- The socket timeouts are frustrating and need to be investigated, but so long as we are able to create some large test collections, that will be good enough to unblock testing.
Updated by Peter Amstutz 12 months ago
Even better: keep it inline and reformat the 'adjectives' and 'nouns' lines to include multiple words per line, with line breaks so each line is less than 100 chars long.
That might still be ~300 lines of inline data, but text editors will handle it much better.
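The reflow being suggested is mechanical; a throwaway textwrap sketch (the `reflow_word_list` helper name and 96-char width are mine) can regenerate the list literals at any width:

```python
import textwrap

def reflow_word_list(words, width=96):
    """Render a Python list literal with line breaks so no line
    exceeds `width` characters -- friendlier for text editors."""
    body = textwrap.fill(
        ", ".join(repr(w) for w in words),
        width=width, initial_indent="    ", subsequent_indent="    ")
    return "[\n{}\n]".format(body)

print(reflow_word_list(["able", "acid", "angry", "bald"], width=30))
```

The output is still a valid Python literal, so it can be pasted straight back into the script.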
Updated by Lucas Di Pentima 12 months ago
Updates at e7f550693
- Splits the huge list into multiple lines
- Adds --max-subdirs with defaults (min=1, max=10)
To be able to create the test collections on the clusters, we need to drop the FTS index. Should I merge the entire branch first, or just cherry-pick the commit with the migration into another branch?