Feature #17948

closed

create some large collections for testing (on ce8i5, tordo, 9tee4)

Added by Ward Vandewege over 2 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Story points: -
Release relationship: Auto

Description

Characteristics:

We can repeat the same block over and over so that the collection does not actually take up an enormous amount of space.

  • create a garbage block of a few KiB
  • files should be random-sized slices of that block
  • files should have random names of different lengths
  • many files in 1 directory (50k+)
  • multiple directory levels
  • huge manifest (close to our default 128MiB limit)
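
As a rough illustration of the block-reuse idea above, the sketch below builds one manifest stream line in which every file is a random slice of a single block. It assumes the standard Arvados manifest layout (stream name, block locator, then `position:size:name` file tokens). The helper names (`make_block`, `block_locator`, `random_name`, `stream_line`) are hypothetical, and the sketch only builds text: the block would still need to be stored in Keep before a collection could use the manifest.

```python
import hashlib
import random
import string


def make_block(size=4096, seed=0):
    """Return `size` bytes of printable garbage (seeded so runs are repeatable)."""
    rng = random.Random(seed)
    alphabet = string.ascii_letters + string.digits
    return "".join(rng.choices(alphabet, k=size)).encode("ascii")


def block_locator(block):
    """Keep-style locator: md5 hex digest, '+', block size in bytes."""
    return "{}+{}".format(hashlib.md5(block).hexdigest(), len(block))


def random_name(rng, min_len=4, max_len=40):
    """Random lowercase file name whose length varies between the two bounds."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choices(string.ascii_lowercase, k=length))


def stream_line(stream_name, locator, block_size, file_count, rng):
    """Build one manifest line whose files are all slices of the same block."""
    names = set()
    while len(names) < file_count:  # a set guarantees names do not repeat
        names.add(random_name(rng))
    tokens = []
    for name in sorted(names):
        size = rng.randint(1, block_size)           # random file size
        start = rng.randint(0, block_size - size)   # slice stays inside the block
        tokens.append("{}:{}:{}".format(start, size, name))
    return "{} {} {}\n".format(stream_name, locator, " ".join(tokens))
```

A directory with many files would be a single call such as `stream_line("./big", locator, len(block), 50000, rng)`, and additional directory levels would be further lines with stream names like `./big/sub`.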

Subtasks 1 (0 open, 1 closed)

Task #17970: Review 17948-test-collection-tool (Resolved, Peter Amstutz, 08/06/2021)

Related issues

Blocked by Arvados - Idea #15430: [API] Remove the @@ list method filter (Resolved, Lucas Di Pentima, 08/16/2021)
Actions #1

Updated by Ward Vandewege over 2 years ago

  • Description updated (diff)
Actions #2

Updated by Ward Vandewege over 2 years ago

  • Description updated (diff)
Actions #3

Updated by Daniel Kutyła over 2 years ago

File names should not repeat, as that makes the dataset unrealistic.

Actions #4

Updated by Peter Amstutz over 2 years ago

  • Target version changed from 2021-08-04 sprint to 2021-08-18 sprint
Actions #5

Updated by Peter Amstutz over 2 years ago

  • Description updated (diff)
Actions #6

Updated by Peter Amstutz over 2 years ago

  • Assigned To set to Lucas Di Pentima
Actions #7

Updated by Lucas Di Pentima over 2 years ago

  • Status changed from New to In Progress
Actions #8

Updated by Lucas Di Pentima over 2 years ago

Updates at 3fc0f9610 - branch 17948-test-collection-tool
Test run: developer-run-tests: #2623

  • Adds a migration to RailsAPI that drops the collections' FTS index, since it isn't used and it prevented the creation of collections with many files.
  • Adds the script tools/test-collection-create/test-collection-create.py that allows the creation of big collections for testing purposes:
    • Every collection reuses the same 1 MiB block, files go from 1 KiB up to 1 MiB of printable chars.
    • File/dir names are made up of: <adjective>_<noun>_<number>[.txt] for easy reading.
    • By default the tool will create a collection with 30k files, and print its UUID once done.
    • The user can specify the min/max number of files per directory and the min/max depth of the tree structure.
    • When depth>0, every directory will get a random number of subdirs between 1 and 10. This isn't customizable yet, but we can add it as an option.
    • The tool will do its best to accomplish what the user requested but will cap the manifest size to 128 MiB or a little less.
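
To make the naming scheme and the manifest cap above concrete, here is a minimal sketch, not the tool's actual code. The short word lists are placeholders for the real adjective and noun lists, and `readable_name` and `file_tokens_within_budget` are hypothetical helpers. The cap is approximated by counting the characters each file token adds to the manifest text.

```python
import random

# Placeholder word lists; the real tool ships much longer ones.
ADJECTIVES = ["brave", "calm", "eager", "fuzzy"]
NOUNS = ["apple", "bridge", "cloud", "river"]

MAX_MANIFEST_BYTES = 128 * 1024 * 1024  # 128 MiB cap mentioned above


def readable_name(rng, is_file=True):
    """Build <adjective>_<noun>_<number>, with a .txt suffix for files."""
    name = "{}_{}_{}".format(
        rng.choice(ADJECTIVES), rng.choice(NOUNS), rng.randint(0, 99999)
    )
    return name + ".txt" if is_file else name


def file_tokens_within_budget(rng, block_size, wanted, budget=MAX_MANIFEST_BYTES):
    """Generate up to `wanted` file tokens, stopping before `budget` is exceeded.

    Assumes `wanted` is far below the number of distinct names the word
    lists can produce; otherwise the duplicate check would loop forever.
    """
    tokens, used, seen = [], 0, set()
    while len(tokens) < wanted:
        name = readable_name(rng)
        if name in seen:
            continue  # skip repeated names
        size = rng.randint(1024, block_size)  # 1 KiB up to the block size
        token = "0:{}:{}".format(size, name)
        cost = len(token) + 1  # +1 for the separating space
        if used + cost > budget:
            break  # cap reached: return fewer files than requested
        seen.add(name)
        tokens.append(token)
        used += cost
    return tokens
```

With a 1 MiB block, `file_tokens_within_budget(random.Random(0), 1 << 20, 30000)` would yield tokens for the default 30k files, while a smaller `budget` silently trims the result, mirroring the "best effort" behavior described above.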

Pending issue:

For some reason, I'm getting errors like:

Traceback (most recent call last):
  File "tools/test-collection-create/test-collection-create.py", line 165, in <module>
    sys.exit(main())
  File "tools/test-collection-create/test-collection-create.py", line 155, in main
    "manifest_text": manifest
  File "/home/lucas/venv-arvados/lib/python3.7/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/home/lucas/venv-arvados/lib/python3.7/site-packages/googleapiclient/http.py", line 835, in execute
    method=str(self.method), body=self.body, headers=self.headers)
  File "/home/lucas/venv-arvados/lib/python3.7/site-packages/googleapiclient/http.py", line 179, in _retry_request
    raise exception
  File "/home/lucas/venv-arvados/lib/python3.7/site-packages/googleapiclient/http.py", line 162, in _retry_request
    resp, content = http.request(uri, method, *args, **kwargs)
  File "/home/lucas/venv-arvados/lib/python3.7/site-packages/arvados/api.py", line 111, in _intercept_http_request
    return self.orig_http_request(uri, method, headers=headers, **kwargs)
  File "/home/lucas/venv-arvados/lib/python3.7/site-packages/httplib2/__init__.py", line 1709, in request
    conn, authority, uri, request_uri, method, body, headers, redirections, cachekey,
  File "/home/lucas/venv-arvados/lib/python3.7/site-packages/httplib2/__init__.py", line 1424, in _request
    (response, content) = self._conn_request(conn, request_uri, method, body, headers)
  File "/home/lucas/venv-arvados/lib/python3.7/site-packages/httplib2/__init__.py", line 1376, in _conn_request
    response = conn.getresponse()
  File "/usr/lib/python3.7/http/client.py", line 1352, in getresponse
    response.begin()
  File "/usr/lib/python3.7/http/client.py", line 310, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.7/http/client.py", line 271, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib/python3.7/ssl.py", line 1052, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.7/ssl.py", line 911, in read
    return self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out

...after around 12 seconds of waiting for the collection create call.

I've tried to overcome this issue by:
  • Passing timeout=300 to arvados.api().
  • Using socket.setdefaulttimeout(300) to set the default at the lowest level.
  • Patching socket.getdefaulttimeout() to always return 300.0 so that httplib2 would get what I want.
  • Setting the timeouts to 1 second.

...nothing changed the behavior: it times out after ~12 seconds (but the collection creation call succeeds).
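
One way to check whether a configured timeout has any effect is to time the failing call under each setting. The sketch below is only a diagnostic aid under that assumption: `timed_call` is a hypothetical helper, and it relies solely on the standard library's `socket.timeout` exception seen in the traceback above.

```python
import socket
import time


def timed_call(fn, *args, **kwargs):
    """Run fn and return (elapsed_seconds, result, timeout_error).

    Only socket.timeout is caught; any other exception propagates.
    """
    start = time.monotonic()
    try:
        result = fn(*args, **kwargs)
        return time.monotonic() - start, result, None
    except socket.timeout as exc:
        return time.monotonic() - start, None, exc
```

Wrapping the collection create request as `timed_call(request.execute)` after each change (for example after `socket.setdefaulttimeout(300)`) would show whether the elapsed time tracks the configured value or stays near 12 seconds, which would suggest the limit is applied somewhere else.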

Any additional ideas will be welcome :)

Actions #9

Updated by Peter Amstutz over 2 years ago

Reviewing 17948-test-collection-tool @ e3f75342a771d8c4a2216efd98e9ea60286e4280

  • The 13,000-character lines for 'adjectives' and 'nouns' make my text editor a little bit unhappy. Maybe we could stash them in another Python file which is imported.
  • I'd like to be able to set the min/max number of subdirectories per directory, in case the performance characteristics of a collection with a very large number of directories are different from those of a collection with a very large number of files.
  • The socket timeouts are frustrating and need to be investigated, but so long as we are able to create some large test collections, that will be good enough to unblock testing.
Actions #10

Updated by Peter Amstutz over 2 years ago

Even better, keep it inline and reformat the 'adjectives' and 'nouns' lines to include multiple words, with line breaks so that each line is less than 100 chars long.

That might still be 300 lines of inline data, but text editors will handle it much better.

Actions #11

Updated by Peter Amstutz over 2 years ago

Also, the second part of this story is to actually create a few huge collections on the dev/test clusters. You can record the UUIDs here once you've done that.

Actions #12

Updated by Lucas Di Pentima over 2 years ago

Updates at e7f550693

  • Splits the huge list into multiple lines
  • Adds --min-subdirs and --max-subdirs with defaults (min=1, max=10)
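
A minimal sketch of how the two new options might be wired up with argparse, assuming only the flag names and defaults listed above; the validation rule and the `subdir_count` helper are illustrative and not taken from the branch.

```python
import argparse
import random


def parse_args(argv=None):
    """Parse the subdirectory options; defaults follow the update above."""
    parser = argparse.ArgumentParser(
        description="Sketch of the subdirectory options only"
    )
    parser.add_argument("--min-subdirs", type=int, default=1,
                        help="minimum number of subdirectories per directory")
    parser.add_argument("--max-subdirs", type=int, default=10,
                        help="maximum number of subdirectories per directory")
    args = parser.parse_args(argv)
    # Illustrative sanity check: reject negative or inverted ranges.
    if args.min_subdirs < 0 or args.min_subdirs > args.max_subdirs:
        parser.error("--min-subdirs must be >= 0 and <= --max-subdirs")
    return args


def subdir_count(args, rng):
    """Pick how many subdirectories a directory gets, within the bounds."""
    return rng.randint(args.min_subdirs, args.max_subdirs)
```

For example, `parse_args(["--min-subdirs", "50", "--max-subdirs", "200"])` would describe a tree dominated by directories rather than files, which is the case the review asked to be able to test.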

To be able to create the test collections on the clusters, we need to drop the FTS index. Should I merge the entire branch first, or just cherry-pick the commit with the migration into another branch?

Actions #13

Updated by Lucas Di Pentima over 2 years ago

  • Blocked by Idea #15430: [API] Remove the @@ list method filter added
Actions #14

Updated by Lucas Di Pentima over 2 years ago

Updates (rebased branch) at 261e856

  • Dropped commits related to FTS indexes because they were moved to #15430

Waiting for #15430 to be merged to create some big collections on the dev clusters.

Actions #15

Updated by Lucas Di Pentima over 2 years ago

  • Target version changed from 2021-08-18 sprint to 2021-09-01 sprint
Actions #16

Updated by Peter Amstutz over 2 years ago

17948-test-collection-tool @ 261e856e7f2fe2b65b7a83f2ac70b3e35e852f3e

This LGTM, thanks!

Actions #17

Updated by Lucas Di Pentima over 2 years ago

  • % Done changed from 0 to 100
  • Status changed from In Progress to Resolved
Actions #18

Updated by Peter Amstutz over 2 years ago

  • Release set to 42