Bug #6844

[SDKs] arv-copy consistently gets MemoryError trying to copy an 8GiB collection

Added by Brett Smith about 4 years ago. Updated about 4 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
SDKs
Target version:
Start date:
07/31/2015
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
1.0

Description

qr1hi-4zz18-c1t7uidtqs5sg82 is public reference data for the Tuxedo pipeline. It's 8GiB.

I am trying to copy this data, using a shell node that has 4GiB RAM and 4GiB swap and current master arv-copy. While it's copying blocks, it eventually gets a MemoryError, and then the whole copy falls apart. Here's a transcript leading up to the badness, with ARVADOS_DEBUG enabled:

qr1hi-4zz18-c1t7uidtqs5sg82: 3904M / 8184M 47.7% 2015-07-31 18:48:02 arvados.keep[11112] DEBUG: c948eb1f7afcaf183552c531c08dc114+67108864+A…: ['https://keep.qr1hi.arvadosapi.com:443/']
2015-07-31 18:48:02 arvados.keep[11112] DEBUG: Request: GET https://keep.qr1hi.arvadosapi.com:443/c948eb1f7afcaf183552c531c08dc114+67108864+A…
2015-07-31 18:48:03 arvados.keep[11112] INFO: 200 response: 67108864 bytes in 1650.02799034 msec (38.787 MiB/sec)
2015-07-31 18:48:03 arvados.keep[11112] DEBUG: c948eb1f7afcaf183552c531c08dc114+67108864: ['http://keep0.xyzzy.arvadosapi.com:25107/', 'http://keep1.xyzzy.arvadosapi.com:25107/']
2015-07-31 18:48:03 arvados.keep[11112] DEBUG: KeepWriterThread <KeepWriterThread(Thread-123, started 139671556409088)> proceeding c948eb1f7afcaf183552c531c08dc114+67108864 http://keep1.xyzzy.arvadosapi.com:25107/
2015-07-31 18:48:03 arvados.keep[11112] DEBUG: Request: PUT http://keep1.xyzzy.arvadosapi.com:25107/c948eb1f7afcaf183552c531c08dc114
2015-07-31 18:48:03 arvados.keep[11112] DEBUG: KeepWriterThread <KeepWriterThread(Thread-124, started 139671564801792)> proceeding c948eb1f7afcaf183552c531c08dc114+67108864 http://keep0.xyzzy.arvadosapi.com:25107/
2015-07-31 18:48:03 arvados.keep[11112] DEBUG: Request: PUT http://keep0.xyzzy.arvadosapi.com:25107/c948eb1f7afcaf183552c531c08dc114
2015-07-31 18:48:05 arvados.keep[11112] DEBUG: KeepWriterThread <KeepWriterThread(Thread-123, started 139671556409088)> succeeded c948eb1f7afcaf183552c531c08dc114+67108864 http://keep1.xyzzy.arvadosapi.com:25107/
2015-07-31 18:48:05 arvados.keep[11112] DEBUG: KeepWriterThread <KeepWriterThread(Thread-124, started 139671564801792)> succeeded c948eb1f7afcaf183552c531c08dc114+67108864 http://keep0.xyzzy.arvadosapi.com:25107/
qr1hi-4zz18-c1t7uidtqs5sg82: 3968M / 8184M 48.5% 2015-07-31 18:48:05 arvados.keep[11112] DEBUG: 7c8de9368ec9ad742a3a27336a134dc5+67108864+A…: ['https://keep.qr1hi.arvadosapi.com:443/']
2015-07-31 18:48:05 arvados.keep[11112] DEBUG: Request: GET https://keep.qr1hi.arvadosapi.com:443/7c8de9368ec9ad742a3a27336a134dc5+67108864+A…
2015-07-31 18:48:07 arvados.keep[11112] INFO: 200 response: 67108864 bytes in 1569.37599182 msec (40.781 MiB/sec)
2015-07-31 18:48:07 arvados.keep[11112] DEBUG: 7c8de9368ec9ad742a3a27336a134dc5+67108864: ['http://keep1.xyzzy.arvadosapi.com:25107/', 'http://keep0.xyzzy.arvadosapi.com:25107/']
2015-07-31 18:48:07 arvados.keep[11112] DEBUG: KeepWriterThread <KeepWriterThread(Thread-125, started 139671564801792)> proceeding 7c8de9368ec9ad742a3a27336a134dc5+67108864 http://keep1.xyzzy.arvadosapi.com:25107/
2015-07-31 18:48:07 arvados.keep[11112] DEBUG: Request: PUT http://keep1.xyzzy.arvadosapi.com:25107/7c8de9368ec9ad742a3a27336a134dc5
2015-07-31 18:48:07 arvados.keep[11112] DEBUG: KeepWriterThread <KeepWriterThread(Thread-126, started 139671556409088)> proceeding 7c8de9368ec9ad742a3a27336a134dc5+67108864 http://keep0.xyzzy.arvadosapi.com:25107/
2015-07-31 18:48:07 arvados.keep[11112] DEBUG: Request: PUT http://keep0.xyzzy.arvadosapi.com:25107/7c8de9368ec9ad742a3a27336a134dc5
2015-07-31 18:48:09 arvados.keep[11112] DEBUG: KeepWriterThread <KeepWriterThread(Thread-125, started 139671564801792)> succeeded 7c8de9368ec9ad742a3a27336a134dc5+67108864 http://keep1.xyzzy.arvadosapi.com:25107/
2015-07-31 18:48:09 arvados.keep[11112] DEBUG: KeepWriterThread <KeepWriterThread(Thread-126, started 139671556409088)> succeeded 7c8de9368ec9ad742a3a27336a134dc5+67108864 http://keep0.xyzzy.arvadosapi.com:25107/
qr1hi-4zz18-c1t7uidtqs5sg82: 4032M / 8184M 49.3% 2015-07-31 18:48:09 arvados.keep[11112] DEBUG: 7ac2a965fe398a919e86392d67bce52b+67108864+A…: ['https://keep.qr1hi.arvadosapi.com:443/']
2015-07-31 18:48:09 arvados.keep[11112] DEBUG: Request: GET https://keep.qr1hi.arvadosapi.com:443/7ac2a965fe398a919e86392d67bce52b+67108864+A…
2015-07-31 18:48:10 arvados.keep[11112] INFO: 200 response: 67108864 bytes in 1575.77109337 msec (40.615 MiB/sec)
2015-07-31 18:48:10 arvados.keep[11112] DEBUG: 7ac2a965fe398a919e86392d67bce52b+67108864: ['http://keep0.xyzzy.arvadosapi.com:25107/', 'http://keep1.xyzzy.arvadosapi.com:25107/']
2015-07-31 18:48:10 arvados.keep[11112] DEBUG: KeepWriterThread <KeepWriterThread(Thread-127, started 139671556409088)> proceeding 7ac2a965fe398a919e86392d67bce52b+67108864 http://keep1.xyzzy.arvadosapi.com:25107/
2015-07-31 18:48:10 arvados.keep[11112] DEBUG: Request: PUT http://keep1.xyzzy.arvadosapi.com:25107/7ac2a965fe398a919e86392d67bce52b
2015-07-31 18:48:10 arvados.keep[11112] DEBUG: KeepWriterThread <KeepWriterThread(Thread-128, started 139671564801792)> proceeding 7ac2a965fe398a919e86392d67bce52b+67108864 http://keep0.xyzzy.arvadosapi.com:25107/
2015-07-31 18:48:10 arvados.keep[11112] DEBUG: Request: PUT http://keep0.xyzzy.arvadosapi.com:25107/7ac2a965fe398a919e86392d67bce52b
2015-07-31 18:48:12 arvados.keep[11112] DEBUG: KeepWriterThread <KeepWriterThread(Thread-128, started 139671564801792)> succeeded 7ac2a965fe398a919e86392d67bce52b+67108864 http://keep0.xyzzy.arvadosapi.com:25107/
2015-07-31 18:48:13 arvados.keep[11112] DEBUG: KeepWriterThread <KeepWriterThread(Thread-127, started 139671556409088)> succeeded 7ac2a965fe398a919e86392d67bce52b+67108864 http://keep1.xyzzy.arvadosapi.com:25107/
qr1hi-4zz18-c1t7uidtqs5sg82: 4096M / 8184M 50.0% 2015-07-31 18:48:13 arvados.keep[11112] DEBUG: 22e4cbeae1375cb0aea342ab68ddbfef+67108864+A…: ['https://keep.qr1hi.arvadosapi.com:443/']
2015-07-31 18:48:13 arvados.keep[11112] DEBUG: Request: GET https://keep.qr1hi.arvadosapi.com:443/22e4cbeae1375cb0aea342ab68ddbfef+67108864+A…
MemoryError: out of memory
2015-07-31 18:48:14 arvados.keep[11112] DEBUG: Request fail: GET https://keep.qr1hi.arvadosapi.com:443/22e4cbeae1375cb0aea342ab68ddbfef+67108864+A… => <class 'arvados.errors.HttpError'>:
2015-07-31 18:48:17 arvados.keep[11112] DEBUG: {u'qr1hi-bi6l4-ebw6id3jibfc3cm': OrderedDict([(u'href', u'/keep_services/qr1hi-bi6l4-ebw6id3jibfc3cm'), (u'kind', u'arvados#keepService'), (u'etag', u'dlywp1t0nne9dw3gds9tcyhs3'), (u'uuid', u'qr1hi-bi6l4-ebw6id3jibfc3cm'), (u'owner_uuid', u'qr1hi-tpzed-327pde5hktairf9'), (u'created_at', u'2014-07-08T02:19:44.015581000Z'), (u'modified_by_client_uuid', u'qr1hi-ozdt8-obw7foaks3qjyej'), (u'modified_by_user_uuid', u'qr1hi-tpzed-327pde5hktairf9'), (u'modified_at', u'2014-07-08T02:20:11.712252000Z'), (u'service_host', u'keep.qr1hi.arvadosapi.com'), (u'service_port', 443), (u'service_ssl_flag', True), (u'service_type', u'proxy'), (u'read_only', False), ('_service_root', 'https://keep.qr1hi.arvadosapi.com:443/')])}
2015-07-31 18:48:17 arvados.keep[11112] DEBUG: [OrderedDict([(u'href', u'/keep_services/qr1hi-bi6l4-ebw6id3jibfc3cm'), (u'kind', u'arvados#keepService'), (u'etag', u'dlywp1t0nne9dw3gds9tcyhs3'), (u'uuid', u'qr1hi-bi6l4-ebw6id3jibfc3cm'), (u'owner_uuid', u'qr1hi-tpzed-327pde5hktairf9'), (u'created_at', u'2014-07-08T02:19:44.015581000Z'), (u'modified_by_client_uuid', u'qr1hi-ozdt8-obw7foaks3qjyej'), (u'modified_by_user_uuid', u'qr1hi-tpzed-327pde5hktairf9'), (u'modified_at', u'2014-07-08T02:20:11.712252000Z'), (u'service_host', u'keep.qr1hi.arvadosapi.com'), (u'service_port', 443), (u'service_ssl_flag', True), (u'service_type', u'proxy'), (u'read_only', False), ('_service_root', 'https://keep.qr1hi.arvadosapi.com:443/')])]
2015-07-31 18:48:17 arvados.keep[11112] DEBUG: 22e4cbeae1375cb0aea342ab68ddbfef+67108864+A…: ['https://keep.qr1hi.arvadosapi.com:443/']
2015-07-31 18:48:17 arvados.keep[11112] INFO: No more Keep services to try; giving up
Traceback (most recent call last):
  File "/home/brett/pyvenv/bin/arv-copy", line 4, in <module>
    __import__('pkg_resources').run_script('arvados-python-client==0.1', 'arv-copy')
  File "/home/brett/pyvenv/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 735, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/home/brett/pyvenv/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1652, in run_script
    exec(code, namespace, namespace)
  File "/home/brett/pyvenv/lib/python2.7/site-packages/arvados_python_client-0.1-py2.7.egg/EGG-INFO/scripts/arv-copy", line 4, in <module>
    main()
  File "/home/brett/pyvenv/local/lib/python2.7/site-packages/arvados_python_client-0.1-py2.7.egg/arvados/commands/arv_copy.py", line 119, in main
    args)
  File "/home/brett/pyvenv/local/lib/python2.7/site-packages/arvados_python_client-0.1-py2.7.egg/arvados/commands/arv_copy.py", line 563, in copy_collection
    data = src_keep.get(word)
  File "/home/brett/pyvenv/local/lib/python2.7/site-packages/arvados_python_client-0.1-py2.7.egg/arvados/retry.py", line 154, in num_retries_setter
    return orig_func(self, *args, **kwargs)
  File "/home/brett/pyvenv/local/lib/python2.7/site-packages/arvados_python_client-0.1-py2.7.egg/arvados/keep.py", line 907, in get
    "failed to read {}".format(loc_s), service_errors, label="service")
arvados.errors.KeepReadError: failed to read 22e4cbeae1375cb0aea342ab68ddbfef+67108864+A…: service https://keep.qr1hi.arvadosapi.com:443/ responded with 0 (23, 'Failed writing body (0 != 16384)')

I note that the KeepWriterThread ids keep going up monotonically. Is it possible they're not being released and eventually garbage collected?


Subtasks

Task #6908: Plug leakResolvedTom Clegg

Task #6905: InvestigateResolvedTom Clegg

Task #6906: Reproduce leakage in a test caseClosedTom Clegg

Task #6903: Review 6844-py-mem-leakResolvedRadhika Chippada

Associated revisions

Revision 2a96c097
Added by Tom Clegg about 4 years ago

Merge branch '6844-py-mem-leak' closes #6844

History

#1 Updated by Peter Amstutz about 4 years ago

Check the keepproxy logs. This is a case where the real error is hidden from view.

#2 Updated by Brett Smith about 4 years ago

Peter Amstutz wrote:

Check the keepproxy logs. This is a case where the real error is hidden from view.

Everything's fine there. The "MemoryError: out of memory" line earlier in the transcript suggests that this a purely client-side OOM condition.

#3 Updated by Brett Smith about 4 years ago

I spent an hour digging into this and didn't immediately find the source. I spent some time scrutinizing our PyCurl client pool, but everything seems on the up-and-up there. And the writer threads etc. should be garbage collectible after put() returns.

I wonder if the issue is with our use of PyCurl itself? Should we be closing or releasing objects that we're currently not?

You might have to do some instrumentation to check the refcounts of live objects to help track this down.

#4 Updated by Tom Clegg about 4 years ago

  • Assigned To set to Tom Clegg

#5 Updated by Tom Clegg about 4 years ago

  • Story points set to 1.0

#6 Updated by Tom Clegg about 4 years ago

6844-py-mem-leak @ f809883

The fix is in 0988acb

Bad news: lost a feature.
  • The memory leak was @retry_method's fault. Evidently inspect.getcallargs causes reference cycles (or something like that) which is bad if an arg is often, say, a 64 MiB string. I tried various things with "del" to no avail, eventually decided that the ability to accept num_retries as a positional argument isn't important enough to worry about too much. Found a total of 3 such uses, including the test case for that specific feature.
Good news: some other fixes, for things noticed along the way
  • Clean DB after each websocket test -- and make "clean DB" clean the humans table too (even though we have no fixtures there). Now you can leave an API server up and run Python SDK tests against it repeatedly, without "websocket" leaving stuff in the DB that breaks the "empty list" test.
  • Remove arv-copy code that preserves the "missing trailing newline" manifest format error when copying a collection.
  • Remove arv-copy code that uselessly stores an unsigned manifest in Keep.
  • Fix overly broad exception-catching in arv-copy
  • Cleanup sdk/python websocket tests (faster, test more things)

#7 Updated by Tom Clegg about 4 years ago

  • Status changed from New to In Progress

#8 Updated by Tom Clegg about 4 years ago

I'd still like to see about making a memory-leak test case for this, but I don't think it should block merging the fix -- and maybe it shouldn't even block closing the issue, if it turns out to be hard to test reliably.

#9 Updated by Radhika Chippada about 4 years ago

LGTM

#10 Updated by Tom Clegg about 4 years ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados|commit:2a96c097e5a176018d078a5d6985403072e8672e.

#11 Updated by Brett Smith about 4 years ago

FWIW I was able to use the latest version to copy the original collection. Thanks.

Also available in: Atom PDF