Project

General

Profile

Actions

Bug #6844

closed

[SDKs] arv-copy consistently gets MemoryError trying to copy an 8GiB collection

Added by Brett Smith over 8 years ago. Updated over 8 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
SDKs
Target version:
Story points:
1.0

Description

qr1hi-4zz18-c1t7uidtqs5sg82 is public reference data for the Tuxedo pipeline. It's 8GiB.

I am trying to copy this data, using a shell node that has 4GiB RAM and 4GiB swap and current master arv-copy. While it's copying blocks, it eventually gets a MemoryError, and then the whole copy falls apart. Here's a transcript leading up to the badness, with ARVADOS_DEBUG enabled:

qr1hi-4zz18-c1t7uidtqs5sg82: 3904M / 8184M 47.7% 2015-07-31 18:48:02 arvados.keep[11112] DEBUG: c948eb1f7afcaf183552c531c08dc114+67108864+A…: ['https://keep.qr1hi.arvadosapi.com:443/']
2015-07-31 18:48:02 arvados.keep[11112] DEBUG: Request: GET https://keep.qr1hi.arvadosapi.com:443/c948eb1f7afcaf183552c531c08dc114+67108864+A…
2015-07-31 18:48:03 arvados.keep[11112] INFO: 200 response: 67108864 bytes in 1650.02799034 msec (38.787 MiB/sec)
2015-07-31 18:48:03 arvados.keep[11112] DEBUG: c948eb1f7afcaf183552c531c08dc114+67108864: ['http://keep0.xyzzy.arvadosapi.com:25107/', 'http://keep1.xyzzy.arvadosapi.com:25107/']
2015-07-31 18:48:03 arvados.keep[11112] DEBUG: KeepWriterThread <KeepWriterThread(Thread-123, started 139671556409088)> proceeding c948eb1f7afcaf183552c531c08dc114+67108864 http://keep1.xyzzy.arvadosapi.com:25107/
2015-07-31 18:48:03 arvados.keep[11112] DEBUG: Request: PUT http://keep1.xyzzy.arvadosapi.com:25107/c948eb1f7afcaf183552c531c08dc114
2015-07-31 18:48:03 arvados.keep[11112] DEBUG: KeepWriterThread <KeepWriterThread(Thread-124, started 139671564801792)> proceeding c948eb1f7afcaf183552c531c08dc114+67108864 http://keep0.xyzzy.arvadosapi.com:25107/
2015-07-31 18:48:03 arvados.keep[11112] DEBUG: Request: PUT http://keep0.xyzzy.arvadosapi.com:25107/c948eb1f7afcaf183552c531c08dc114
2015-07-31 18:48:05 arvados.keep[11112] DEBUG: KeepWriterThread <KeepWriterThread(Thread-123, started 139671556409088)> succeeded c948eb1f7afcaf183552c531c08dc114+67108864 http://keep1.xyzzy.arvadosapi.com:25107/
2015-07-31 18:48:05 arvados.keep[11112] DEBUG: KeepWriterThread <KeepWriterThread(Thread-124, started 139671564801792)> succeeded c948eb1f7afcaf183552c531c08dc114+67108864 http://keep0.xyzzy.arvadosapi.com:25107/
qr1hi-4zz18-c1t7uidtqs5sg82: 3968M / 8184M 48.5% 2015-07-31 18:48:05 arvados.keep[11112] DEBUG: 7c8de9368ec9ad742a3a27336a134dc5+67108864+A…: ['https://keep.qr1hi.arvadosapi.com:443/']
2015-07-31 18:48:05 arvados.keep[11112] DEBUG: Request: GET https://keep.qr1hi.arvadosapi.com:443/7c8de9368ec9ad742a3a27336a134dc5+67108864+A…
2015-07-31 18:48:07 arvados.keep[11112] INFO: 200 response: 67108864 bytes in 1569.37599182 msec (40.781 MiB/sec)
2015-07-31 18:48:07 arvados.keep[11112] DEBUG: 7c8de9368ec9ad742a3a27336a134dc5+67108864: ['http://keep1.xyzzy.arvadosapi.com:25107/', 'http://keep0.xyzzy.arvadosapi.com:25107/']
2015-07-31 18:48:07 arvados.keep[11112] DEBUG: KeepWriterThread <KeepWriterThread(Thread-125, started 139671564801792)> proceeding 7c8de9368ec9ad742a3a27336a134dc5+67108864 http://keep1.xyzzy.arvadosapi.com:25107/
2015-07-31 18:48:07 arvados.keep[11112] DEBUG: Request: PUT http://keep1.xyzzy.arvadosapi.com:25107/7c8de9368ec9ad742a3a27336a134dc5
2015-07-31 18:48:07 arvados.keep[11112] DEBUG: KeepWriterThread <KeepWriterThread(Thread-126, started 139671556409088)> proceeding 7c8de9368ec9ad742a3a27336a134dc5+67108864 http://keep0.xyzzy.arvadosapi.com:25107/
2015-07-31 18:48:07 arvados.keep[11112] DEBUG: Request: PUT http://keep0.xyzzy.arvadosapi.com:25107/7c8de9368ec9ad742a3a27336a134dc5
2015-07-31 18:48:09 arvados.keep[11112] DEBUG: KeepWriterThread <KeepWriterThread(Thread-125, started 139671564801792)> succeeded 7c8de9368ec9ad742a3a27336a134dc5+67108864 http://keep1.xyzzy.arvadosapi.com:25107/
2015-07-31 18:48:09 arvados.keep[11112] DEBUG: KeepWriterThread <KeepWriterThread(Thread-126, started 139671556409088)> succeeded 7c8de9368ec9ad742a3a27336a134dc5+67108864 http://keep0.xyzzy.arvadosapi.com:25107/
qr1hi-4zz18-c1t7uidtqs5sg82: 4032M / 8184M 49.3% 2015-07-31 18:48:09 arvados.keep[11112] DEBUG: 7ac2a965fe398a919e86392d67bce52b+67108864+A…: ['https://keep.qr1hi.arvadosapi.com:443/']
2015-07-31 18:48:09 arvados.keep[11112] DEBUG: Request: GET https://keep.qr1hi.arvadosapi.com:443/7ac2a965fe398a919e86392d67bce52b+67108864+A…
2015-07-31 18:48:10 arvados.keep[11112] INFO: 200 response: 67108864 bytes in 1575.77109337 msec (40.615 MiB/sec)
2015-07-31 18:48:10 arvados.keep[11112] DEBUG: 7ac2a965fe398a919e86392d67bce52b+67108864: ['http://keep0.xyzzy.arvadosapi.com:25107/', 'http://keep1.xyzzy.arvadosapi.com:25107/']
2015-07-31 18:48:10 arvados.keep[11112] DEBUG: KeepWriterThread <KeepWriterThread(Thread-127, started 139671556409088)> proceeding 7ac2a965fe398a919e86392d67bce52b+67108864 http://keep1.xyzzy.arvadosapi.com:25107/
2015-07-31 18:48:10 arvados.keep[11112] DEBUG: Request: PUT http://keep1.xyzzy.arvadosapi.com:25107/7ac2a965fe398a919e86392d67bce52b
2015-07-31 18:48:10 arvados.keep[11112] DEBUG: KeepWriterThread <KeepWriterThread(Thread-128, started 139671564801792)> proceeding 7ac2a965fe398a919e86392d67bce52b+67108864 http://keep0.xyzzy.arvadosapi.com:25107/
2015-07-31 18:48:10 arvados.keep[11112] DEBUG: Request: PUT http://keep0.xyzzy.arvadosapi.com:25107/7ac2a965fe398a919e86392d67bce52b
2015-07-31 18:48:12 arvados.keep[11112] DEBUG: KeepWriterThread <KeepWriterThread(Thread-128, started 139671564801792)> succeeded 7ac2a965fe398a919e86392d67bce52b+67108864 http://keep0.xyzzy.arvadosapi.com:25107/
2015-07-31 18:48:13 arvados.keep[11112] DEBUG: KeepWriterThread <KeepWriterThread(Thread-127, started 139671556409088)> succeeded 7ac2a965fe398a919e86392d67bce52b+67108864 http://keep1.xyzzy.arvadosapi.com:25107/
qr1hi-4zz18-c1t7uidtqs5sg82: 4096M / 8184M 50.0% 2015-07-31 18:48:13 arvados.keep[11112] DEBUG: 22e4cbeae1375cb0aea342ab68ddbfef+67108864+A…: ['https://keep.qr1hi.arvadosapi.com:443/']
2015-07-31 18:48:13 arvados.keep[11112] DEBUG: Request: GET https://keep.qr1hi.arvadosapi.com:443/22e4cbeae1375cb0aea342ab68ddbfef+67108864+A…
MemoryError: out of memory
2015-07-31 18:48:14 arvados.keep[11112] DEBUG: Request fail: GET https://keep.qr1hi.arvadosapi.com:443/22e4cbeae1375cb0aea342ab68ddbfef+67108864+A… => <class 'arvados.errors.HttpError'>:
2015-07-31 18:48:17 arvados.keep[11112] DEBUG: {u'qr1hi-bi6l4-ebw6id3jibfc3cm': OrderedDict([(u'href', u'/keep_services/qr1hi-bi6l4-ebw6id3jibfc3cm'), (u'kind', u'arvados#keepService'), (u'etag', u'dlywp1t0nne9dw3gds9tcyhs3'), (u'uuid', u'qr1hi-bi6l4-ebw6id3jibfc3cm'), (u'owner_uuid', u'qr1hi-tpzed-327pde5hktairf9'), (u'created_at', u'2014-07-08T02:19:44.015581000Z'), (u'modified_by_client_uuid', u'qr1hi-ozdt8-obw7foaks3qjyej'), (u'modified_by_user_uuid', u'qr1hi-tpzed-327pde5hktairf9'), (u'modified_at', u'2014-07-08T02:20:11.712252000Z'), (u'service_host', u'keep.qr1hi.arvadosapi.com'), (u'service_port', 443), (u'service_ssl_flag', True), (u'service_type', u'proxy'), (u'read_only', False), ('_service_root', 'https://keep.qr1hi.arvadosapi.com:443/')])}
2015-07-31 18:48:17 arvados.keep[11112] DEBUG: [OrderedDict([(u'href', u'/keep_services/qr1hi-bi6l4-ebw6id3jibfc3cm'), (u'kind', u'arvados#keepService'), (u'etag', u'dlywp1t0nne9dw3gds9tcyhs3'), (u'uuid', u'qr1hi-bi6l4-ebw6id3jibfc3cm'), (u'owner_uuid', u'qr1hi-tpzed-327pde5hktairf9'), (u'created_at', u'2014-07-08T02:19:44.015581000Z'), (u'modified_by_client_uuid', u'qr1hi-ozdt8-obw7foaks3qjyej'), (u'modified_by_user_uuid', u'qr1hi-tpzed-327pde5hktairf9'), (u'modified_at', u'2014-07-08T02:20:11.712252000Z'), (u'service_host', u'keep.qr1hi.arvadosapi.com'), (u'service_port', 443), (u'service_ssl_flag', True), (u'service_type', u'proxy'), (u'read_only', False), ('_service_root', 'https://keep.qr1hi.arvadosapi.com:443/')])]
2015-07-31 18:48:17 arvados.keep[11112] DEBUG: 22e4cbeae1375cb0aea342ab68ddbfef+67108864+A…: ['https://keep.qr1hi.arvadosapi.com:443/']
2015-07-31 18:48:17 arvados.keep[11112] INFO: No more Keep services to try; giving up
Traceback (most recent call last):
  File "/home/brett/pyvenv/bin/arv-copy", line 4, in <module>
    __import__('pkg_resources').run_script('arvados-python-client==0.1', 'arv-copy')
  File "/home/brett/pyvenv/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 735, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/home/brett/pyvenv/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1652, in run_script
    exec(code, namespace, namespace)
  File "/home/brett/pyvenv/lib/python2.7/site-packages/arvados_python_client-0.1-py2.7.egg/EGG-INFO/scripts/arv-copy", line 4, in <module>
    main()
  File "/home/brett/pyvenv/local/lib/python2.7/site-packages/arvados_python_client-0.1-py2.7.egg/arvados/commands/arv_copy.py", line 119, in main
    args)
  File "/home/brett/pyvenv/local/lib/python2.7/site-packages/arvados_python_client-0.1-py2.7.egg/arvados/commands/arv_copy.py", line 563, in copy_collection
    data = src_keep.get(word)
  File "/home/brett/pyvenv/local/lib/python2.7/site-packages/arvados_python_client-0.1-py2.7.egg/arvados/retry.py", line 154, in num_retries_setter
    return orig_func(self, *args, **kwargs)
  File "/home/brett/pyvenv/local/lib/python2.7/site-packages/arvados_python_client-0.1-py2.7.egg/arvados/keep.py", line 907, in get
    "failed to read {}".format(loc_s), service_errors, label="service")
arvados.errors.KeepReadError: failed to read 22e4cbeae1375cb0aea342ab68ddbfef+67108864+A…: service https://keep.qr1hi.arvadosapi.com:443/ responded with 0 (23, 'Failed writing body (0 != 16384)')

I note that the KeepWriterThread ids keep going up monotonically. Is it possible they're not being released and eventually garbage collected?


Subtasks 4 (0 open4 closed)

Task #6908: Plug leakResolvedTom Clegg07/31/2015Actions
Task #6905: InvestigateResolvedTom Clegg07/31/2015Actions
Task #6906: Reproduce leakage in a test caseClosedTom Clegg07/31/2015Actions
Task #6903: Review 6844-py-mem-leakResolvedRadhika Chippada07/31/2015Actions
Actions #1

Updated by Peter Amstutz over 8 years ago

Check the keepproxy logs. This is a case where the real error is hidden from view.

Actions #2

Updated by Brett Smith over 8 years ago

Peter Amstutz wrote:

Check the keepproxy logs. This is a case where the real error is hidden from view.

Everything's fine there. The "MemoryError: out of memory" line earlier in the transcript suggests that this a purely client-side OOM condition.

Actions #3

Updated by Brett Smith over 8 years ago

I spent an hour digging into this and didn't immediately find the source. I spent some time scrutinizing our PyCurl client pool, but everything seems on the up-and-up there. And the writer threads etc. should be garbage collectible after put() returns.

I wonder if the issue is with our use of PyCurl itself? Should we be closing or releasing objects that we're currently not?

You might have to do some instrumentation to check the refcounts of live objects to help track this down.

Actions #4

Updated by Tom Clegg over 8 years ago

  • Assigned To set to Tom Clegg
Actions #5

Updated by Tom Clegg over 8 years ago

  • Story points set to 1.0
Actions #6

Updated by Tom Clegg over 8 years ago

6844-py-mem-leak @ f809883

The fix is in 0988acb

Bad news: lost a feature.
  • The memory leak was @retry_method's fault. Evidently inspect.getcallargs causes reference cycles (or something like that) which is bad if an arg is often, say, a 64 MiB string. I tried various things with "del" to no avail, eventually decided that the ability to accept num_retries as a positional argument isn't important enough to worry about too much. Found a total of 3 such uses, including the test case for that specific feature.
Good news: some other fixes, for things noticed along the way
  • Clean DB after each websocket test -- and make "clean DB" clean the humans table too (even though we have no fixtures there). Now you can leave an API server up and run Python SDK tests against it repeatedly, without "websocket" leaving stuff in the DB that breaks the "empty list" test.
  • Remove arv-copy code that preserves the "missing trailing newline" manifest format error when copying a collection.
  • Remove arv-copy code that uselessly stores an unsigned manifest in Keep.
  • Fix overly broad exception-catching in arv-copy
  • Cleanup sdk/python websocket tests (faster, test more things)
Actions #7

Updated by Tom Clegg over 8 years ago

  • Status changed from New to In Progress
Actions #8

Updated by Tom Clegg over 8 years ago

I'd still like to see about making a memory-leak test case for this, but I don't think it should block merging the fix -- and maybe it shouldn't even block closing the issue, if it turns out to be hard to test reliably.

Actions #9

Updated by Radhika Chippada over 8 years ago

LGTM

Actions #10

Updated by Tom Clegg over 8 years ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados|commit:2a96c097e5a176018d078a5d6985403072e8672e.

Actions #11

Updated by Brett Smith over 8 years ago

FWIW I was able to use the latest version to copy the original collection. Thanks.

Actions

Also available in: Atom PDF