Possible arv-put hang when writing collection
A user reported that arv-put would upload a directory of files and then sometimes hang before writing the collection. However, the checkpoint file had already been written, so canceling the process and re-running arv-put would create the collection without waiting for a re-upload.
Inspect the code and see if there are any places that seem vulnerable to a deadlock.
Here's the follow-up ticket (https://support.curii.com/rt/Ticket/Display.html?id=119):
I would like to report a possible bug/improvement for the arv-put
command. We ran into some issues when using arv-put where it would die
silently without giving any output whatsoever. We have now traced it
to the fact that the arv-put command essentially runs out of memory (or
uses a huge amount of memory).
1. A folder containing a number of files (< 1500) with a total folder
size of 145GB. This entire folder is to be uploaded into Arvados.
2. We run it via Gitlab as a Runner on a Virtual Machine with 16GB of RAM.
3. The arv-put command we use:
arv-put --no-follow-links --no-resume --exclude 'Thumbnail_Images/*'
--exclude done.txt --project-uuid arkau-j7d0g-6a3em925c3yvx9q --name
1. The script silently dies, no error message, no other output.
We have done extensive testing and checking; initially, the arv-put
command just died silently without giving any error message whatsoever.
After some digging, it turns out that arv-put essentially eats up
all the memory on the machine and is then killed. We tried restricting
arv-put to a single thread, but the outcome is the same.
See the attached images for the output from 'top' when trying to
upload the 145GB folder. We have plans in the future to upload folders
with around 750GB of data, and if arv-put cannot handle this or needs a
huge amount of memory to do this, we will need to reconsider our approach.
We have a couple of questions:
1. What is the relationship between the size of the folder to be
uploaded and the amount of memory arv-put will use?
2. Is there a way to estimate how much memory would be needed for a
certain folder/size of data?
3. Is there a way to make arv-put fail gracefully in cases like this?
4. If known, what is the reason that arv-put uses so much memory?
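On question 3, one generic way (not specific to arv-put) to make a Python process fail with a catchable MemoryError instead of being silently OOM-killed is to cap its own address space with `resource.setrlimit`. A minimal sketch, assuming a Linux host; the 2 GB limit is an arbitrary illustration value:

```python
import resource

# Assumption: Linux. Cap this process's virtual address space at 2 GB so
# oversized allocations raise MemoryError instead of invoking the OOM killer.
limit = 2 * 1024**3
resource.setrlimit(resource.RLIMIT_AS, (limit, limit))

try:
    buf = bytearray(4 * 1024**3)  # try to allocate 4 GB, beyond the cap
    msg = "allocation succeeded"
except MemoryError:
    msg = "allocation refused gracefully"

print(msg)
```

A wrapper script could set such a limit before invoking the upload, turning a silent kill into an ordinary Python exception that can be logged.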
#6 Updated by Lucas Di Pentima 5 months ago
Inspecting the arv-put debug logs, I've seen that 228 blocks failed to be sent to one of the keepstores (all blocks sent to that keepstore failed), so they were sent to an alternative. If we assume that all blocks are 64 MB, the total gives us ~14.6 GB of data... almost the amount of RAM the VM has.
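The back-of-envelope arithmetic above checks out, under the stated assumption that every failed block is a full 64 MB Keep block:

```python
# Rough check of the failed-block total mentioned above.
failed_blocks = 228
block_mb = 64  # assumption: every failed block is a full 64 MB Keep block
total_gb = failed_blocks * block_mb / 1000  # decimal GB
print(round(total_gb, 1))
```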
I'll be reviewing the PySDK keep client code to see if I can detect any memory leak situation, and also try to set up a simulated environment to reproduce the problem.
#7 Updated by Lucas Di Pentima 5 months ago
Ok, so I've been able to reproduce the issue using arvbox.
Indeed, it seems the problem happens when arv-put (or the PySDK keep client) talks directly to the keepstore servers and one of the keepstores is either down or responding with a 500 error. The RAM usage creeps up until the process gets killed.
I did a similar test with all the keepstores working, and the RAM usage stays below 500 MB.
#11 Updated by Lucas Di Pentima 4 months ago
After some time doing different tests, I arrived at a fix, though I'm not yet sure why it makes a difference:
diff --git a/sdk/python/arvados/keep.py b/sdk/python/arvados/keep.py
index bc43b849c..bd0e5dc1e 100644
--- a/sdk/python/arvados/keep.py
+++ b/sdk/python/arvados/keep.py
@@ -648,7 +648,7 @@ class KeepClient(object):

     class KeepWriterThread(threading.Thread):
-        TaskFailed = RuntimeError()
+        class TaskFailed(RuntimeError): pass

         def __init__(self, queue, data, data_hash, timeout=None):
             super(KeepClient.KeepWriterThread, self).__init__()
@@ -667,7 +667,7 @@ class KeepClient(object):
                 try:
                     locator, copies = self.do_task(service, service_root)
                 except Exception as e:
-                    if e is not self.TaskFailed:
+                    if not isinstance(e, self.TaskFailed):
                         _logger.exception("Exception in KeepWriterThread")
                         self.queue.write_fail(service)
                 else:
@@ -687,7 +687,7 @@ class KeepClient(object):
                                  self.data_hash,
                                  result['status_code'],
                                  result['body'])
-                    raise self.TaskFailed
+                    raise self.TaskFailed()

                 _logger.debug("KeepWriterThread %s succeeded %s+%i %s",
                               str(threading.current_thread()),
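A plausible reason the one-line change matters: the old code raised a single shared RuntimeError *instance*, and in Python 3 every raise attaches the current traceback to the raised object, so the class-level singleton keeps the most recent failure's frames (and any data they reference, such as block buffers) alive for the life of the process. A minimal illustration of the mechanism, with hypothetical names (not copied from keep.py):

```python
class Writer:
    # Old pattern: one shared exception *instance* reused for every failure.
    TaskFailedSingleton = RuntimeError("task failed")

    # New pattern: a dedicated exception *class*; each raise builds a fresh
    # instance that can be garbage collected once it has been handled.
    class TaskFailed(RuntimeError):
        pass

def fail_with_singleton():
    raise Writer.TaskFailedSingleton

def fail_with_class():
    raise Writer.TaskFailed()

try:
    fail_with_singleton()
except RuntimeError:
    pass

# The shared instance still references the traceback (and its frames) from
# the raise above, long after the except block has finished.
singleton_tb = Writer.TaskFailedSingleton.__traceback__
print(singleton_tb is not None)

try:
    fail_with_class()
except Writer.TaskFailed:
    pass
# Here the fresh instance and its traceback were dropped when the except
# block ended; nothing long-lived pins them.
```

Note also that once TaskFailed is a class, the identity check `e is not self.TaskFailed` can never match a raised instance, which is why the diff switches to `isinstance`.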
The way I was able to reproduce the bug: run arvbox on a RAM-restricted VM (e.g. 1.5 GB), shut down one of the keepstores, and then from arvbox's shell use
arv-put to upload a directory of many GBs (supposedly 2X the VM's RAM is enough). Running
arv-put from arvbox's shell is important because the problem happens when accessing the keepstores directly and one of them fails.
#12 Updated by Lucas Di Pentima 4 months ago
Updates at 0ea8f67a0 - branch
Test run: https://ci.arvados.org/job/developer-run-tests/2236/
WB integration test re-run: https://ci.arvados.org/job/developer-run-tests-apps-workbench-integration/2368/
- Changes the way custom exception raising is done in
KeepWriterThread, as described in the note above.
Following what's described in https://cosmicpercolator.com/2016/01/13/exception-leaks-in-python-2-and-3/, I tried clearing the
e.__traceback__ object in some places where I saw exceptions being assigned to a locally scoped variable, but didn't see any meaningful difference in memory usage: when all keepstores are working,
arv-put uses around 560 MB of resident memory, and when one of the keepstores is down, it uses around 800 MB.
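The leak mechanism that blog post describes can be demonstrated directly: an exception object stored outside its except block pins its traceback, whose frames in turn pin their locals. A small sketch with hypothetical names (not from the SDK), using a weak reference to observe when the "block" actually becomes collectable:

```python
import gc
import weakref

class Block:
    """Stand-in for a large Keep data block."""
    def __init__(self, size):
        self.data = bytearray(size)

def upload(block):
    # Simulated keepstore failure.
    raise ValueError("upload failed")

block = Block(1024)
ref = weakref.ref(block)

saved = None
try:
    upload(block)
except ValueError as e:
    saved = e  # storing the exception keeps its traceback alive

del block
gc.collect()
# upload()'s frame, reachable via saved.__traceback__, still holds `block`.
pinned = ref() is not None
print(pinned)

saved.__traceback__ = None  # the workaround tried in the note above
gc.collect()
released = ref() is None
print(released)
```

This shows why clearing `__traceback__` *can* help when an exception outlives its handler; that it made no measurable difference here suggests the retained exceptions were not the dominant consumers of memory in this code path.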