Bug #17118

Possible arv-put hang when writing collection

Added by Peter Amstutz 11 months ago. Updated 8 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
12/23/2020
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
1.0
Release relationship:
Auto

Description

Reported by user that arv-put would upload a directory of files, and then sometimes hang before writing the collection. However, the checkpoint file was written, so canceling the process and re-running arv-put would create the collection without waiting for a re-upload.

Inspect the code and see if there are any places that seems vulnerable to a deadlock.

Here's the follow-up (https://support.curii.com/rt/Ticket/Display.html?id=119)

I would like to report a possible bug/improvement for the arv-put
command. We ran into some issues when using arv-put where it would die
silently without giving any output whatsoever. We have now traced it
to the fact that the arv-put cmd essentially runs out of memory (or
uses a huge amount of memory).

The setup:
1. A folder containing a number of files (< 1500) with a total folder
size of 145GB. This entire folder is to be uploaded into Arvados.
2. We run it via Gitlab as a Runner on a Virtual Machine with 16GB of RAM.
3. The arv-put cmd we use:
arv-put --no-follow-links --no-resume --exclude 'Thumbnail_Images/*'
--exclude done.txt --project-uuid arkau-j7d0g-6a3em925c3yvx9q --name
Overnight1 /isilon/nrd_hca/Overnight1/

Output:
1. The script silently dies, no error message, no other output.

We have done extensive testing and checking and initially, the arv-put
cmd just died silently without giving any error message whatsoever.
After some digging, it turns out that arv-put cmd essentially eats up
all the memory on the machine and is then killed. We tried to change
it so that arrv-put can only use 1 thread but the outcome is the same.
See the attached images for the output from 'top' when trying to
upload the 145GB folder. We have plans in the future to upload folders
with around 750GB of data and if arv-put cannot handle this or needs a
huge amount of memory to do this, we will need to reconsider our
workflows.

We have a couple of questions:
1. What is the relationship between the size of the folder to be
uploaded and the amount of memory arv-put will use?
2. Is there a way to estimate how much memory would be needed for a
certain folder/size of data?
3. Is there a way to make arv-put fail gracefully in cases like this?
4. If known, what is the reason that arv-put uses so much memory?


Subtasks

Task #17145: Review 17118-arvput-fixResolvedPeter Amstutz

Associated revisions

Revision b8679688
Added by Lucas Di Pentima 10 months ago

Merge branch '17118-arvput-fix'
Closes #17118

Arvados-DCO-1.1-Signed-off-by: Lucas Di Pentima <>

History

#1 Updated by Peter Amstutz 11 months ago

  • Story points set to 1.0

#2 Updated by Lucas Di Pentima 11 months ago

  • Assigned To set to Lucas Di Pentima

#3 Updated by Peter Amstutz 11 months ago

  • Description updated (diff)

#4 Updated by Peter Amstutz 11 months ago

  • Description updated (diff)

#5 Updated by Peter Amstutz 11 months ago

  • Description updated (diff)

#6 Updated by Lucas Di Pentima 11 months ago

Upon arv-put debug logs inspection, I've seen that 228 blocks failed to be sent to one of the keepstores (all blocks sent to that keepstore failed) so they were sent to an alternative, and if we assume that all blocks are 64MB, the total gives us ~14.6GB of data... almost the amount of RAM the VM has.

I'll be reviewing the PySDK keep client code to see if I can detect any memory leak situation, and also try to set up a simulated environment to reproduce the problem.

#7 Updated by Lucas Di Pentima 11 months ago

Ok, so I've been able to reproduce the issue using arvbox.

Indeed it seems the problem happens when arv-put (or the PySDK keep client) talks directly to keepstore servers, and one of the keepstores is either down or responding a 500 error. The RAM usage starts to creep up until the process gets killed.

I did a similar test with all the keepstores working and the RAM usage keeps below 500 MB using --threads 1.

#8 Updated by Lucas Di Pentima 11 months ago

  • Status changed from New to In Progress

#9 Updated by Lucas Di Pentima 11 months ago

  • Target version changed from 2020-12-02 Sprint to 2020-12-16 Sprint

#10 Updated by Lucas Di Pentima 10 months ago

  • Target version changed from 2020-12-16 Sprint to 2021-01-06 Sprint

#11 Updated by Lucas Di Pentima 10 months ago

After some time doing different tests, I got to a fix that I'm yet not sure why it makes a difference:

diff --git a/sdk/python/arvados/keep.py b/sdk/python/arvados/keep.py
index bc43b849c..bd0e5dc1e 100644
--- a/sdk/python/arvados/keep.py
+++ b/sdk/python/arvados/keep.py
@@ -648,7 +648,7 @@ class KeepClient(object):

     class KeepWriterThread(threading.Thread):
-        TaskFailed = RuntimeError()
+        class TaskFailed(RuntimeError): pass

         def __init__(self, queue, data, data_hash, timeout=None):
             super(KeepClient.KeepWriterThread, self).__init__()
@@ -667,7 +667,7 @@ class KeepClient(object):
                 try:
                     locator, copies = self.do_task(service, service_root)
                 except Exception as e:
-                    if e is not self.TaskFailed:
+                    if not isinstance(e, self.TaskFailed):
                         _logger.exception("Exception in KeepWriterThread")
                     self.queue.write_fail(service)
                 else:
@@ -687,7 +687,7 @@ class KeepClient(object):
                                   self.data_hash,
                                   result['status_code'],
                                   result['body'])
-                raise self.TaskFailed
+                raise self.TaskFailed()

             _logger.debug("KeepWriterThread %s succeeded %s+%i %s",
                           str(threading.current_thread()),

The way I was able to reproduce the bug is by using arvbox on a RAM restricted VM (eg: 1.5 GB), shutting down one of the keepstores, and then by accessing to arvbox's shell, used arv-put to upload a dir with many GBs (supposedly with 2X the VM's RAM is enough). Using arv-put from arvbox's shell is important because this happens when accessing to keepstores directly and one of them fails.

#12 Updated by Lucas Di Pentima 10 months ago

Updates at 0ea8f67a0 - branch 17118-arvput-fix
Test run: https://ci.arvados.org/job/developer-run-tests/2236/
WB integration test re-run: https://ci.arvados.org/job/developer-run-tests-apps-workbench-integration/2368/

  • Changes the way custom exception raising is done on KeepWriterThread, as described on the note above.

Following what's described on https://cosmicpercolator.com/2016/01/13/exception-leaks-in-python-2-and-3/, I tried to clear the e.__traceback__ object on some places that I saw the exceptions being assigned to a local scoped variable, but didn't see any meaningful memory usage difference: when all keepstores are working, arv-put uses around 560 MB of resident memory, and when one of the keepstore is down, it uses around 800 MB.

#13 Updated by Anonymous 10 months ago

  • % Done changed from 0 to 100
  • Status changed from In Progress to Resolved

#14 Updated by Peter Amstutz 8 months ago

  • Release set to 37

Also available in: Atom PDF