Bug #11002

[arv-put] crash in arvfile on upload NoneType object has no attribute 'closed'

Added by Tom Morris 3 months ago. Updated about 1 month ago.

Status:ResolvedStart date:01/28/2017
Priority:NormalDue date:
Assignee:Lucas Di Pentima% Done:


Target version:2017-03-01 sprint
Story points-Remaining (hours)0.00 hour
Velocity based estimate-


8995M / 387184M 2.3% Traceback (most recent call last):
File "/home/tfmorris/venv/bin/arv-put", line 4, in <module>
File "/home/tfmorris/venv/local/lib/python2.7/site-packages/arvados/commands/put.py", line 906, in main
writer.start(save_collection=not(args.stream or args.raw))
File "/home/tfmorris/venv/local/lib/python2.7/site-packages/arvados/commands/put.py", line 454, in start
File "/home/tfmorris/venv/local/lib/python2.7/site-packages/arvados/arvfile.py", line 240, in synchronized_wrapper
return orig_func(self, *args, **kwargs)
File "/home/tfmorris/venv/local/lib/python2.7/site-packages/arvados/collection.py", line 934, in manifest_text
File "/home/tfmorris/venv/local/lib/python2.7/site-packages/arvados/arvfile.py", line 681, in commit_all
self.repack_small_blocks(force=True, sync=True)
File "/home/tfmorris/venv/local/lib/python2.7/site-packages/arvados/arvfile.py", line 240, in synchronized_wrapper
return orig_func(self, *args, **kwargs)
File "/home/tfmorris/venv/local/lib/python2.7/site-packages/arvados/arvfile.py", line 574, in repack_small_blocks
small_blocks = [b for b in self._bufferblocks.values() if b.state() == _BufferBlock.WRITABLE and b.owner.closed()]
AttributeError: 'NoneType' object has no attribute 'closed'
Exception in thread Thread-3 (most likely raised during interpreter shutdown):
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
File "/usr/lib/python2.7/threading.py", line 763, in run
File "/home/tfmorris/venv/local/lib/python2.7/site-packages/arvados/arvfile.py", line 481, in _commit_bufferblock_worker
File "/home/tfmorris/venv/local/lib/python2.7/site-packages/arvados/retry.py", line 158, in num_retries_setter
File "/home/tfmorris/venv/local/lib/python2.7/site-packages/arvados/keep.py", line 1069, in put
<type 'exceptions.TypeError'>: 'NoneType' object is not callable


Task #11041: Review 11002-arvput-crash-fixResolvedPeter Amstutz

Associated revisions

Revision 9889021d
Added by Lucas Di Pentima about 1 month ago

Merge branch '11002-arvput-crash-fix'
Closes #11002


#1 Updated by Lucas Di Pentima 3 months ago

  • Assignee set to Lucas Di Pentima

#2 Updated by Lucas Di Pentima 2 months ago

  • Status changed from New to In Progress

#3 Updated by Lucas Di Pentima 2 months ago

  • Target version changed from 2017-02-15 sprint to 2017-03-01 sprint

#4 Updated by Lucas Di Pentima 2 months ago

Branch 11002-arvput-crash-fix at e2ea7599812d12dd027930069ee92e8816ee9de8
Test run: https://ci.curoverse.com/job/developer-run-tests/172/

When a block containing small files is being assembled, if the process is interrupted before the block is committed to Keep, arv-put will try to save its state before quitting but the cached collection's BlockManager will be in an inconsistent state (will still have the block in writable state, and without an owner because there are multiple files inside it) and will produce the stack trace described.
We now check for this case when trying to save the last checkpoint before quitting, and if it happens, print a warning message to the user and close the cache file.

#5 Updated by Peter Amstutz 2 months ago

So if I'm following along, the problem is that it is in repack_small_blocks() which creates a new, un-owned bufferblock into which it copies the other block data. If that gets interrupted, it escapes the method before calling _delete_bufferblock() so there is still an un-owned bufferblock. When it tries to flush again later, it runs into the un-owned bufferblock and fails because it isn't expecting it.

However the real problem is that KeyboardInterrupt can be thrown anywhere any time, which can leave the data structures in an inconsistent state, of which this bug is just one example. I am suspicious that trying to flush after catching ^C could also result in a silently corrupted checkpoint file. I think that when ^C (KeyboardInterrupt) is raised, it should not try to checkpoint.

Alternately, you could refactor it so that the upload happens on a different thread which doesn't get interrupted, and the main thread just sits in an idle loop. If the main thread gets a KeyboardInterrupt, it would then signal the uploader thread to perform a graceful shutdown.

#6 Updated by Lucas Di Pentima 2 months ago

Updates: a83e4de
Test run: https://ci.curoverse.com/job/developer-run-tests/175/

Changed so when catching a KeyboardInterrupt exception, stops the checkpointing thread and closes the cache file without trying to do the last update.

#7 Updated by Peter Amstutz 2 months ago

From our conversations on IRC, I think KeyboardInterrupt should skip checkpointing but shouldn't log anything. However, there should also be a default exception handler which which logs the exception and also skips checkpointing.

#8 Updated by Lucas Di Pentima 2 months ago

Updates at d2a202e

Added a specific exception class for this kind of errors, so it's not confused with other AttributeError bugs that could happen in the future.
Now state saving is not done when quitting from any kind of exception, reporting the stack trace to the logger, and not reporting anything when quitting from a KeyboardInterrupt (SystemExit(-2) in this case) or SIGINT.
Updated some tests that were relying on the previous behaviour.

#9 Updated by Peter Amstutz 2 months ago

From IRC: make a note around line L454 that because we catch signals, we expect a SystemExit, not KeyboardInterrupt.

#10 Updated by Peter Amstutz 2 months ago

Rest LGTM.

#11 Updated by Lucas Di Pentima about 1 month ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados|commit:9889021db676c152d5b62f399bd27024b5a51ae1.

Also available in: Atom PDF