Bug #7225

[SDKs] Script hangs on exit after writing a Collection file that spans multiple Keep blocks

Added by Brett Smith over 5 years ago. Updated over 5 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
SDKs
Target version:
Start date:
09/07/2015
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
2.0

Description

qr1hi-8i9sb-h5yt6xmpk8u6dps is a pretty typical BWA aligner job. The aligner apparently ran fine, but then run-command got stuck uploading the data. These lines were are the last interesting ones that appear in the log:

2015-09-04_15:16:52 qr1hi-8i9sb-h5yt6xmpk8u6dps 4106 0 stderr run-command: /keep/39c6f22d40001074f4200a72559ae7eb+5745/bwa completed with exit code 0 (success)
2015-09-04_15:16:52 qr1hi-8i9sb-h5yt6xmpk8u6dps 4106 0 stderr run-command: the following output files will be saved to keep:
2015-09-04_15:16:52 qr1hi-8i9sb-h5yt6xmpk8u6dps 4106 0 stderr run-command: 1455988972 ./[filename].sai
2015-09-04_15:16:52 qr1hi-8i9sb-h5yt6xmpk8u6dps 4106 0 stderr run-command: start writing output to keep

After that, run-command was never heard from again. When I checked on the compute node, the run-command process was still alive, but not doing anything. strace reported it was stuck in a futex call.

The last two lines in run-command update the task with success and output information, and exit. The API server logs show that it received and handled the task update with no problem, shortly after those last lines in the log, implying run-command got stuck somewhere between sending the request and exiting.

If the fix for this requires users to make specific API calls, Tom should sign off on those requirements as architect, and the requirements should be clearly documented.

The branch that fixes this is expected to include a test for the unsigned locator race condition.

7225.py (488 Bytes) 7225.py Brett Smith, 09/07/2015 01:53 PM

Subtasks

Task #7346: Review 7225-collection-hangResolvedPeter Amstutz


Related issues

Related to Arvados - Bug #5496: [SDKs] PySDK test_rewrite_on_empty_file is not reliableResolved03/18/2015

Associated revisions

Revision b87fa036 (diff)
Added by Brett Smith over 5 years ago

7225: ArvadosFile.flush() commits all underlying blocks.

Fix a typo that caused only the last block to be flushed.
Refs #7225.

Revision 009c08b4
Added by Peter Amstutz over 5 years ago

Merge branch '7225-collection-hang' closes #7225

History

#1 Updated by Brett Smith over 5 years ago

Seeing this again on qr1hi-8i9sb-ozbzy8do1i217gq. I went into the Docker container to look around. The job's output directory has the .sai file as a single regular file. Simple Unix tools can read it just fine.

I killed the run-command process. Note this in the logs:

2015-09-07_10:40:50 qr1hi-8i9sb-ozbzy8do1i217gq 21483 0 child 21724 on compute0.1 exit 137 success=true

If success=true, that strongly suggests that the task update call at the bottom of run-command succeeded, and we're getting stuck somewhere between sending that and exiting.

I also noticed the output manifest in the log includes unsigned locators, which seems surprising.

#2 Updated by Brett Smith over 5 years ago

  • Subject changed from [SDKs] run-command deadlocked writing output to [SDKs] run-command deadlocked during exit
  • Description updated (diff)

Brett Smith wrote:

If success=true, that strongly suggests that the task update call at the bottom of run-command succeeded, and we're getting stuck somewhere between sending that and exiting.

Confirmed this. I found run-command's task update in the API server logs. The timestamp corresponds to when the aligner finished.

Right now there seem to be two problems: this, and the issue that the task's output includes unsigned locators. Maybe these are related? Maybe a deadlock is preventing blocks from being put, or the resulting signed locators from being saved; as well as preventing the process from exiting?

#3 Updated by Brett Smith over 5 years ago

  • Description updated (diff)

Revering the description change that I intended to discard because I decided it was premature.

#4 Updated by Brett Smith over 5 years ago

Brett Smith wrote:

Right now there seem to be two problems: this, and the issue that the task's output includes unsigned locators. Maybe these are related? Maybe a deadlock is preventing blocks from being put, or the resulting signed locators from being saved; as well as preventing the process from exiting?

Probably worth noting: both the original task named in the description and the retry in note-1 have a similar set of unsigned locators in their manifests. The unsigned locators occur at the end of the manifest except the last locator is signed. The original task has three locators unsigned this way. The rerun has two.

#5 Updated by Brett Smith over 5 years ago

  • Description updated (diff)

#6 Updated by Brett Smith over 5 years ago

Its log message made me think d2d7138c84a55ef87937cdaefd9c58a66916d76f might have something to do with the unsigned locators. But it apparently doesn't. qr1hi-8i9sb-juyvbsw26drapgd is the same job run with a prior arvados_sdk_version (1b8caff3ad598744e4a0379b01fc95ca4838caa0), and it generated the exact same output as qr1hi-8i9sb-ozbzy8do1i217gq.

#7 Updated by Brett Smith over 5 years ago

I used my admin powers to create the output collection with the unsigned manifests. I can get the data back out. At one point or another, the data has all been uploaded, so it doesn't seem like the problem is trouble writing to Keep.

#8 Updated by Brett Smith over 5 years ago

  • File 7225.py 7225.py added
  • Subject changed from [SDKs] run-command deadlocked during exit to [SDKs] Writing files to a Collection can deadlock

The attached script reproduces the issue when run on the task's output file. It mirrors the logic in run-command and the associated vwd module, while taking out most of the conditionals that don't apply in this case. The Python process writes out the manifest text, with some missing signatures, then gets stuck in a futex call instead of exiting properly.

#9 Updated by Brett Smith over 5 years ago

  • Status changed from New to In Progress

#10 Updated by Brett Smith over 5 years ago

It looks like the bug manifests when writing a file that spans more than one block. The script will exit normally when writing many files, as long as they're all <= 64MiB. But as soon as you run it on a file that's a single byte over the block size, it hangs.

The unsigned locator pattern is kind of fascinating. If all the blocks are exactly 64MiB, all the locators will be signed. If the last block is < 64MiB, then at least two locators before the last will be unsigned, and maybe more—the exact number of unsigned locators varies between runs.

#11 Updated by Brett Smith over 5 years ago

  • Subject changed from [SDKs] Writing files to a Collection can deadlock to [SDKs] Script hangs on exit after writing a Collection file that spans multiple Keep blocks

#12 Updated by Brett Smith over 5 years ago

Adding out_coll.stop_threads() to the end of 7225.py ends the deadlocks, but the unsigned locators remain.

#13 Updated by Brett Smith over 5 years ago

I found the bug. In ArvadosFile.flush():

           for s in self._segments:
                bb = self.parent._my_block_manager().get_bufferblock(s.locator)
                if bb:
                    if bb.state() != _BufferBlock.COMMITTED:
                        self.parent._my_block_manager().commit_bufferblock(self._current_bblock, sync=True)

The first argument to commit_bufferblock() should be bb. As it's currently written, we may not wait for blocks other than the last to finish. Calling locator() on any block that hasn't finished committing will return an unsigned locator.

Need to figure out how to test this…

#14 Updated by Brett Smith over 5 years ago

  • Target version changed from Arvados Future Sprints to 2015-09-30 sprint

#15 Updated by Brett Smith over 5 years ago

  • Priority changed from Normal to High
  • Story points set to 1.0

#16 Updated by Brett Smith over 5 years ago

  • Status changed from In Progress to New

#17 Updated by Brett Smith over 5 years ago

The change in b87fa036 is almost certainly correct, but fixes neither problem. Running with this version of the SDK, 7225.py still writes a manifest with unsigned locators, and still hangs after that.

For what it's worth, I found that bug by playing with 7225.py in pdb. I started by stepping through the manifest_text() call. That showed me that when commit_all() ran, there were no blocks to commit, but there were still unsigned locators in the manifest. That led me to figure that what I really wanted to step through was the out_file.close() call that happens thanks to the context manager. Sure enough, when I stepped through that, all the locators were signed, which made me suspect a race condition in flush() (the only interesting thing that happens in close()), or some method it calls. I stopped when I found the above typo, but it looks like we may need to dig a little deeper.

#18 Updated by Brett Smith over 5 years ago

  • Story points deleted (1.0)

#19 Updated by Brett Smith over 5 years ago

  • Story points set to 2.0

#20 Updated by Brett Smith over 5 years ago

  • Assigned To set to Peter Amstutz

#21 Updated by Brett Smith over 5 years ago

  • Description updated (diff)

A user writing their own Crunch script tripped on this. As a temporary workaround, I suggested they sleep(10) before closing each file, and call stop_threads() on each Collection before the end of the script. The user confirmed that those workarounds allowed the job to finish successfully where it previously hung.

#22 Updated by Tom Clegg over 5 years ago

7225-collection-hang @ 92ce25e

Re typo fix: When we catch StateChangeError in commit_bufferblock, we should re-raise the exception if it doesn't match what we expect. AFAICT the only time we want to swallow the exception is "invalid state change from PENDING to PENDING", is that right?

Re stop_threads(): I don't quite follow why this helps. First of all: What is the rationale for setting thread.daemon = False in the first place? Why keep the Python process alive finishing a write even though there's no way for the resulting locator to make it into a collection? More importantly, note-17 above makes it sound like there can still be unsigned locators at the end of the [previous version of] commit_all(), i.e., after we check for ERROR state. So I worry that calling stop_threads() here merely masks the bug, such that it only happens when the racy writes encounter errors. IOW, how can stop_threads() make a difference here if self._put_queue.join() is doing what we're relying on it to do?

Also, ftr, need tests before closing the story (but not necessarily before merging the fix).

#23 Updated by Peter Amstutz over 5 years ago

Tom Clegg wrote:

7225-collection-hang @ 92ce25e

Re typo fix: When we catch StateChangeError in commit_bufferblock, we should re-raise the exception if it doesn't match what we expect. AFAICT the only time we want to swallow the exception is "invalid state change from PENDING to PENDING", is that right?

That's correct. The bug was that it was raising StateChangeError with self.state (a function object) instead of self._state (the correct state variable) so the comparison e.state == BufferBlock.PENDING in commit_bufferblock() would always be false, and the it would not sleep on the block.wait_for_commit.wait() barrier; as a result it would return an unsigned locator in the cases where the block wasn't finished uploading. With the bugfix, it waits for the block to go to the committed or error state as intended, so there is no more race condition.

Re stop_threads(): I don't quite follow why this helps. First of all: What is the rationale for setting thread.daemon = False in the first place? Why keep the Python process alive finishing a write even though there's no way for the resulting locator to make it into a collection? More importantly, note-17 above makes it sound like there can still be unsigned locators at the end of the [previous version of] commit_all(), i.e., after we check for ERROR state. So I worry that calling stop_threads() here merely masks the bug, such that it only happens when the racy writes encounter errors. IOW, how can stop_threads() make a difference here if self._put_queue.join() is doing what we're relying on it to do?

This a separate issue from unsigned locators; this is the hang-on-exit problem. It is more of an design mistake than a bug. I intended for Collection() to be used in a with block, so that on exit it would commit the collection and stop the threads. However, if a user doesn't use it in a with block and doesn't know to call stop_threads() manually, the PUT thread doesn't get shut down and the program hangs.

The rationale for having the PUT writer be a foreground thread (blocking program shutdown) is that if you asked the SDK to write some blocks for you, it would be surprising if it failed to do so because the program ended before the queue was empty. However, you have a point that after save() the PUT queue will be empty, and if the program terminates without calling save() then even if there are pending blocks to be written they are going to be unreferenced. So maybe it's okay to make PUT a background thread and allow that those blocks not to get written at all. What do you think?

Also, ftr, need tests before closing the story (but not necessarily before merging the fix).

I know. I'll have to figure out some way to mock it up that reproduces the problem (without the bugfix) that is fixed by the bugfix.

#24 Updated by Peter Amstutz over 5 years ago

New branch at b16358f which includes the same state typo fix but makes the PUT thread a background thread instead of adding stop_threads() to commit_all().

#25 Updated by Peter Amstutz over 5 years ago

Added a test which fails on master and is fixed by this branch. Now at 16edd7c54ac58cda991d3e9f0a1b48cea3dd985c

#26 Updated by Tom Clegg over 5 years ago

Peter Amstutz wrote:

Tom Clegg wrote:

7225-collection-hang @ 92ce25e

Re typo fix: When we catch StateChangeError in commit_bufferblock, we should re-raise the exception if it doesn't match what we expect. AFAICT the only time we want to swallow the exception is "invalid state change from PENDING to PENDING", is that right?

That's correct. The bug was that it was raising StateChangeError with self.state (a function object) instead of self._state (the correct state variable) so the comparison e.state == BufferBlock.PENDING in commit_bufferblock() would always be false, and the it would not sleep on the block.wait_for_commit.wait() barrier; as a result it would return an unsigned locator in the cases where the block wasn't finished uploading. With the bugfix, it waits for the block to go to the committed or error state as intended, so there is no more race condition.

Ah. So it's actually the block.wait_for_commit.wait() call that is responsible for ensuring everything is written? It's a bit counterintuitive, because it seems like that only gets called for blocks that are already marked PENDING. But that does seem to be true in the commit_all case we're looking at.

The docstring for commit_bufferblock says it "initiates a background upload" and the docstring for commit_all says it "will not return until all buffer blocks are uploaded" so I'm expecting commit_all to have some code that waits until the background uploads are done. And it looks like it's trying to achieve that by calling self._put_queue.join(). So I'm wondering
  • Why doesn't join() fix the race?
  • If the real mechanism is "set_state twice, to make sure everything is already PENDING on the second flush", that should probably be clarified, and it seems like it still has race issues (what if a block goes into ERROR state between the two calls to set_state?)

Any thoughts on "we should re-raise the exception if it doesn't match what we expect"?

Re stop_threads(): I don't quite follow why this helps. First of all: What is the rationale for setting thread.daemon = False in the first place? Why keep the Python process alive finishing a write even though there's no way for the resulting locator to make it into a collection? More importantly, note-17 above makes it sound like there can still be unsigned locators at the end of the [previous version of] commit_all(), i.e., after we check for ERROR state. So I worry that calling stop_threads() here merely masks the bug, such that it only happens when the racy writes encounter errors. IOW, how can stop_threads() make a difference here if self._put_queue.join() is doing what we're relying on it to do?

This a separate issue from unsigned locators; this is the hang-on-exit problem. It is more of an design mistake than a bug. I intended for Collection() to be used in a with block, so that on exit it would commit the collection and stop the threads. However, if a user doesn't use it in a with block and doesn't know to call stop_threads() manually, the PUT thread doesn't get shut down and the program hangs.

The rationale for having the PUT writer be a foreground thread (blocking program shutdown) is that if you asked the SDK to write some blocks for you, it would be surprising if it failed to do so because the program ended before the queue was empty. However, you have a point that after save() the PUT queue will be empty, and if the program terminates without calling save() then even if there are pending blocks to be written they are going to be unreferenced. So maybe it's okay to make PUT a background thread and allow that those blocks not to get written at all. What do you think?

Unless I'm forgetting something, if you ask the SDK to write some blocks for you, but exit the program before they finish writing (therefore before you see the signed locators), then [a] evidently you don't care that much about writing them, since you have no idea whether the writes succeeded, and [b] there is no way for you to tell the difference between "block was written" and "block wasn't written" anyway: either you already have a signed locator (which means a copy was already written beforehand) or you don't (which means you can't GET the block regardless of whether it was written).

(The clarification that stop_threads() isn't supposed to fix the unsigned locators is definitely helpful, thanks.)

Also, ftr, need tests before closing the story (but not necessarily before merging the fix).

I know. I'll have to figure out some way to mock it up that reproduces the problem (without the bugfix) that is fixed by the bugfix.

Indeed. Perhaps mock "put" with something that waits for a signal before calling the real "put", and mock _put_queue.join() to send that signal?

#27 Updated by Peter Amstutz over 5 years ago

Tom Clegg wrote:

Ah. So it's actually the block.wait_for_commit.wait() call that is responsible for ensuring everything is written? It's a bit counterintuitive, because it seems like that only gets called for blocks that are already marked PENDING. But that does seem to be true in the commit_all case we're looking at.

block.wait_for_commit.wait() is called when the block is already pending and the caller has requested a synchronous commit, in this case it needs to wait until the outstanding commit completes.

The docstring for commit_bufferblock says it "initiates a background upload" and the docstring for commit_all says it "will not return until all buffer blocks are uploaded" so I'm expecting commit_all to have some code that waits until the background uploads are done. And it looks like it's trying to achieve that by calling self._put_queue.join(). So I'm wondering

Because in this case, commit_all() is not where the blocks are getting committed. They are actually getting committed earlier in ArvFile.flush() which goes through and does a synchronous commit of each uncommitted block. Since some blocks were already "in flight" and the bug was preventing it from waiting for those blocks to be committed, it would return on blocks that were still pending and they would be recorded with unsigned locators.

  • Why doesn't join() fix the race?

Because it's already been flushed, so there's nothing to do in commit_all().

  • If the real mechanism is "set_state twice, to make sure everything is already PENDING on the second flush", that should probably be clarified, and it seems like it still has race issues (what if a block goes into ERROR state between the two calls to set_state?)

set_state() is transactional, when set_state(PENDING) is called in commit_bufferblock() we don't know if it is in WRITABLE or PENDING state, and if it were to get the state from the state() method that would be a race. However if it is already in PENDING that's okay, it just needs to wait until it exits PENDING state (which will move to either COMMITTED or ERROR).

Any thoughts on "we should re-raise the exception if it doesn't match what we expect"?

I'm not sure what you mean?

Unless I'm forgetting something, if you ask the SDK to write some blocks for you, but exit the program before they finish writing (therefore before you see the signed locators), then [a] evidently you don't care that much about writing them, since you have no idea whether the writes succeeded, and [b] there is no way for you to tell the difference between "block was written" and "block wasn't written" anyway: either you already have a signed locator (which means a copy was already written beforehand) or you don't (which means you can't GET the block regardless of whether it was written).

Yes, I agree with that logic, so the PUT thread is now a daemon thread.

#28 Updated by Tom Clegg over 5 years ago

Peter Amstutz wrote:

Because it's already been flushed, so there's nothing to do in commit_all().

Aha, sorry. commit_all just doesn't get invoked at all here, right? So join() does work, but not if you don't call commit_all!

Any thoughts on "we should re-raise the exception if it doesn't match what we expect"?

I'm not sure what you mean?

Here:

        try:
            # Mark the block as PENDING so to disallow any more appends.
            block.set_state(_BufferBlock.PENDING)
        except StateChangeError as e:
            if e.state == _BufferBlock.PENDING and sync:
                block.wait_for_commit.wait()
                if block.state() == _BufferBlock.ERROR:
                    raise block.error
            return

What if e.state != _BufferBlock.PENDING? It seems we should not silently ignore that exception. Ignoring an unexpected exception is the bad practice that made this particular bug much harder to find, afaict.

#29 Updated by Peter Amstutz over 5 years ago

Tom Clegg wrote:

Peter Amstutz wrote:

Because it's already been flushed, so there's nothing to do in commit_all().

Aha, sorry. commit_all just doesn't get invoked at all here, right? So join() does work, but not if you don't call commit_all!

It gets called later when Collection.manifest_text() is called, but at that point there's nothing to do.

Any thoughts on "we should re-raise the exception if it doesn't match what we expect"?

I'm not sure what you mean?

Here:

[...]

What if e.state != _BufferBlock.PENDING? It seems we should not silently ignore that exception. Ignoring an unexpected exception is the bad practice that made this particular bug much harder to find, afaict.

Except that this wasn't ignoring an unexpected exception, it was failing to recognize an expected exception. However, I can put in some fallback code to re-raise the exception if the exception isn't considered benign.

#30 Updated by Tom Clegg over 5 years ago

The test case looks good, thanks.

#31 Updated by Tom Clegg over 5 years ago

Peter Amstutz wrote:

What if e.state != _BufferBlock.PENDING? It seems we should not silently ignore that exception. Ignoring an unexpected exception is the bad practice that made this particular bug much harder to find, afaict.

Except that this wasn't ignoring an unexpected exception, it was failing to recognize an expected exception. However, I can put in some fallback code to re-raise the exception if the exception isn't considered benign.

The caught exception had a function instead of a valid state in e.state. That was not the exception we were expecting. The fix was to change the exception, not the recognition. But this is just splitting hairs. We didn't recognize it as the benign exception, therefore we should have re-raised it instead of swallowing it. If we had had an "else: raise" like good citizens, we would have seen a stack trace showing us where to find the bug, instead of mysterious hanging threads and unsigned locators and stepping-through-code adventures. And as it turns out, adding "else: raise" to master (like this: 3071c69) confirms that even the existing test cases have already been hitting this particular bug, and would have prevented it from getting to production. Just sayin'.

#32 Updated by Peter Amstutz over 5 years ago

Improved exception handling. Now at d4f1534

#33 Updated by Tom Clegg over 5 years ago

This looks better, thanks.

One more clarification question:

            if e.state == _BufferBlock.PENDING:
                if sync:
                    block.wait_for_commit.wait()
                else:
                    return
            if block.state() == _BufferBlock.COMMITTED:
                return
            elif block.state() == _BufferBlock.ERROR:
                raise block.error

Since ERROR→PENDING is allowed and COMMITTED is final, it seems the only way to reach "raise block.error" is by going through the "if sync" block, so it would have the same effect if it were moved up there -- is this correct? (The way it's written, it looks like it's accommodating "raise error if ERROR state caused StateChangeError", but if I'm following correctly that's impossible, so it really only handles errors coming out of the "if PENDING and sync" case.)

#34 Updated by Peter Amstutz over 5 years ago

Tom Clegg wrote:

This looks better, thanks.

One more clarification question:

[...]

Since ERROR→PENDING is allowed and COMMITTED is final, it seems the only way to reach "raise block.error" is by going through the "if sync" block, so it would have the same effect if it were moved up there -- is this correct? (The way it's written, it looks like it's accommodating "raise error if ERROR state caused StateChangeError", but if I'm following correctly that's impossible, so it really only handles errors coming out of the "if PENDING and sync" case.)

Yes, and that's the way it was originally written (checking for error only after PENDING and sync), I just moved it out one level out of an abundance of caution. The reason the original code swallowed the error is that (as you note) the state chart allows that the only other state the block could be in is COMMITTED, so it made a (unfortunate) assumption that was the case instead of explicitly checking for the COMMITTED state.

#35 Updated by Peter Amstutz over 5 years ago

  • Status changed from New to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:009c08b43affdefa6f2825059a971c750b90e222.

Also available in: Atom PDF