Bug #9018

[Node manager] exception handler should not kill parent process

Added by Tom Clegg over 5 years ago. Updated over 5 years ago.

Status: Resolved
Priority: Normal
Assigned To: -
Category: Node Manager
Target version:
Start date:
Due date:
% Done: 100%
Estimated time:
Story points: -

Description

A race condition in test_fatal_error (tests.test_failure.ActorUnhandledExceptionTest) causes os.killpg() to be called after it has been unstubbed. This kills the test suite and run-tests.sh.

There are two problems here:
  • The test should not have a race condition.
  • The exception handler should kill only node manager itself, not other processes.

Proposed fix for overkill

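Use os._exit() or os.kill(os.getpid(), 9) instead of os.killpg(). Note that pid 0 would not work here: os.kill(0, 9) signals every process in the caller's process group.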
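
A minimal sketch of the two self-only termination options, not the actual node manager code. The helper run_handler and the snippet names KILL_SELF and EXIT_NOW are hypothetical; each snippet runs in a child interpreter so the effect can be observed from a parent that must survive. It assumes a POSIX system.

```python
import signal
import subprocess
import sys

# Hypothetical handler bodies. Each one terminates only the calling process.
# os.getpid() is used because pid 0 would address the whole process group.
KILL_SELF = "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"
EXIT_NOW = "import os; os._exit(1)"


def run_handler(snippet):
    """Run a handler snippet in a child interpreter and return its exit status."""
    return subprocess.run([sys.executable, "-c", snippet]).returncode


if __name__ == "__main__":
    # The parent is still alive after both calls and can report the results.
    print("kill self:", run_handler(KILL_SELF))
    print("os._exit:", run_handler(EXIT_NOW))
```

On POSIX, subprocess reports a child killed by a signal as a negative return code, so the first call yields -signal.SIGKILL while the parent (standing in for the test suite or run-tests.sh) keeps running.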

Proposed fix for test race

TBD?

Associated revisions

Revision aea53001
Added by Peter Amstutz over 5 years ago

Merge branch '9018-nodemanager-kill-instead-of-killpg' closes #9018

History

#1 Updated by Tom Clegg over 5 years ago

  • Description updated (diff)
  • Category set to Node Manager

#2 Updated by Brett Smith over 5 years ago

  • Target version set to Arvados Future Sprints

#3 Updated by Peter Amstutz over 5 years ago

  • Target version changed from Arvados Future Sprints to 2016-05-25 sprint

#4 Updated by Peter Amstutz over 5 years ago

  • Status changed from New to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:aea5300167770beb3cca6ad90e5ebb04da961416.

#5 Updated by Tom Clegg over 5 years ago

The test race might still exist. However, it hasn't been seen recently, so maybe some other changes have fixed it by accident.

(11:07:12) tetron_: I haven't seen the race condition happen 
(11:07:59) tetron_: and I haven't been able to work out a sequence that would cause it to happen
(11:10:51) tetron_: I believe the race only happens if the test also fails for some other reason and it's unable to wait for the actor to stop
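
A rough sketch of the ordering that avoids the race described above, assuming the race comes from the stub being removed while the actor thread is still running. The names fatal_error_handler and run_with_stub are hypothetical and the thread stands in for the actor; this is not the real test code.

```python
import os
import signal
import threading
from unittest import mock


def fatal_error_handler():
    """Hypothetical stand-in for the actor's unhandled-exception handler."""
    os.killpg(os.getpgid(0), signal.SIGKILL)


def run_with_stub(timeout=5.0):
    """Keep os.killpg stubbed until the worker thread has stopped."""
    with mock.patch("os.killpg") as stub:
        worker = threading.Thread(target=fatal_error_handler)
        worker.start()
        # Wait inside the patch context: leaving it while the worker is
        # still running would let the real os.killpg() be called.
        worker.join(timeout)
        stopped = not worker.is_alive()
    return stopped, stub.call_count
```

If the join times out, the stub is still removed when the with block exits, which is the window the comment describes: a test that cannot wait for the actor to stop may unstub too early.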
