Bug #20198
closedFix flaky test railsRestartSuite.TestConfigReload in lib/controller
Description
Test seems to fail more often when other test suites have been running recently on the same machine.
Example from developer-run-tests-remainder: #3691 /consoleText
FAIL: rails_restart_test.go:33: railsRestartSuite.TestConfigReload oldhash 3d073c3d60d9f4951d5d1db1a3b3f6a776f2e0a0dca9b6a869ca88310d8fb5a3 newhash 57ea481af2d34b32a4b148446d6f6627ec6f5dea7c7f98d8b3fd83f5cc2f4767 waited 20.153174803s for rails to restart rails_restart_test.go:90: c.Check(string(body), check.Matches, `(?ms).*`+newhash+`.*`) ... value string = "" + ... "# HELP arvados_config_load_timestamp_seconds Time when config file was loaded.\n" + ... "# TYPE arvados_config_load_timestamp_seconds gauge\n" + ... "arvados_config_load_timestamp_seconds{sha256=\"3d073c3d60d9f4951d5d1db1a3b3f6a776f2e0a0dca9b6a869ca88310d8fb5a3\"} 1677612413.8657956\n" + ... "# HELP arvados_config_source_timestamp_seconds Timestamp of config file when it was loaded.\n" + ... "# TYPE arvados_config_source_timestamp_seconds gauge\n" + ... "arvados_config_source_timestamp_seconds{sha256=\"3d073c3d60d9f4951d5d1db1a3b3f6a776f2e0a0dca9b6a869ca88310d8fb5a3\"} 1677612371.5435643\n" + ... "# HELP arvados_version_running Indicated version is running.\n" + ... "# TYPE arvados_version_running gauge\n" + ... "arvados_version_running{version=\"unknown\"} 1\n" ... regex string = "(?ms).*57ea481af2d34b32a4b148446d6f6627ec6f5dea7c7f98d8b3fd83f5cc2f4767.*"
Updated by Tom Clegg almost 2 years ago
tl;dr: fork inherits flocks but not threads. We had passenger worker processes holding the "one watcher only" lock but not actually running a watcher thread.
long version:
Using run-tests.sh --interactive
this pattern seems to hold up reliably:
What next? reset What next? test lib/controller -check.vv -check.f=railsRestartSuite.TestConfigReload {pass} What next? test lib/crunchrun What next? test lib/controller -check.vv -check.f=railsRestartSuite.TestConfigReload {fail} What next? test lib/controller -check.vv -check.f=railsRestartSuite.TestConfigReload {fail} What next? reset What next? test lib/controller -check.vv -check.f=railsRestartSuite.TestConfigReload {pass}
I noticed that top in tree mode looked like this after a {pass}
248096 tom 20 0 286336 7460 5544 S 0.0 0.1 0:00.02 `- Passenger watchdog 248099 tom 20 0 2228656 12440 8028 S 0.3 0.2 0:00.12 `- Passenger core 248989 tom 20 0 376284 100324 15564 S 0.0 1.2 0:02.29 `- Passenger AppPreloader: /home/tom/arvados/services/api 249057 tom 20 0 376184 101016 11048 S 1.0 1.2 0:00.13 `- Passenger RubyApp: /home/tom/arvados/services/api (test)
After crunchrun tests, it looked like this
248096 tom 20 0 286336 7460 5544 S 0.0 0.1 0:00.03 `- Passenger watchdog 248099 tom 20 0 2228656 12812 8340 S 0.0 0.2 0:02.31 `- Passenger core 249057 tom 20 0 440944 127508 11332 S 0.0 1.6 0:10.73 `- Passenger RubyApp: /home/tom/arvados/services/api (test)
Note the "Passenger AppPreloader" process has disappeared and the "Passenger RubyApp" has detached from the "Passenger watchdog" tree.
My theory is- the way we launched our "flock and watch config file" thread in config/initializers meant it ran in the Passenger AppPreloader process instead of a worker process
- [usually, depending on timing, or perhaps always] the thread would grab flock on the reload_config.lock file before AppPreloader forked any worker processes
- therefore the worker processes inherited the lock (but not the thread) during fork
- for reasons/conditions I can't fully explain and don't seem to matter (perhaps including reload_config hitting restart.txt), the AppPreloader process goes away, causing RubyApp to detach (but keep running) and a new AppPreloader to appear, with its own RubyApp children
- at this point the old RubyApp process is still alive, and still holds the lock on reload_config.lock, but the old AppPreloader process that ran the watcher thread is gone, and the new AppPreloader process is waiting for the lock before starting a new watcher thread
To resolve this, I used the documented way to run the flock-and-watch thread in a worker process instead of the AppPreloader process when Passenger is in smart-spawn mode. This way, the only process holding the lock is the worker process, meaning that when a new AppPreloader starts, either the old worker holding the lock keeps running the watcher thread (so auto-reload works), or it ends and releases the lock so one of the new workers can start a watcher thread.
With this change, so far, "test lib/crunchrun" followed by "test lib/controller ..." no longer fails.
Updated by Tom Clegg almost 2 years ago
20198-passenger-restart @ c74be245074e7dc71cd43dd856a5c425513e4857 -- developer-run-tests: #3523
Updated by Brett Smith almost 2 years ago
Tom Clegg wrote in #note-2:
20198-passenger-restart @ c74be245074e7dc71cd43dd856a5c425513e4857 -- developer-run-tests: #3523
Looks good to me, thanks.
To expound on everything you said: the reason this problem appears worse after e03ebe77c257e44645c64c8cd71e5e7c115991ef is because before that, the old RubyApp process still running the thread would crash, and that would maybe cause the current test run to fail but make the problem go away for future runs. With that crash resolved, now the old RubyApp hangs around for much longer, and tests keep failing as a result.
Updated by Brett Smith almost 2 years ago
Brett Smith wrote in #note-3:
To expound on everything you said: the reason this problem appears worse after e03ebe77c257e44645c64c8cd71e5e7c115991ef is because before that, the old RubyApp process still running the thread would crash, and that would maybe cause the current test run to fail but make the problem go away for future runs. With that crash resolved, now the old RubyApp hangs around for much longer, and tests keep failing as a result.
Actually no, the truth is the exact opposite of what I wrote. The old thread crashing let the tests pass, because then our current web server process could start a new thread that loaded the configuration correctly. When the old thread doesn't crash, we get the situation that this ticket resolves: there's a thread somewhere holding the lock and reloading the configuration but it's not connected to our current web server so the tests fail.
Updated by Tom Clegg almost 2 years ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|e7555eb1212db8627af6689b250aaf4bd38cefdb.