Project

General

Profile

Actions

Bug #20198

closed

Fix flaky test railsRestartSuite.TestConfigReload in lib/controller

Added by Tom Clegg about 1 year ago. Updated about 1 year ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
API
Story points:
0.5
Release relationship:
Auto

Description

Test seems to fail more often when other test suites have been running recently on the same machine.

Example from developer-run-tests-remainder: #3691 /consoleText

FAIL: rails_restart_test.go:33: railsRestartSuite.TestConfigReload

oldhash 3d073c3d60d9f4951d5d1db1a3b3f6a776f2e0a0dca9b6a869ca88310d8fb5a3
newhash 57ea481af2d34b32a4b148446d6f6627ec6f5dea7c7f98d8b3fd83f5cc2f4767
waited 20.153174803s for rails to restart
rails_restart_test.go:90:
    c.Check(string(body), check.Matches, `(?ms).*`+newhash+`.*`)
... value string = "" +
...     "# HELP arvados_config_load_timestamp_seconds Time when config file was loaded.\n" +
...     "# TYPE arvados_config_load_timestamp_seconds gauge\n" +
...     "arvados_config_load_timestamp_seconds{sha256=\"3d073c3d60d9f4951d5d1db1a3b3f6a776f2e0a0dca9b6a869ca88310d8fb5a3\"} 1677612413.8657956\n" +
...     "# HELP arvados_config_source_timestamp_seconds Timestamp of config file when it was loaded.\n" +
...     "# TYPE arvados_config_source_timestamp_seconds gauge\n" +
...     "arvados_config_source_timestamp_seconds{sha256=\"3d073c3d60d9f4951d5d1db1a3b3f6a776f2e0a0dca9b6a869ca88310d8fb5a3\"} 1677612371.5435643\n" +
...     "# HELP arvados_version_running Indicated version is running.\n" +
...     "# TYPE arvados_version_running gauge\n" +
...     "arvados_version_running{version=\"unknown\"} 1\n" 
... regex string = "(?ms).*57ea481af2d34b32a4b148446d6f6627ec6f5dea7c7f98d8b3fd83f5cc2f4767.*" 

Subtasks 1 (0 open1 closed)

Task #20199: Review 20198-passenger-restartResolvedTom Clegg03/02/2023Actions
Actions #1

Updated by Tom Clegg about 1 year ago

tl;dr: fork inherits flocks but not threads. We had passenger worker processes holding the "one watcher only" lock but not actually running a watcher thread.

long version:

Using run-tests.sh --interactive this pattern seems to hold up reliably:

What next? reset
What next? test lib/controller -check.vv -check.f=railsRestartSuite.TestConfigReload
{pass}
What next? test lib/crunchrun
What next? test lib/controller -check.vv -check.f=railsRestartSuite.TestConfigReload
{fail}
What next? test lib/controller -check.vv -check.f=railsRestartSuite.TestConfigReload
{fail}
What next? reset
What next? test lib/controller -check.vv -check.f=railsRestartSuite.TestConfigReload
{pass}

I noticed that top in tree mode looked like this after a {pass}

 248096 tom       20   0  286336   7460   5544 S   0.0   0.1   0:00.02  `- Passenger watchdog                                                                                                                                                                        
 248099 tom       20   0 2228656  12440   8028 S   0.3   0.2   0:00.12      `- Passenger core                                                                                                                                                                        
 248989 tom       20   0  376284 100324  15564 S   0.0   1.2   0:02.29          `- Passenger AppPreloader: /home/tom/arvados/services/api                                                                                                                            
 249057 tom       20   0  376184 101016  11048 S   1.0   1.2   0:00.13              `- Passenger RubyApp: /home/tom/arvados/services/api (test)                                                                                                                      

After crunchrun tests, it looked like this

 248096 tom       20   0  286336   7460   5544 S   0.0   0.1   0:00.03  `- Passenger watchdog                                                                                                                                                                        
 248099 tom       20   0 2228656  12812   8340 S   0.0   0.2   0:02.31      `- Passenger core                                                                                                                                                                        
 249057 tom       20   0  440944 127508  11332 S   0.0   1.6   0:10.73  `- Passenger RubyApp: /home/tom/arvados/services/api (test)                                                                                                                                  

Note the "Passenger AppPreloader" process has disappeared and the "Passenger RubyApp" has detached from the "Passenger watchdog" tree.

My theory is
  • the way we launched our "flock and watch config file" thread in config/initializers meant it ran in the Passenger AppPreloader process instead of a worker process
  • [usually, depending on timing, or perhaps always] the thread would grab flock on the reload_config.lock file before AppPreloader forked any worker processes
  • therefore the worker processes inherited the lock (but not the thread) during fork
  • for reasons/conditions I can't fully explain and don't seem to matter (perhaps including reload_config hitting restart.txt), the AppPreloader process goes away, causing RubyApp to detach (but keep running) and a new AppPreloader to appear, with its own RubyApp children
  • at this point the old RubyApp process is still alive, and still holds the lock on reload_config.lock, but the old AppPreloader process that ran the watcher thread is gone, and the new AppPreloader process is waiting for the lock before starting a new watcher thread

To resolve this, I used the documented way to run the flock-and-watch thread in a worker process instead of the AppPreloader process when Passenger is in smart-spawn mode. This way, the only process holding the lock is the worker process, meaning that when a new AppPreloader starts, either the old worker holding the lock keeps running the watcher thread (so auto-reload works), or it ends and releases the lock so one of the new workers can start a watcher thread.

With this change, so far, "test lib/crunchrun" followed by "test lib/controller ..." no longer fails.

Actions #3

Updated by Brett Smith about 1 year ago

Tom Clegg wrote in #note-2:

20198-passenger-restart @ c74be245074e7dc71cd43dd856a5c425513e4857 -- developer-run-tests: #3523

Looks good to me, thanks.

To expound on everything you said: the reason this problem appears worse after e03ebe77c257e44645c64c8cd71e5e7c115991ef is because before that, the old RubyApp process still running the thread would crash, and that would maybe cause the current test run to fail but make the problem go away for future runs. With that crash resolved, now the old RubyApp hangs around for much longer, and tests keep failing as a result.

Actions #4

Updated by Brett Smith about 1 year ago

Brett Smith wrote in #note-3:

To expound on everything you said: the reason this problem appears worse after e03ebe77c257e44645c64c8cd71e5e7c115991ef is because before that, the old RubyApp process still running the thread would crash, and that would maybe cause the current test run to fail but make the problem go away for future runs. With that crash resolved, now the old RubyApp hangs around for much longer, and tests keep failing as a result.

Actually no, the truth is the exact opposite of what I wrote. The old thread crashing let the tests pass, because then our current web server process could start a new thread that loaded the configuration correctly. When the old thread doesn't crash, we get the situation that this ticket resolves: there's a thread somewhere holding the lock and reloading the configuration but it's not connected to our current web server so the tests fail.

Actions #5

Updated by Tom Clegg about 1 year ago

  • Status changed from In Progress to Resolved
Actions #6

Updated by Peter Amstutz about 1 year ago

  • Release set to 57
Actions

Also available in: Atom PDF