Bug #17175

arv-boot sometimes fail to start (ie: WB2 integration tests)

Added by Lucas Di Pentima 5 months ago. Updated 4 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Tests
Target version:
Start date:
12/29/2020
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
-

Description

From time to time arv-boot fails to start, producing the following output in the case of wb2 integration tests pipelines:

...
18:37:19 sudo is already the newest version (1.8.27-1+deb10u2).
18:37:19 wget is already the newest version (1.20.1-1.1).
18:37:19 xvfb is already the newest version (2:1.20.4-1+deb10u1).
18:37:19 zlib1g-dev is already the newest version (1:1.2.11.dfsg-1).
18:37:19 ca-certificates is already the newest version (20200601~deb10u1).
18:37:20 0 upgraded, 0 newly installed, 0 to remove and 57 not upgraded.
18:37:20 INFO[2020-12-01T21:37:20.117040482Z] ruby 2.5.7 already installed                 
18:37:20 INFO[2020-12-01T21:37:20.122311238Z] go 1.14 already installed                    
18:37:20 INFO[2020-12-01T21:37:20.154868995Z] phantomjs 1.9.8 already installed            
18:37:20 INFO[2020-12-01T21:37:20.156001403Z] geckodriver 0.24.0 already installed         
18:37:20 INFO[2020-12-01T21:37:20.159528791Z] nodejs v8.15.1 already installed             
18:37:20 INFO[2020-12-01T21:37:20.884538246Z] gradle 5.3.1 already installed               
18:37:20 INFO[2020-12-01T21:37:20.886043239Z] locale en_US.UTF-8 already installed         
18:37:20 INFO[2020-12-01T21:37:20.920794537Z] postgresql cluster 11-main is down; trying to start 
18:37:20 2020-12-01 21:37:20.996 UTC [792] LOG:  listening on IPv4 address "127.0.0.1", port 5432
18:37:20 2020-12-01 21:37:20.996 UTC [792] LOG:  could not bind IPv6 address "::1": Cannot assign requested address
18:37:20 2020-12-01 21:37:20.996 UTC [792] HINT:  Is another postmaster already running on port 5432? If not, wait a few seconds and retry.
18:37:21 2020-12-01 21:37:21.001 UTC [792] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432" 
18:37:21 2020-12-01 21:37:21.020 UTC [795] LOG:  database system was shut down at 2020-09-28 16:47:32 UTC
18:37:21 2020-12-01 21:37:21.047 UTC [792] LOG:  database system is ready to accept connections
18:37:22 2020-12-01 21:37:22.009 UTC [803] root@root FATAL:  role "root" does not exist
18:37:22 INFO[2020-12-01T21:37:22.362835777Z] postgresql supports collation en_US          
18:37:22 INFO[2020-12-01T21:37:22.423076600Z] postgresql supports collation en_US.UTF-8    
18:37:22 INFO[2020-12-01T21:37:22.503320541Z] arvados role exists; superuser privileges added, password updated 
18:37:22 INFO[2020-12-01T21:37:22.503416234Z] sent SIGTERM; waiting for postgres to shut down 
18:37:22 2020-12-01 21:37:22.503 UTC [792] LOG:  received smart shutdown request
18:37:22 2020-12-01 21:37:22.508 UTC [792] LOG:  background worker "logical replication launcher" (PID 801) exited with exit code 1
18:37:22 2020-12-01 21:37:22.511 UTC [796] LOG:  shutting down
18:37:22 2020-12-01 21:37:22.577 UTC [792] LOG:  database system is shut down
18:37:22 Launching arvados in test mode...
18:39:58 + set +e +o pipefail
18:39:58 + kill 834
18:39:58 + wait 834
18:39:58 + true
18:39:58 + '[' 1 -eq 1 ']'
18:39:58 + rm -rf /tmp/tmp.1sNVIKdaCS
18:39:59 + echo done
18:39:59 done
18:39:59 make: *** [Makefile:67: integration-tests] Error 1
18:39:59 Build step 'Execute shell' marked build as failure
18:39:59 Archiving artifacts
18:39:59 An attempt to send an e-mail to empty list of recipients, ignored.
18:39:59 Started calculate disk usage of build
18:39:59 Finished Calculation of disk usage of build in 0 seconds
18:39:59 Started calculate disk usage of workspace
18:40:00 Finished Calculation of disk usage of workspace in  1 second
18:40:00 Finished: FAILURE

Attached is the arvados.log file from one of those failures when running the tests locally.

arvados-failed.log (82.8 KB) arvados-failed.log Lucas Di Pentima, 12/01/2020 09:46 PM

Subtasks

Task #17179: Review 17175-boot-fail-rake-seedResolvedLucas Di Pentima

Associated revisions

Revision 1ce3429c
Added by Tom Clegg 4 months ago

Merge branch '17175-boot-fail-rake-seed'

fixes #17175

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

History

#1 Updated by Lucas Di Pentima 5 months ago

  • Target version changed from 2020-12-16 Sprint to 2021-01-06 Sprint

#2 Updated by Tom Clegg 5 months ago

  • Assigned To changed from Lucas Di Pentima to Tom Clegg
  • Status changed from New to In Progress

17175-boot-fail-rake-seed @ e4a6240bf2ca163c36a4357ef0f80958c40bd39a -- https://ci.arvados.org/view/Developer/job/developer-run-tests/2237/

    If the rake task and the app start concurrently, they both try to set
    up the database seeds. The exclusive-locking mechanism in
    services/api/app/lib/current_api_client.rb doesn't cover this case
    because the rake task and app don't use the same cache. If the app
    commits the database seeds between the rake task's "check existing" 
    and "commit" steps, commit fails and the rake task aborts.

(also included a fix for the logging labels that were showing the first argument ("-l" or "-D") instead of the program name ("initdb" or "postgres") in the setuidgid-wrapper case)

#3 Updated by Tom Clegg 5 months ago

The database errors in the issue description are normal -- it seems we have a number of "FATAL" logs that are actually harmless -- but the attached arvados-failed.log file shows [what I think is] the rake-vs.-app race problem.

#4 Updated by Lucas Di Pentima 5 months ago

The fix LGTM, I was suffering around 50% of failed test runs on lib/controller but after merging your branch the failures are gone. Thanks!

#5 Updated by Tom Clegg 4 months ago

  • Status changed from In Progress to Resolved

Also available in: Atom PDF