Bug #21031

closed

Most test-provision (and other) jobs failing

Added by Brett Smith 7 months ago. Updated 10 days ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Deployment
Story points:
-
Release:
Release relationship:
Auto

Description

Starting sometime in the last week, most of our test-provision jobs have been failing like this:

  Name: curl -s -L http://get.rvm.io | bash -s stable - Function: cmd.run - Result: Changed Started: - 05:41:59.444421 Duration: 3346.004 ms
  Name: ruby-2.7.2 - Function: rvm.installed - Result: Changed Started: - 05:42:02.794050 Duration: 19877.371 ms
  Name: arvados-api-package-install-gems-deps-pkg-installed - Function: pkg.installed - Result: Changed Started: - 05:42:22.671973 Duration: 7715.731 ms
  Name: arvados-cli - Function: gem.installed - Result: Changed Started: - 05:42:30.393352 Duration: 176696.635 ms
  Name: arvados-api-server - Function: pkg.installed - Result: Changed Started: - 05:45:27.102923 Duration: 57970.474 ms
  Name: nginx - Function: service.running - Result: Changed Started: - 05:46:25.110546 Duration: 1626.836 ms
  Name: arvados-controller-package-install-gems-deps-pkg-installed - Function: pkg.installed - Result: Clean Started: - 05:46:26.740569 Duration: 349.975 ms
  Name: arvados-cli - Function: gem.installed - Result: Clean Started: - 05:46:27.093146 Duration: 656.533 ms
  Name: arvados-controller - Function: pkg.installed - Result: Changed Started: - 05:46:27.751995 Duration: 4402.48 ms
  Name: arvados-controller - Function: service.running - Result: Changed Started: - 05:46:32.205163 Duration: 941.021 ms
----------
          ID: arvados-controller-service-running-service-ready-cmd-run
    Function: cmd.run
        Name: while ! (curl -k -s https://dbn10.local:8800 | \
         grep -qE "req-[a-z0-9]{20}.{5}error_token") do
  echo 'waiting for API to be ready...'
  sleep 1
done

      Result: False
     Comment: Command "while ! (curl -k -s https://dbn10.local:8800 | \
                       grep -qE "req-[a-z0-9]{20}.{5}error_token") do
                echo 'waiting for API to be ready...'
                sleep 1
              done
              " run
     Started: 05:46:33.150493
    Duration: 121883.032 ms
     Changes:   
              ----------
              pid:
                  3898
              retcode:
                  1
              stderr:
              stdout:
                  while ! (curl -k -s https://dbn10.local:8800 | \
                           grep -qE "req-[a-z0-9]{20}.{5}error_token") do
                    echo 'waiting for API to be ready...'
                    sleep 1
                  done
                   : Timed out after 120 seconds
  Name: nginx - Function: service.mod_watch - Result: Changed Started: - 05:48:35.038389 Duration: 47.401 ms

Summary for local
--------------
Succeeded: 150 (changed=116)
Failed:      1
--------------
Total states run:     151
Total run time:   533.364 s
Build step 'Execute shell' marked build as failure

This includes the jobs for debian10, ubuntu1804, and ubuntu2004. Note debian11 is working fine.

Actions #1

Updated by Brett Smith 7 months ago

  • Status changed from New to In Progress

Some initial thoughts:

Is something else running on port 8800? The shell loop that's running is written in such a way that it would loop forever in that case. On the other hand, the only thing that should be changing is where the installer gets Workbench 2 from, so there shouldn't be any changes in port numbers. Plus, git grep '\b8800\b' doesn't return much, although maybe it's a default in one of Workbench 2's dependencies.
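One way to tell "nothing is answering" apart from "something else is answering" would be to bound the loop by attempts and keep the last response for inspection. This is only a minimal sketch: wait_for_api is a hypothetical helper, not part of the installer, and the probe command is passed in as arguments. The grep pattern is the one from the failing state.

```shell
# Hypothetical helper: poll until the probe's output looks like a RailsAPI
# error response, giving up after a fixed number of attempts.
# Usage: wait_for_api MAX_ATTEMPTS DELAY_SECONDS PROBE_COMMAND [ARGS...]
wait_for_api() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  last_response=
  while [ "$attempt" -le "$max_attempts" ]; do
    # Run the probe; a failing probe just yields an empty response.
    last_response=$("$@" 2>/dev/null) || true
    if printf '%s\n' "$last_response" |
        grep -qE 'req-[a-z0-9]{20}.{5}error_token'; then
      return 0
    fi
    echo "waiting for API to be ready... (attempt $attempt/$max_attempts)" >&2
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  # Show what actually answered, if anything, to help spot a port conflict.
  echo "API not ready; last response: ${last_response:-<none>}" >&2
  return 1
}
```

It could be called as wait_for_api 120 1 curl -k -s https://dbn10.local:8800, in which case a failure would also print whatever the port returned last.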

Why is debian11 passing when the others aren't? This isn't super surprising since it's closest to the platform where we do development. Is there some key version difference that debian11 is quietly relying on? Or did something get updated for debian11 that needs to be updated for other targets as well?

Actions #2

Updated by Brett Smith 7 months ago

  • Description updated (diff)

Apparently this isn't necessarily related to #18874. I just bisected by running test-provision against c1bc5396fd54f376a11741cfb7ce420b5929a5a5, which was the commit immediately before that merge, and it's failing the same way.

Actions #3

Updated by Brett Smith 7 months ago

It's apparently #20862. test-provision succeeded on the 2.7.0 release, then fails on 80794f079f005fd3d927b9d330a46bcc96a1a132, which is the merge immediately after. And this actually makes more sense with the error: I'm guessing the gem shuffling causes a deployment problem for RailsAPI, and this means the up test never passes.

Actions #4

Updated by Brett Smith 7 months ago

Brett Smith wrote in #note-1:

Why is debian11 passing when the others aren't? This isn't super surprising since it's closest to the platform where we do development. Is there some key version difference that debian11 is quietly relying on? Or did something get updated for debian11 that needs to be updated for other targets as well?

Judging by the logs, the failing platforms install Ruby from RVM, while debian11 uses packaged Ruby. Seems likely to be relevant. Correction: this was incorrect; ubuntu2004 is failing but also uses packaged Ruby.

Actions #5

Updated by Brett Smith 7 months ago

debian11 installs arvados-login-sync where the others do not. This was also a red herring: it only happens because debian11 doesn't fail before this point.

Comparing the states that are run, as extracted from the Jenkins logs with rg --color=never -o '\b\S+ state [^]]+\]', the only differences between ubuntu2004 and debian11 are what you would expect: names, PostgreSQL version number, etc. There's no difference at the deployment level that would explain why debian11 succeeds and ubuntu2004 fails. It's something lower-level.
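The comparison can be sketched as a small script. This is an illustration under assumptions, not the exact procedure used: extract_states and state_diff are hypothetical helper names, the log file paths are placeholders, and the rg pattern is ported to grep -oE with POSIX character classes so it runs where ripgrep is not installed.

```shell
# Hypothetical helpers for comparing which Salt states two Jenkins logs ran.
extract_states() {
  # Print the unique "<word> state [<id>]" fragments found in log file $1.
  grep -oE '[^[:space:]]+ state [^]]+\]' "$1" | sort -u
}

state_diff() {
  # Print fragments that appear in only one of the two logs: lines unique to
  # $1 start at column one, lines unique to $2 are indented by a tab.
  left=$(mktemp)
  right=$(mktemp)
  extract_states "$1" > "$left"
  extract_states "$2" > "$right"
  comm -3 "$left" "$right"
  rm -f "$left" "$right"
}
```

For example, state_diff ubuntu2004.log debian11.log (placeholder file names) would list only the state fragments that differ between the two runs.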

Actions #6

Updated by Brett Smith 7 months ago

I was able to install the arvados-api-server development package in my own Ubuntu 20.04 container. That's what you'd expect, since Jenkins can do that too, but I was keeping an eye out for any error messages that Jenkins might be ignoring, and I didn't see any.

Trying to replicate this further requires actually running the services, which is more challenging in a container. So either I'd need a way to get into the running Jenkins worker and debug, or else recreate the environment in a VM myself. That's expensive enough that I'm not going to do it without someone actually asking me.

So, I'm blocked on this unless someone wants to give me an idea to pursue, SSH access to the Jenkins workers, or permission to go deep on this.

Actions #7

Updated by Peter Amstutz 7 months ago

  • Target version changed from Development 2023-10-11 sprint to Development 2023-10-25 sprint
  • Assigned To changed from Brett Smith to Lucas Di Pentima

Actions #8

Updated by Peter Amstutz 7 months ago

  • Tracker changed from Idea to Bug

Actions #9

Updated by Peter Amstutz 7 months ago

  • Target version changed from Development 2023-10-25 sprint to Development 2023-11-08 sprint

Actions #10

Updated by Peter Amstutz 6 months ago

  • Target version changed from Development 2023-11-08 sprint to Development 2023-11-29 sprint

Actions #11

Updated by Lucas Di Pentima 6 months ago

  • Subject changed from Most test-provision jobs failing to Most test-provision (and other) jobs failing

Actions #12

Updated by Lucas Di Pentima 6 months ago

The issue I'm observing is the following: test-provision-debian11: #539

...
15:32:03 Running bundle config set --local path /var/www/arvados-api/shared/vendor_bundle... done.
15:32:03 Running bundle install... done.
15:32:03 Ensuring directory and file permissions ...... done.
15:32:03 Setting up database...Defaulting to memory cache, because /var/www/arvados-api/current/tmp/cache does not exist
15:32:03 rake aborted!
15:32:03 Don't know how to build task 'db:structure:load' (See the list of available tasks with `rake --tasks`)
15:32:03 Did you mean?  db:structure:dump
15:32:03 
15:32:03 (See full trace by running task with --trace)
15:32:03  failed.
15:32:03 dpkg: error processing package arvados-api-server (--configure):
15:32:03  installed arvados-api-server package post-installation script subprocess returned error exit status 1
...

Actions #13

Updated by Lucas Di Pentima 6 months ago

Since Rails 6.1, the db:structure:{load, dump} rake tasks have been deprecated: https://github.com/rails/rails/pull/39470. It seems we should be using db:schema:{load, dump} now, with config.active_record.schema_format set to :sql (as we already do).
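As an illustration of the task rename, a script that has to support both older and newer Rails could pick the task name from the Rails version. This is a hedged sketch only: schema_load_task is a hypothetical helper, not something in the Arvados packaging scripts, and it assumes the SQL schema format is in use.

```shell
# Hypothetical helper: print the rake task that loads an SQL-format schema
# for the given Rails version (e.g. "6.1.7").
schema_load_task() {
  major=${1%%.*}
  case $1 in
    *.*) rest=${1#*.}; minor=${rest%%.*} ;;
    *)   minor=0 ;;  # bare major version such as "6" is treated as 6.0
  esac
  if [ "$major" -gt 6 ] || { [ "$major" -eq 6 ] && [ "$minor" -ge 1 ]; }; then
    # Rails 6.1+: db:schema:load honors config.active_record.schema_format.
    echo db:schema:load
  else
    # Before Rails 6.1, SQL-format schemas are loaded with db:structure:load.
    echo db:structure:load
  fi
}
```

A caller might then run something like bundle exec rake "$(schema_load_task 6.1.7)", with the version string obtained however the build already determines it.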

Actions #14

Updated by Lucas Di Pentima 6 months ago

Updates at d6e41e6 - branch 21031-test-provision-fix
Package test run: developer-build-packages-debian11: #15

  • Replaces db:structure:load with db:schema:load in RailsAPI package build script.
  • Updates postinst.sh script for Rails apps packages to reflect the above.

Actions #15

Updated by Lucas Di Pentima 6 months ago

Given that the above test passed, I'll be merging this to main so that we can build packages & test them on the test-provision jobs.

Actions #16

Updated by Lucas Di Pentima 6 months ago

Build packages pipeline: build-packages-multijob: #3811

Actions #17

Updated by Lucas Di Pentima 6 months ago

Test provision job still failing with the same message: test-provision-debian11: #541

Actions #18

Updated by Lucas Di Pentima 6 months ago

  • Status changed from In Progress to Resolved

OK, so the above tests didn't use the correct package for some reason, but subsequent runs did, and they all worked. Example: test-provision: #726

Actions #19

Updated by Peter Amstutz 10 days ago

  • Release set to 70