Bug #21031
Closed
Most test-provision (and other) jobs failing
Description
Sometime in the last week, most of our test-provision jobs started failing like this:
    Name: curl -s -L http://get.rvm.io | bash -s stable - Function: cmd.run - Result: Changed Started: - 05:41:59.444421 Duration: 3346.004 ms
    Name: ruby-2.7.2 - Function: rvm.installed - Result: Changed Started: - 05:42:02.794050 Duration: 19877.371 ms
    Name: arvados-api-package-install-gems-deps-pkg-installed - Function: pkg.installed - Result: Changed Started: - 05:42:22.671973 Duration: 7715.731 ms
    Name: arvados-cli - Function: gem.installed - Result: Changed Started: - 05:42:30.393352 Duration: 176696.635 ms
    Name: arvados-api-server - Function: pkg.installed - Result: Changed Started: - 05:45:27.102923 Duration: 57970.474 ms
    Name: nginx - Function: service.running - Result: Changed Started: - 05:46:25.110546 Duration: 1626.836 ms
    Name: arvados-controller-package-install-gems-deps-pkg-installed - Function: pkg.installed - Result: Clean Started: - 05:46:26.740569 Duration: 349.975 ms
    Name: arvados-cli - Function: gem.installed - Result: Clean Started: - 05:46:27.093146 Duration: 656.533 ms
    Name: arvados-controller - Function: pkg.installed - Result: Changed Started: - 05:46:27.751995 Duration: 4402.48 ms
    Name: arvados-controller - Function: service.running - Result: Changed Started: - 05:46:32.205163 Duration: 941.021 ms
    ----------
              ID: arvados-controller-service-running-service-ready-cmd-run
        Function: cmd.run
            Name: while ! (curl -k -s https://dbn10.local:8800 | \
                    grep -qE "req-[a-z0-9]{20}.{5}error_token")
                  do
                    echo 'waiting for API to be ready...'
                    sleep 1
                  done
          Result: False
         Comment: Command "while ! (curl -k -s https://dbn10.local:8800 | \
                    grep -qE "req-[a-z0-9]{20}.{5}error_token")
                  do
                    echo 'waiting for API to be ready...'
                    sleep 1
                  done
                  " run
         Started: 05:46:33.150493
        Duration: 121883.032 ms
         Changes:
                  ----------
                  pid:
                      3898
                  retcode:
                      1
                  stderr:
                  stdout:
                      while ! (curl -k -s https://dbn10.local:8800 | \
                        grep -qE "req-[a-z0-9]{20}.{5}error_token")
                      do
                        echo 'waiting for API to be ready...'
                        sleep 1
                      done
                      : Timed out after 120 seconds
    Name: nginx - Function: service.mod_watch - Result: Changed Started: - 05:48:35.038389 Duration: 47.401 ms

    Summary for local
    --------------
    Succeeded: 150 (changed=116)
    Failed:      1
    --------------
    Total states run:     151
    Total run time:   533.364 s
    Build step 'Execute shell' marked build as failure
This includes the jobs for debian10, ubuntu1804, and ubuntu2004. Note debian11 is working fine.
Updated by Brett Smith about 1 year ago
- Status changed from New to In Progress
Some initial thoughts:
Is something else running on port 8800? The shell loop that's running is written in such a way that it would go into an infinite loop in that case. On the other hand, the only thing that should be changing is where the installer gets Workbench 2 from, so there shouldn't be any changes in port numbers. Plus git grep '\b8800\b' doesn't return much, although maybe it's a default in one of Workbench 2's dependencies.
Why is debian11 passing when the others aren't? This isn't super surprising since it's closest to the platform where we do development. Is there some key version difference that debian11 is quietly relying on? Or did something get updated for debian11 that needs to be updated for other targets as well?
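The readiness loop in the failing state has no exit condition of its own and only stops when the runner's 120-second timeout kills it. A minimal sketch of a bounded variant follows. The helper names wait_for_ready and api_probe are hypothetical and not part of the installer; the URL and grep pattern are copied from the log above.

```shell
#!/bin/sh
# Hypothetical helper (not from the Arvados installer): run a probe command
# until it succeeds or the attempt budget is exhausted.
# Usage: wait_for_ready MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_for_ready() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # "$@" is the probe; a zero exit status means the service is ready.
        if "$@"; then
            return 0
        fi
        echo "waiting for API to be ready... (attempt $attempt/$max_attempts)" >&2
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    echo "gave up after $max_attempts attempts" >&2
    return 1
}

# Probe copied from the failing state: it passes once the endpoint returns a
# body containing a request ID followed by error_token.
api_probe() {
    curl -k -s https://dbn10.local:8800 |
        grep -qE 'req-[a-z0-9]{20}.{5}error_token'
}
```

Calling `wait_for_ready 120 1 api_probe` would behave much like the original loop, but it returns a non-zero status by itself and reports how many attempts were made, which makes a port conflict or a dead backend easier to tell apart from a slow start.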
Updated by Brett Smith about 1 year ago
- Description updated (diff)
Apparently this isn't necessarily related to #18874. I just bisected by running test-provision against c1bc5396fd54f376a11741cfb7ce420b5929a5a5, which was the commit immediately before that merge, and it's failing the same way.
Updated by Brett Smith about 1 year ago
It's apparently #20862. test-provision succeeded on the 2.7.0 release, then fails on 80794f079f005fd3d927b9d330a46bcc96a1a132, which is the merge immediately after. And this actually makes more sense with the error: I'm guessing the gem shuffling causes a deployment problem for RailsAPI, and this means the up test never passes.
Updated by Brett Smith about 1 year ago
Brett Smith wrote in #note-1:
Why is debian11 passing when the others aren't? This isn't super surprising since it's closest to the platform where we do development. Is there some key version difference that debian11 is quietly relying on? Or did something get updated for debian11 that needs to be updated for other targets as well?
Judging by the logs, the failing platforms install Ruby from RVM, while debian11 uses packaged Ruby. Seems likely to be relevant. Correction: this was incorrect. ubuntu2004 is failing but also uses packaged Ruby.
Updated by Brett Smith about 1 year ago
debian11 installs arvados-login-sync where the others do not. Correction: this was also a red herring. It only happens because debian11 doesn't fail before this point.
Comparing the states that are run via rg --color=never -o '\b\S+ state [^]]+\]' on the Jenkins logs, the only differences between ubuntu2004 and debian11 are what you would expect: names, PostgreSQL version number, etc. There's no difference at the deployment level that would explain why debian11 succeeds and ubuntu2004 fails. It's something lower-level.
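The comparison described above can be sketched roughly as follows. This is an illustration, not the exact commands that were run: extract_states and compare_states are hypothetical helper names, and grep -oE with a POSIX bracket pattern stands in for the rg invocation so the sketch does not depend on ripgrep being installed.

```shell
#!/bin/sh
# Hypothetical helper: list every "<word> state [<name>]" fragment found in a
# log file, sorted and de-duplicated.
extract_states() {
    grep -oE '[^[:space:]]+ state [^]]+[]]' "$1" | sort -u
}

# Hypothetical helper: show fragments that appear in only one of two logs.
# Lines unique to the first log are unindented; lines unique to the second
# log are indented with one tab (standard comm -3 output).
compare_states() {
    left=$(mktemp) && right=$(mktemp) || return 1
    extract_states "$1" > "$left"
    extract_states "$2" > "$right"
    comm -3 "$left" "$right"
    rm -f "$left" "$right"
}
```

For example, `compare_states debian11.log ubuntu2004.log` (file names assumed) would print only the states that differ between the two runs, such as host names or PostgreSQL version numbers.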
Updated by Brett Smith about 1 year ago
I was able to install the arvados-api-server development package in my own Ubuntu 20.04 container. That's what you'd expect, since Jenkins can do that too, but I was keeping an eye out for any error messages that Jenkins might be ignoring and didn't see any.
Trying to replicate this further requires actually running the services, which is more challenging in a container. So either I'd need a way to get into the running Jenkins worker and debug, or else recreate the environment in a VM myself. That's expensive enough that I'm not going to do it without someone actually asking me.
So, I'm blocked on this unless someone wants to give me an idea to pursue, SSH access to Jenkins workers, or permission to go deep on this.
Updated by Peter Amstutz about 1 year ago
- Target version changed from Development 2023-10-11 sprint to Development 2023-10-25 sprint
- Assigned To changed from Brett Smith to Lucas Di Pentima
Updated by Peter Amstutz about 1 year ago
- Target version changed from Development 2023-10-25 sprint to Development 2023-11-08 sprint
Updated by Peter Amstutz about 1 year ago
- Target version changed from Development 2023-11-08 sprint to Development 2023-11-29 sprint
Updated by Lucas Di Pentima about 1 year ago
- Subject changed from Most test-provision jobs failing to Most test-provision (and other) jobs failing
Updated by Lucas Di Pentima about 1 year ago
The issue I'm observing is the following: test-provision-debian11: #539
    ...
    15:32:03 Running bundle config set --local path /var/www/arvados-api/shared/vendor_bundle... done.
    15:32:03 Running bundle install... done.
    15:32:03 Ensuring directory and file permissions ...... done.
    15:32:03 Setting up database...Defaulting to memory cache, because /var/www/arvados-api/current/tmp/cache does not exist
    15:32:03 rake aborted!
    15:32:03 Don't know how to build task 'db:structure:load' (See the list of available tasks with `rake --tasks`)
    15:32:03 Did you mean? db:structure:dump
    15:32:03
    15:32:03 (See full trace by running task with --trace)
    15:32:03 failed.
    15:32:03 dpkg: error processing package arvados-api-server (--configure):
    15:32:03 installed arvados-api-server package post-installation script subprocess returned error exit status 1
    ...
Updated by Lucas Di Pentima about 1 year ago
Since Rails 6.1 the db:structure:{load, dump} rake tasks have been deprecated: https://github.com/rails/rails/pull/39470. It seems we should be using db:schema:{load, dump} now, with config.active_record.schema_format set to :sql (as we already do).
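To illustrate the task rename, here is a minimal sketch of how a packaging script could pick the task name from the Rails version. This is not what the fix on this ticket does (it simply switches to db:schema:load). pick_load_task is a hypothetical helper, the 6.1 threshold is taken from the deprecation noted above, and sort -V is assumed to be available (GNU coreutils).

```shell
#!/bin/sh
# Hypothetical helper: print the rake task to use for loading the database
# schema, given a Rails version string such as "7.0.8".
# Assumption: db:structure:load is appropriate before 6.1, and db:schema:load
# from 6.1 onward (where it honors schema_format = :sql).
pick_load_task() {
    rails_version=$1
    # sort -V orders version strings; if "6.1" comes out first, the given
    # version is 6.1 or newer.
    lowest=$(printf '%s\n%s\n' "6.1" "$rails_version" | sort -V | head -n 1)
    if [ "$lowest" = "6.1" ]; then
        echo "db:schema:load"
    else
        echo "db:structure:load"
    fi
}
```

A post-install script could then run something like `bundle exec rake "$(pick_load_task "$version")"`, where `$version` is however the script determines the installed Rails version (left unspecified here).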
Updated by Lucas Di Pentima about 1 year ago
Updates at d6e41e6 - branch 21031-test-provision-fix
Package test run: developer-build-packages-debian11: #15
- Replaces db:structure:load with db:schema:load in RailsAPI package build script.
- Updates postinst.sh script for Rails apps packages to reflect the above.
Updated by Lucas Di Pentima about 1 year ago
Given that the above test passed, I'll be merging this to main so that we can build packages & test them on the test-provision jobs.
Updated by Lucas Di Pentima about 1 year ago
Build packages pipeline: build-packages-multijob: #3811
Updated by Lucas Di Pentima about 1 year ago
Test provision job still failing with the same message: test-provision-debian11: #541
Updated by Lucas Di Pentima about 1 year ago
- Status changed from In Progress to Resolved
OK, so the above tests didn't use the correct package for some reason, but subsequent runs did, and they all worked OK. Example: test-provision: #726