Bug #21031
closed
Most test-provision (and other) jobs failing
Added by Brett Smith about 1 year ago.
Updated 7 months ago.
Release relationship:
Auto
Description
Sometime in the last week, most of our test-provision jobs started failing like this:
Name: curl -s -L http://get.rvm.io | bash -s stable - Function: cmd.run - Result: Changed Started: - 05:41:59.444421 Duration: 3346.004 ms
Name: ruby-2.7.2 - Function: rvm.installed - Result: Changed Started: - 05:42:02.794050 Duration: 19877.371 ms
Name: arvados-api-package-install-gems-deps-pkg-installed - Function: pkg.installed - Result: Changed Started: - 05:42:22.671973 Duration: 7715.731 ms
Name: arvados-cli - Function: gem.installed - Result: Changed Started: - 05:42:30.393352 Duration: 176696.635 ms
Name: arvados-api-server - Function: pkg.installed - Result: Changed Started: - 05:45:27.102923 Duration: 57970.474 ms
Name: nginx - Function: service.running - Result: Changed Started: - 05:46:25.110546 Duration: 1626.836 ms
Name: arvados-controller-package-install-gems-deps-pkg-installed - Function: pkg.installed - Result: Clean Started: - 05:46:26.740569 Duration: 349.975 ms
Name: arvados-cli - Function: gem.installed - Result: Clean Started: - 05:46:27.093146 Duration: 656.533 ms
Name: arvados-controller - Function: pkg.installed - Result: Changed Started: - 05:46:27.751995 Duration: 4402.48 ms
Name: arvados-controller - Function: service.running - Result: Changed Started: - 05:46:32.205163 Duration: 941.021 ms
----------
ID: arvados-controller-service-running-service-ready-cmd-run
Function: cmd.run
Name: while ! (curl -k -s https://dbn10.local:8800 | \
grep -qE "req-[a-z0-9]{20}.{5}error_token") do
echo 'waiting for API to be ready...'
sleep 1
done
Result: False
Comment: Command "while ! (curl -k -s https://dbn10.local:8800 | \
grep -qE "req-[a-z0-9]{20}.{5}error_token") do
echo 'waiting for API to be ready...'
sleep 1
done
" run
Started: 05:46:33.150493
Duration: 121883.032 ms
Changes:
----------
pid:
3898
retcode:
1
stderr:
stdout:
while ! (curl -k -s https://dbn10.local:8800 | \
grep -qE "req-[a-z0-9]{20}.{5}error_token") do
echo 'waiting for API to be ready...'
sleep 1
done
: Timed out after 120 seconds
Name: nginx - Function: service.mod_watch - Result: Changed Started: - 05:48:35.038389 Duration: 47.401 ms
Summary for local
--------------
Succeeded: 150 (changed=116)
Failed: 1
--------------
Total states run: 151
Total run time: 533.364 s
Build step 'Execute shell' marked build as failure
This includes the jobs for debian10, ubuntu1804, and ubuntu2004. Note debian11 is working fine.
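The failing state polls the controller until the response looks like an Arvados error body. Here is a minimal sketch of what that grep pattern accepts. The JSON body is made up to fit the pattern (it is not copied from a real server), and looks_ready is a hypothetical helper name:

```shell
#!/bin/sh
# Same pattern as the readiness check in the log above.
READY_PATTERN='req-[a-z0-9]{20}.{5}error_token'

# looks_ready: hypothetical helper. Reads a response body on stdin and
# succeeds only if it matches the readiness pattern.
looks_ready() {
    grep -qE "$READY_PATTERN"
}

# Hypothetical sample body: a 20-character request ID, five separator
# characters, then error_token.
sample='{"errors":["Forbidden (req-0123456789abcdefghij)"],"error_token":"1700000000+abcdef01"}'

if printf '%s\n' "$sample" | looks_ready; then
    echo "ready"
else
    echo "not ready"
fi
```

Anything that is not such an error body (an empty reply, a proxy error page) keeps the loop waiting, which fits a RailsAPI that never comes up.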
- Status changed from New to In Progress
Some initial thoughts:
Is something else running on port 8800? The shell that's running is written in such a way that it would go into an infinite loop in that case. On the other hand, the only thing that should be changing is where the installer gets Workbench 2 from, so there shouldn't be any changes in port numbers. Plus git grep '\b8800\b' doesn't return much, although maybe it's a default in one of Workbench 2's dependencies.
Why is debian11 passing when the others aren't? This isn't super surprising since it's closest to the platform where we do development. Is there some key version difference that debian11 is quietly relying on? Or did something get updated for debian11 that needs to be updated for other targets as well?
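On the infinite-loop point: in the log it is the outer cmd.run timeout that ends the wait after 120 seconds. One way the loop could give up on its own with a clear message is sketched below. This is an illustration only, not what the installer does; wait_for and api_ready are hypothetical names.

```shell
#!/bin/sh
# wait_for: hypothetical helper. Runs a check command once per second until
# it succeeds or the deadline passes. Returns 0 on success, 1 on timeout.
wait_for() {
    timeout_secs=$1
    shift
    deadline=$(( $(date +%s) + timeout_secs ))
    until "$@"; do
        if [ "$(date +%s)" -ge "$deadline" ]; then
            echo "gave up after ${timeout_secs}s: $*" >&2
            return 1
        fi
        echo 'waiting for API to be ready...'
        sleep 1
    done
}

# Illustrative usage only (host and port taken from the log above):
# api_ready() {
#     curl -k -s https://dbn10.local:8800 |
#         grep -qE 'req-[a-z0-9]{20}.{5}error_token'
# }
# wait_for 120 api_ready
```

The check command is passed as arguments so the same helper can wrap any readiness test.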
- Description updated (diff)
It's apparently #20862. test-provision succeeded on the 2.7.0 release, then fails on 80794f079f005fd3d927b9d330a46bcc96a1a132, which is the merge immediately after. And this actually makes more sense with the error: I'm guessing the gem shuffling causes a deployment problem for RailsAPI, and this means the up test never passes.
Brett Smith wrote in #note-1:
Why is debian11 passing when the others aren't? This isn't super surprising since it's closest to the platform where we do development. Is there some key version difference that debian11 is quietly relying on? Or did something get updated for debian11 that needs to be updated for other targets as well?
Judging by the logs, the failing platforms install Ruby from RVM, while debian11 uses packaged Ruby. Seems likely to be relevant. This was incorrect: ubuntu2004 is failing but also uses packaged Ruby.
debian11 installs arvados-login-sync where the others do not. This was also a red herring: it only happens because debian11 doesn't fail before this point.
Comparing the states that are run via rg --color=never -o '\b\S+ state [^]]+\]' on the Jenkins logs, the only differences between ubuntu2004 and debian11 are what you would expect: names, PostgreSQL version number, etc. There's no difference at the deployment level that would explain why debian11 succeeds and ubuntu2004 fails. It's something lower-level.
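A rough sketch of that comparison, for anyone who wants to repeat it on two saved console logs. It uses grep -oE with POSIX character classes as a stand-in for the rg command, so it drops the leading \b and may match slightly differently. The function names and log file names are hypothetical.

```shell
#!/bin/sh
# list_states: print the unique "<word> state [<name>]" mentions in one log.
# grep -oE stands in for the rg invocation quoted above.
list_states() {
    grep -oE '[^[:space:]]+ state [^]]+\]' "$1" | sort -u
}

# diff_states: hypothetical helper. Shows state mentions present in only one
# of two logs. Lines prefixed "<" come from the first log, ">" from the second.
# Returns diff's exit status (0 = identical, 1 = differences found).
diff_states() {
    a=$(mktemp) && b=$(mktemp) || return 2
    list_states "$1" > "$a"
    list_states "$2" > "$b"
    diff "$a" "$b"
    status=$?
    rm -f "$a" "$b"
    return "$status"
}

# Illustrative usage only (file names are assumptions):
# diff_states ubuntu2004-console.log debian11-console.log
```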
I was able to install the arvados-api-server development package in my own Ubuntu 20.04 container. That's what you'd expect, since Jenkins can do that too, but I was keeping an eye out for any error messages that Jenkins might be ignoring and didn't see any.
Trying to replicate this further requires actually running the services, which is more challenging in a container. So either I'd need a way to get into the running Jenkins worker and debug, or else recreate the environment in a VM myself. That's expensive enough that I'm not going to do it without someone actually asking me.
So, I'm blocked on this unless someone wants to give me an idea to pursue, SSH access to Jenkins workers, or permission to go deep on this.
- Target version changed from Development 2023-10-11 sprint to Development 2023-10-25 sprint
- Assigned To changed from Brett Smith to Lucas Di Pentima
- Tracker changed from Idea to Bug
- Target version changed from Development 2023-10-25 sprint to Development 2023-11-08 sprint
- Target version changed from Development 2023-11-08 sprint to Development 2023-11-29 sprint
- Subject changed from Most test-provision jobs failing to Most test-provision (and other) jobs failing
The issue I'm observing is the following: test-provision-debian11: #539
...
15:32:03 Running bundle config set --local path /var/www/arvados-api/shared/vendor_bundle... done.
15:32:03 Running bundle install... done.
15:32:03 Ensuring directory and file permissions ...... done.
15:32:03 Setting up database...Defaulting to memory cache, because /var/www/arvados-api/current/tmp/cache does not exist
15:32:03 rake aborted!
15:32:03 Don't know how to build task 'db:structure:load' (See the list of available tasks with `rake --tasks`)
15:32:03 Did you mean? db:structure:dump
15:32:03
15:32:03 (See full trace by running task with --trace)
15:32:03 failed.
15:32:03 dpkg: error processing package arvados-api-server (--configure):
15:32:03 installed arvados-api-server package post-installation script subprocess returned error exit status 1
...
Since Rails 6.1 the db:structure:{load, dump} rake tasks have been deprecated: https://github.com/rails/rails/pull/39470
It seems we should be using db:schema:{load, dump} now, with config.active_record.schema_format set to :sql (as we already do).
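For illustration, a script that has to work across Rails versions could pick the task from what rake itself reports. This is only a sketch with a hypothetical helper name, and it assumes the task appears in the rake --tasks listing (which shows only tasks that have a description). It is not what the packaging scripts do.

```shell
#!/bin/sh
# pick_load_task: hypothetical helper. Reads `rake --tasks` output on stdin
# and prints the schema-loading task to run, preferring the newer name.
pick_load_task() {
    tasks=$(cat)
    if printf '%s\n' "$tasks" | grep -qE '^rake db:schema:load( |$)'; then
        echo 'db:schema:load'
    elif printf '%s\n' "$tasks" | grep -qE '^rake db:structure:load( |$)'; then
        echo 'db:structure:load'
    else
        echo 'no schema load task found' >&2
        return 1
    fi
}

# Illustrative usage only:
# task=$(bundle exec rake --tasks | pick_load_task) && bundle exec rake "$task"
```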
Updates at d6e41e6 - branch 21031-test-provision-fix
Package test run: developer-build-packages-debian11: #15
- Replaces db:structure:load with db:schema:load in RailsAPI package build script.
- Updates postinst.sh script for Rails apps packages to reflect the above.
Given that the above test passed, I'll be merging this to main so that we can build packages & test them on the test-provision jobs.
- Status changed from In Progress to Resolved
Ok, so the above tests didn't use the correct package for some reason, but subsequent runs did, and they all worked OK. Example: test-provision: #726