Project

General

Profile

Actions

Bug #21524

closed

test-provision-ubuntu2004 intermittently times out waiting for the controller to come up

Added by Brett Smith 2 months ago. Updated 8 days ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Deployment
Story points:
-

Description

This has happened repeatedly.

  Name: arvados-controller - Function: pkg.installed - Result: Changed Started: - 22:53:27.606297 Duration: 10607.743 ms
  Name: arvados-controller - Function: service.running - Result: Changed Started: - 22:53:38.277567 Duration: 1113.901 ms
----------
          ID: arvados-controller-service-running-service-ready-cmd-run
    Function: cmd.run
        Name: while ! (curl -k -s https://ubu20.local:8800 | \
         grep -qE "req-[a-z0-9]{20}.{5}error_token") do
  echo 'waiting for API to be ready...'
  sleep 1
done

      Result: False
     Comment: Command "while ! (curl -k -s https://ubu20.local:8800 | \
                       grep -qE "req-[a-z0-9]{20}.{5}error_token") do
                echo 'waiting for API to be ready...'
                sleep 1
              done
              " run
     Started: 22:53:39.397300
    Duration: 120828.503 ms
     Changes:   
              ----------
              pid:
                  37097
              retcode:
                  1
              stderr:
              stdout:
                  while ! (curl -k -s https://ubu20.local:8800 | \
                           grep -qE "req-[a-z0-9]{20}.{5}error_token") do
                    echo 'waiting for API to be ready...'
                    sleep 1
                  done
                   : Timed out after 120 seconds
  Name: nginx - Function: service.mod_watch - Result: Changed Started: - 22:55:40.233381 Duration: 63.787 ms

Summary for local
--------------
Succeeded: 144 (changed=106)
Failed:      1
--------------

Subtasks 1 (1 open0 closed)

Task #21557: ReviewNewLucas Di PentimaActions

Related issues

Related to Arvados - Support #21661: Test provision ubuntu-20.04 passesResolvedBrett SmithActions
Blocked by Arvados - Bug #21583: Running RailsAPI with Passenger implicity requires Ruby 3.3 via base64 0.2.0 lockResolvedBrett SmithActions
Actions #1

Updated by Brett Smith 2 months ago

One possibility is that the stack is ready but the server is returning a different error than expected, keeping the client stuck in its loop.

Actions #2

Updated by Brett Smith about 2 months ago

  • Target version changed from Future to Development 2024-03-13 sprint
  • Assigned To set to Brett Smith

This is happening on basically every run, it needs to be dealt with.

Actions #3

Updated by Peter Amstutz about 2 months ago

  • Target version set to Development 2024-03-13 sprint
  • Tracker changed from Idea to Bug
Actions #4

Updated by Brett Smith about 2 months ago

Conveniently(?), I got this exact same problem while testing #21383. The newly-created API server returns a Passenger error page:

We're sorry, but something went wrong.
The issue has been logged for investigation. Please try again later.
Technical details for the administrator of this website
Error ID: f15e3dfd
Details: Web application could not be started by the Phusion Passenger(R) application server.

Actions #5

Updated by Brett Smith about 2 months ago

[ E 2024-03-12 15:12:44.8347 907382/Tf age/Cor/App/Implementation.cpp:221 ]: Could not spawn process for application /var/www/arvados-api/current: The application encountered the following error: You have already activated base64 0.1.1, but your Gemfile requires base64 0.2.0. Since base64 is a default gem, you can either remove your dependency on it or try updating to a newer version of bundler that supports base64 as a default gem. (Gem::LoadError)
Actions #6

Updated by Brett Smith about 2 months ago

  • Blocked by Bug #21583: Running RailsAPI with Passenger implicity requires Ruby 3.3 via base64 0.2.0 lock added
Actions #7

Updated by Peter Amstutz about 2 months ago

  • Target version changed from Development 2024-03-13 sprint to Development 2024-03-27 sprint
Actions #8

Updated by Peter Amstutz about 1 month ago

  • Target version changed from Development 2024-03-27 sprint to Development 2024-04-10 sprint
Actions #9

Updated by Peter Amstutz 17 days ago

  • Target version changed from Development 2024-04-10 sprint to Development 2024-04-24 sprint
Actions #10

Updated by Peter Amstutz 17 days ago

Actions #11

Updated by Brett Smith 16 days ago

43c64a7a9af20e0690e01b4e2484fa7196c66973 seems very likely to be relevant, since Ubuntu 20.04 runs Ruby 2.7. It may still need passenger_preload_bundler on from #21583 as well.

Actions #12

Updated by Brett Smith 16 days ago

In a focal Docker container, installing arvados-api-server reports this problem:

rake aborted!                                                                                                                            
URI::InvalidURIError: the scheme postgresql does not accept registry part: z2a09_arvados:PASSWORD@z2a09_db: (or bad hostname?)                                                                                                                    
/var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/activerecord-7.0.8/lib/active_record/database_configurations/connection_url_resolver.rb:27:in `initialize'                                                                                                              
/var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/activerecord-7.0.8/lib/active_record/database_configurations/url_config.rb:48:in `new'                                                                                                                                  
/var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/activerecord-7.0.8/lib/active_record/database_configurations/url_config.rb:48:in `build_url_hash'                                                                                                                       
/var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/activerecord-7.0.8/lib/active_record/database_configurations/url_config.rb:38:in `initialize'                                                                                                                           
/var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/activerecord-7.0.8/lib/active_record/database_configurations.rb:246:in `new'   
/var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/activerecord-7.0.8/lib/active_record/database_configurations.rb:246:in `environment_url_config'                                                                                                                         
/var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/activerecord-7.0.8/lib/active_record/database_configurations.rb:161:in `build_configs'                                                                                                                                  
/var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/activerecord-7.0.8/lib/active_record/database_configurations.rb:20:in `initialize'                                                                                                                                      
/var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/activerecord-7.0.8/lib/active_record/tasks/database_tasks.rb:157:in `new'      
/var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/activerecord-7.0.8/lib/active_record/tasks/database_tasks.rb:157:in `for_each' 
/var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/activerecord-7.0.8/lib/active_record/railties/databases.rake:34:in `block (2 levels) in <top (required)>'                                                                                                               
/var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/activerecord-7.0.8/lib/active_record/railties/databases.rake:29:in `block in <top (required)>'                                                                                                                          
/var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/activerecord-7.0.8/lib/active_record/railties/databases.rake:9:in `<top (required)>'                                                                                                                                    
/var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/activerecord-7.0.8/lib/active_record/railtie.rb:54:in `load'                   
/var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/activerecord-7.0.8/lib/active_record/railtie.rb:54:in `block in <class:Railtie>'                                                                                                                                        
/var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/railties-7.0.8/lib/rails/railtie.rb:286:in `instance_exec'                      
/var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/railties-7.0.8/lib/rails/railtie.rb:286:in `block in run_tasks_blocks'          
/var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/railties-7.0.8/lib/rails/railtie.rb:298:in `each'
/var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/railties-7.0.8/lib/rails/railtie.rb:298:in `each_registered_block'
/var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/railties-7.0.8/lib/rails/railtie.rb:286:in `run_tasks_blocks'
/var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/railties-7.0.8/lib/rails/application.rb:500:in `block in run_tasks_blocks'
/var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/railties-7.0.8/lib/rails/engine/railties.rb:15:in `each'
/var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/railties-7.0.8/lib/rails/engine/railties.rb:15:in `each'
/var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/railties-7.0.8/lib/rails/application.rb:500:in `run_tasks_blocks'
/var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/railties-7.0.8/lib/rails/engine.rb:464:in `load_tasks'
/var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/railties-7.0.8/lib/rails/railtie.rb:226:in `public_send'
/var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/railties-7.0.8/lib/rails/railtie.rb:226:in `method_missing'
/var/www/arvados-api/current/Rakefile:12:in `<top (required)>'
(See full trace by running task with --trace)
 failed.

The package install exits 0, so higher levels think it succeeded. I don't think this is a networking problem, because in the same container I can connect to that PostgreSQL host with psql:

root@635a365e8a54:/# psql -h z2a09_db z2a09_arvados z2a09_arvados
Password for user z2a09_arvados: 
psql (12.18 (Ubuntu 12.18-0ubuntu0.20.04.1), server 16.2 (Debian 16.2-1.pgdg120+2))
WARNING: psql major version 12, server major version 16.
         Some psql features might not work.
Type "help" for help.

z2a09_arvados=> 
Actions #13

Updated by Brett Smith 16 days ago

Nevermind, the previous is caused by the underscore, which is not valid.

Actions #14

Updated by Brett Smith 16 days ago

App 39774 output: /usr/lib/ruby/2.7.0/bundler/spec_set.rb:86:in `block in materialize': Could not find zlib-3.1.0 in any of the sources (Bundler::GemNotFound)
App 39774 output:       from /usr/lib/ruby/2.7.0/bundler/spec_set.rb:80:in `map!'
App 39774 output:       from /usr/lib/ruby/2.7.0/bundler/spec_set.rb:80:in `materialize'
App 39774 output:       from /usr/lib/ruby/2.7.0/bundler/definition.rb:170:in `specs'
App 39774 output:       from /usr/lib/ruby/2.7.0/bundler/definition.rb:237:in `specs_for'
App 39774 output:       from /usr/lib/ruby/2.7.0/bundler/definition.rb:226:in `requested_specs'
App 39774 output:       from /usr/lib/ruby/2.7.0/bundler/runtime.rb:101:in `block in definition_method'
App 39774 output:       from /usr/lib/ruby/2.7.0/bundler/runtime.rb:20:in `setup'
App 39774 output:       from /usr/lib/ruby/2.7.0/bundler.rb:149:in `setup'
App 39774 output:       from /usr/lib/ruby/2.7.0/bundler/setup.rb:20:in `block in <top (required)>'
App 39774 output:       from /usr/lib/ruby/2.7.0/bundler/ui/shell.rb:136:in `with_level'
App 39774 output:       from /usr/lib/ruby/2.7.0/bundler/ui/shell.rb:88:in `silence'
App 39774 output:       from /usr/lib/ruby/2.7.0/bundler/setup.rb:20:in `<top (required)>'
App 39774 output:       from /usr/lib/ruby/2.7.0/rubygems/core_ext/kernel_require.rb:92:in `require'
App 39774 output:       from /usr/lib/ruby/2.7.0/rubygems/core_ext/kernel_require.rb:92:in `require'
[ E 2024-04-11 18:31:53.5661 35215/Tbo age/Cor/App/Implementation.cpp:221 ]: Could not spawn process for application /var/www/arvados-api/current: The application process exited prematurely.
  Error ID: eb4ffd2e
  Error details saved to: /tmp/passenger-error-yvLGD0.html

But:

$ ls -l "$PWD/zlib-3.1.0.gem" 
-rw-r--r-- 1 root root 38912 Apr 11 16:38 /var/www/arvados-api/current/vendor/cache/zlib-3.1.0.gem

So that's… odd.

Actions #15

Updated by Brett Smith 16 days ago

vendor/cache is not the right place to look for gems to load, instead it's /var/www/arvados-api/shared/vendor_bundle/ruby/VERSION, and sure enough zlib is missing from that. The build-packages-ubuntu2004 log calls out that it's installing zlib 3.1.0 without issue, so not clear yet why it's missing.

Actions #16

Updated by Brett Smith 16 days ago

Can reproduce the problem locally. Whatever process is expected to install the zlib gem inside that directory is not.

Actions #17

Updated by Brett Smith 16 days ago

I think we got it. Strap in.

The vendor_bundle directory is not intended to ship in the package itself, and the built package correctly does not include anything there. Instead it is built by the package's postinst script, source:build/rails-package-scripts/postinst.sh:

 if ! $COMMAND_PREFIX bundle --version >/dev/null 2>&1; then
      run_and_report "Installing bundler" $COMMAND_PREFIX gem install bundler --version 2.2.19 --no-document
  fi

  run_and_report "Running bundle config set --local path $SHARED_PATH/vendor_bundle" \
      $COMMAND_PREFIX bin/bundle config set --local path $SHARED_PATH/vendor_bundle

  run_and_report "Running bundle install" \
      $COMMAND_PREFIX bin/bundle install --local --quiet

So why isn't it installing zlib 3.1.0? Well, the Salt installer installs the arvados-cli gem on the shell node, and it does this directly with gem. This also pulls in zlib 3.1.0, but installs it to the system gem home, /var/lib/gems.

When you run bundle install, it seems to pick up on this install, and think everything is hunky-dory. But when you run bundle exec and friends, it sets up a stricter environment, and then can no longer find zlib 3.1.0, and fails.

This theory explains why the job started failing with cbfdb1b66ab9c1b6e69d1c9cd589633386267177: that updated the zlib lock to 3.1.0, which happens to be the latest version, which means the system gem version happens to match the bundle gem version. Before this there was a mismatch so (I'm guessing) bundle install would do what we wanted.

Why doesn't this problem appear on my bookworm test install? For whatever reason, the focal install has Bundler 2.2.19, while bookworm has Bundler 2.4.19. I'm guessing that's relevant. It does feel like a Bundler bug that bundle install and bundle exec don't look at the same set of gems, so I wouldn't be surprised if that got addressed in a 2.2→2.4 transition.

Actions #18

Updated by Brett Smith 15 days ago

Brett Smith wrote in #note-17:

Why doesn't this problem appear on my bookworm test install? For whatever reason, the focal install has Bundler 2.2.19, while bookworm has Bundler 2.4.19. I'm guessing that's relevant. It does feel like a Bundler bug that bundle install and bundle exec don't look at the same set of gems, so I wouldn't be surprised if that got addressed in a 2.2→2.4 transition.

Doing the obvious thing of upgrading Bundler, then running bundle install and bundle exec still sees the same problem and doesn't change anything.

Here are some other differences between Bundler versions:

Distribution Version included with distro Ruby Version installed by us Version in vendor_cache
focal 2.1.2 2.2.19 None
bookworm 2.3.7 2.2.19 2.4.19

Note that we attempt to standardize on 2.2.19. However, when the postinst script runs bin/bundle, that seems to pick up whatever latest version is available. So it's possible when we run bundle install from postinst on bookworm, we get new functionality added to Bundler through version 2.3.7 that we don't get elsewhere.

Another possibility is, I am comparing a single-node focal install to a multi-node bookworm install. It's possible there's differences in what Salt does there that causes changes. However, I think that's less likely, because I note at least the bookworm controller node still has the arvados-api gem installed system-wide (why, I don't know). So it also has zlib 3.1.0 as a system gem just like focal does. I'm not ruling out this possibility, just saving it for investigation later.

Actions #19

Updated by Brett Smith 15 days ago

Brett Smith wrote in #note-18:

Note that we attempt to standardize on 2.2.19. However, when the postinst script runs bin/bundle, that seems to pick up whatever latest version is available. So it's possible when we run bundle install from postinst on bookworm, we get new functionality added to Bundler through version 2.3.7 that we don't get elsewhere.

This PR to make sure Bundler runs as the same version in Gemfile.lock was part of the 2.3.0 release. And sure enough you see it working, e.g., in the logs for build-packages-debian12:

15:55:32 Don't run Bundler as root. Installing your bundle as root will break this
15:55:32 application for all non-root users on this machine.
15:55:32 Bundler 2.4.10 is running, but your lockfile was generated with 2.4.19. Installing Bundler 2.4.19 and restarting using that version.
15:55:33 Fetching gem metadata from https://rubygems.org/.
15:55:33 Fetching bundler 2.4.19
15:55:33 Installing bundler 2.4.19
[now it re-execs and repeats the same warning:]
15:55:34 Don't run Bundler as root. Installing your bundle as root will break this
15:55:34 application for all non-root users on this machine.

In general I believe our attempt to "standardize" on Bundler 2.2.19 is a farce. We install it a lot but it's only setting a minimum version. Running the bundler command will pick up the latest version, which is >2.2.19 on every distro after debian11. And when it's >=2.3.0, Bundler will standardize itself on the version in Gemfile.lock, which is currently 2.4.19. And all this is true both when we build packages and when they get installed. And note when we build packages what matters is the version of Bundler that came with the Ruby we installed with RVM, not the distro Ruby.

I would like to expand the scope of this ticket to be a bunch of Ruby build cleanup:

  • Stop using RVM in our package build Dockerfiles, just use the distro Ruby to better match the install environment.
  • Make our "standard" Bundler install gem install --conservative --version '>=2.4.0' bundler. As long as we have some version matching that then I think we can rely on Bundler's own lock behavior to take us the rest of the way.
  • Add an API server test that Bundler is locked to version 2.4.X, because 2.5.0 drops support for Ruby 2.7.
  • Standardize API server on Bundler 2.4.22, because that's the last 2.4.X release so why not.

I'll be honest, I have no idea whether or not this will actually fix anything for test-provision-ubuntu2004, but I still want to do it because:

  • At this point I've read enough to be confident this is the right approach.
  • All the RVM stuff in our package builds is historical cruft that isn't doing anything for us anymore. Removing it would be easier to maintain long-term and also speed up package builds.
  • It would make this stuff easier to debug because then instead of trying to "standardize" our Bundler version in twenty different places we can trust that the version in Gemfile.lock is the one that actually matters.
Actions #20

Updated by Tom Clegg 15 days ago

All of #note-19 sounds good to me.

Actions #21

Updated by Brett Smith 11 days ago

I have confirmed that the problem is related to some interaction between gems installed system-wide and setting up the bundle for arvados-api server. I set up the base state by running:

sudo apt remove arvados-api-server
sudo rm -rf /var/lib/gems/2.7.0/ /var/www/arvados-api/
sudo gem install --conservative --version '~>2.4.0' bundler

From this state, if I just sudo dpkg -i arvados-api-server_VERSION.deb, it works. If I sudo gem install arvados-cli, then install the arvados-api-server package, it fails.

I was hoping a Bundler ugprade would fix this bug but apparently not. :/

Actions #22

Updated by Brett Smith 10 days ago

I have a minimal reproduction that takes out all of our packaging infrastructure, etc. On my development box:

% no | bundle gem ubuntubug
% cd ubuntubug
% cat >ubuntubug.gemspec <<EOF
# frozen_string_literal: true

require_relative "lib/ubuntubug/version" 

Gem::Specification.new do |spec|
  spec.name = "ubuntubug" 
  spec.version = Ubuntubug::VERSION
  spec.authors = ["Brett Smith"]
  spec.email = ["brett.smith@curii.com"]

  spec.summary = "Reproduction of Arvados issue #21524" 
  spec.homepage = "https://dev.arvados.org/issues/21524" 
  spec.required_ruby_version = ">= 2.6.0" 

  # Specify which files should be added to the gem when it is released.
  # The `git ls-files -z` loads the files in the RubyGem that have been added into git.
  spec.files = Dir.chdir(__dir__) do
    `git ls-files -z`.split("\x0").reject do |f|
      (File.expand_path(f) == __FILE__) ||
        f.start_with?(*%w[bin/ test/ spec/ features/ .git .circleci appveyor Gemfile])
    end
  end
  spec.bindir = "exe" 
  spec.executables = spec.files.grep(%r{\Aexe/}) { |f| File.basename(f) }
  spec.require_paths = ["lib"]

  spec.add_dependency "zlib", "= 3.1.0" 
end
EOF
% bundle install
% bundle config set cache_all true
% bundle package

So far everything looks good: rake and zlib gem files are in vendor/cache. I transfer the whole gem directory to an Ubuntu 20.04 box, and inside that directory, run:

% gem install bundler --version 2.4.22
Successfully installed bundler-2.4.22                                                                                                    
[…]
% mkdir vendor_bundle
% bundle config set --local path "$PWD/vendor_bundle" 
% bundle install --local
Installing rake 13.2.1
Installing zlib 3.1.0 with native extensions
Updating files in vendor/cache
Bundle complete! 2 Gemfile dependencies, 4 gems now installed.
Bundled gems are installed into `./vendor_bundle`
% bundle exec ruby -e 'puts "hi"'
hi

This prints the expected output. Next:

% rm -rf vendor_bundle/*
% gem install zlib --version 3.1.0
Fetching zlib-3.1.0.gem
Building native extensions. This could take a while...
Successfully installed zlib-3.1.0
[…]
% bundle install --local
Installing rake 13.2.1
Updating files in vendor/cache
Bundle complete! 2 Gemfile dependencies, 4 gems now installed.
Bundled gems are installed into `./vendor_bundle`
% bundle exec ruby -e 'puts "hi"'
Could not find zlib-3.1.0 in locally installed gems
Run `bundle install` to install missing gems.

From here I can file a bug, play with possible workarounds without going through the whole package-building infrastructure, etc.

Actions #23

Updated by Brett Smith 10 days ago

One thing I don't really understand is why we do bundle install at postinst time. The reason there's a distinction between bundle package and bundle install is that it lets you run each part on different machines which might have different C libraries, architectures, etc. Since our packaging environment makes all that stuff consistent, we don't need to worry about it. It seems to me we could fully install the gems locally when we package API server and ship that, and not have to touch Bundler configuration, installed Gems, etc. at postinst time at all. We're already doing the analogous thing when we ship virtualenvs in Python packages. If I'm right, this would make the package both easier to maintain and faster to install.

Actions #24

Updated by Brett Smith 8 days ago

  • Status changed from New to Resolved
Actions

Also available in: Atom PDF