Bug #22389
closed
Single-host single-hostname installation fails: Rails API server cannot start; /etc/arvados/config.yml "permission denied"
Description
Steps to reproduce:
Follow the recipes for installing Arvados in virtual machine:
1 Create base image: https://docs.google.com/document/d/1Groandn4iLw-2f6PGNlmQhdp5sO3K0LICnUw3RhMzEw/edit?usp=sharing
2 Install Arvados in guest (up to this step) https://docs.google.com/document/d/1w6DQqR3D65DcTCpMX51RrDvz547bYImCrPoY8A71-bE/edit?tab=t.0#bookmark=id.m117nxmcbdvw
The installation will fail; the Rails API server will fail to start.
Inside the guest, following /var/log/nginx/error.log, we can see errors like those seen in the attached files (the referenced Passenger HTML error report file is also attached).
Notably there is a line about
App 26718 output: open /etc/arvados/config.yml: permission denied
which I think explains the subsequent failure to get database password (which is in the config file).
It's unclear to me what is being denied exactly, and how this could happen. After the installation, the config file has owner root:www-data and permissions 620; the directory /etc/arvados has the right permissions too (owner root:www-data, permission 750).
I also tried setting permission to everyone-readable on the config file and everyone-searchable on the /etc/arvados directory, but this did not resolve the problem.
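For anyone chasing a similar "permission denied", a minimal sketch of one way to see every link in the chain at once: it lists the mode, owner uid, and group gid of each path component from / down to the target. The helper name `describe_path` is hypothetical, and the sketch only reports what is on disk. It does not account for other restrictions a service might run under.

```python
import os
import stat
from pathlib import Path


def describe_path(path):
    """Return (path, mode string, uid, gid) for each component from / to path.

    Hypothetical helper for illustration; it only reads metadata with os.stat.
    """
    target = Path(path).resolve()
    # parents runs from the nearest directory up to /, so reverse it.
    components = list(target.parents)[::-1] + [target]
    report = []
    for component in components:
        info = os.stat(component)
        report.append(
            (str(component), stat.filemode(info.st_mode), info.st_uid, info.st_gid)
        )
    return report
```

Calling `describe_path("/etc/arvados/config.yml")` as root inside the guest and printing each tuple would show whether any directory along the way lacks the search bit for the web server's user or group.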
Updated by Zoë Ma about 1 month ago
This packaged version worked: arvados-api-server 3.1.0~dev20241126144535-1
This isn't working: arvados-api-server 3.1.0~dev20241210220956-1
Updated by Brett Smith about 1 month ago
- Related to Bug #22349: RHEL8 Appstream Ruby not useable on 3.0 added
Updated by Brett Smith about 1 month ago
Zoë Ma wrote in #note-1:
This isn't working:
arvados-api-server 3.1.0~dev20241210220956-1
Confirmed this is the last build from the 22349-deploy-bundle-passenger branch, so it should be the least buggy.
Zoë, you're welcome to browse that branch, but in short, it changes the way we deploy the Rails API server backend: now instead of serving it directly from nginx, it runs as a standalone Passenger process (supervised by systemd). It comes with this upgrade note:
The Arvados Rails API server now runs from a standalone Passenger server to simplify deployment. Before upgrading, existing deployments should remove the Rails API server from their nginx configuration. e.g., remove the entire server block with root /var/www/arvados-api/current/public from /etc/nginx/conf.d/arvados-api-and-controller.conf. If you customized this deployment at all, the updated install instructions (/doc/install/install-api-server.html#railsapi-config) explain how to customize the standalone Passenger server.
My first question is, did you actually deploy from scratch, or were you upgrading an existing deployment? There are changes to the Salt installer to configure Rails API correctly, but if you're working from a directory created by the Salt installer previously, you wouldn't pick up those changes automatically. You could be in a situation where you are trying to serve the Rails API backend twice, and I don't know what would happen in that case.
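A quick way to tell whether an existing nginx configuration still carries the old Rails API server block is to look for the root directive named in the upgrade note. The sketch below is only an illustration: the function name `still_serves_railsapi` is hypothetical, and it does a plain text match rather than parsing nginx syntax.

```python
import re

# Document root taken from the upgrade note quoted above.
RAILS_ROOT = "/var/www/arvados-api/current/public"

# Match an uncommented "root <RAILS_ROOT>;" directive at the start of a line.
_ROOT_DIRECTIVE = re.compile(
    r"^\s*root\s+" + re.escape(RAILS_ROOT) + r"/?\s*;", re.MULTILINE
)


def still_serves_railsapi(nginx_conf_text):
    """Return True if the text contains an active root directive for Rails API.

    Hypothetical helper: a text search, not a real nginx config parser.
    """
    return _ROOT_DIRECTIVE.search(nginx_conf_text) is not None
```

Reading /etc/nginx/conf.d/arvados-api-and-controller.conf and passing its contents to this function would return True if the block the note says to remove is still present.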
Second, you wrote that /etc/arvados/config.yml had 620 permissions. Any chance that was a typo? They should at least be 640. By itself, 620 would be enough for a web server to get a permission denied error.
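To make the 620 versus 640 difference concrete, here is a small sketch using Python's standard `stat` module. The helper names `group_can_read` and `show` are made up for this illustration. With owner root:www-data, a web server running in the www-data group depends on the group-read bit.

```python
import stat


def group_can_read(mode):
    """True if the group-read bit is set in a numeric file mode."""
    return bool(mode & stat.S_IRGRP)


def show(mode):
    """Render permission bits the way `ls -l` does for a regular file."""
    return stat.filemode(stat.S_IFREG | mode)


for mode in (0o620, 0o640):
    print(oct(mode), show(mode), "group can read:", group_can_read(mode))
# 0o620 -rw--w---- group can read: False
# 0o640 -rw-r----- group can read: True
```

Mode 620 gives the group write access but not read access, which is why it would produce a permission denied error on open for a process that is only a group member.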
Updated by Zoë Ma about 1 month ago
Sorry, you're right: it was a typo and the permission bits were indeed 640.
I was deploying from scratch on a guest running Ubuntu 22.04 (jammy), by following the recipes linked to in the main post.
In those recipes the instruction for deployment was to install from the 'development' branch (binary packages will be downloaded from the jammy-dev repo instead of jammy; see details). The reason was that the jammy repo used to be empty.
I'm going to deploy again but using the 'production' packages for jammy, and see if the problem arises there too.
Meanwhile I'll also keep a guest running the 'development' packages so you can request more useful info from me if necessary.
Updated by Brett Smith about 1 month ago
It occurs to me, even starting from scratch, you would've been using a Salt formula that wasn't updated for the changes, but got the packages with them. That was basically bound to never work.
The branch just got merged to main. If you're okay trying from scratch again, it seems best to start from there.
Updated by Zoë Ma about 1 month ago
Thank you, Brett. I think you're right. There was a mismatch between the Salt sources and the binary packages.
Now that the binaries have also been updated I'll be investigating further. There seems to be another error (passenger segfaults); more on this later.
Updated by Brett Smith about 1 month ago
Super short version: the "agent" binary that Passenger 6.0.23 downloads seems to be bad. It causes all our test-provision jobs to fail the same way: test-provision: #1068
This agent is part of the Passenger standalone server, so we started using this during the development of #22349. However, that branch still had Passenger 6.0.18 in Gemfile.lock. That version of Passenger also downloads an agent binary, but it works without problem out of the box, so the issue wasn't discovered in testing.
The issue only arose because at the same time that was in development, #22363 also got done, upgrading us to Passenger 6.0.23. Since the agent doesn't get downloaded until the Rails postinst runs, there was basically no way to discover this issue until both branches were combined and then put through test-provision, which is what happened after #22349 got merged this morning.
I believe we can work around the issue by explicitly compiling the agent in the postinst script, instead of downloading it (the default). I've had initial success with that in my own testing VM. I am testing it on Jenkins now, but the whole build+test cycle is over an hour.
Updated by Brett Smith about 1 month ago
- Target version set to Development 2025-01-08
- Assigned To set to Brett Smith
- Status changed from New to In Progress
- Category set to Deployment
There are two branches that are basically just small bugfix branches to get the Jenkins jobs above passing. (test-provision mostly passed except for apt lock contention on one of the deployments, which might be aggravated by the fact that two jobs tried to run at once.)
arvados branch 22349-passenger-6-0-23 @ 3e7ddccf9130fff3b6ef14274e4ea3279e28f745
arvados-formula branch 22349-arvados-railsapi-service @ commit:3a450591ace93b92a881e89880c4b21ccc422034
They're both small enough that there's no change in scale, no doc changes required, etc.
Updated by Lucas Di Pentima about 1 month ago
- Branch: 22349-arvados-railsapi-service (arvados-formula)
  - In running.sls file: I'm not sure if setting a "watch: file: ..." relationship on a service is enough for the service to depend on that file, maybe we'll need to also add it to the "require: ..." keyword?
  - Otherwise LGTM
- Branch: 22349-passenger-6-0-23
  - LGTM
Updated by Brett Smith about 1 month ago
Lucas Di Pentima wrote in #note-9:
In running.sls file: I'm not sure if setting a "watch: file: ..." relationship on a service is enough for the service to depend on that file, maybe we'll need to also add it to the "require: ..." keyword?
Quoting the documentation:
If the "result" of the watched state is True, the watching state will execute normally, and if it is False, the watching state will never run. This part of watch mirrors the functionality of the require requisite.
In other words, watch does what require does, and then some more.
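As a rough illustration of that relationship, here is a toy Python model of the two requisites. It is not Salt code: the `StateResult` class, the `run_dependent` function, and the action labels are all hypothetical. It only encodes the behavior described in the quoted documentation, plus the assumption that the extra watch behavior (for a service, typically a restart) is triggered when the watched state reports changes.

```python
from dataclasses import dataclass


@dataclass
class StateResult:
    """Simplified outcome of the upstream (required or watched) state."""
    result: bool   # did the upstream state succeed?
    changes: bool  # did it report any changes?


def run_dependent(requisite, upstream):
    """Toy model: list the actions a dependent state takes.

    Both requisites block the dependent state when the upstream state fails.
    Only watch adds extra behavior when the upstream state changed something.
    """
    if requisite not in ("require", "watch"):
        raise ValueError("unknown requisite: " + requisite)
    if not upstream.result:
        return []
    actions = ["run"]
    if requisite == "watch" and upstream.changes:
        actions.append("mod_watch")  # assumed label for the extra watch behavior
    return actions
```

In this model, `run_dependent("watch", ...)` returns everything `run_dependent("require", ...)` returns for the same upstream result, so adding a separate require on the same file would be redundant.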
Updated by Brett Smith about 1 month ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|e679e285891a30f1d45addc91746e2be87274e73.