Bug #22389

closed

Single-host single-hostname installation fails: Rails API server cannot start; /etc/arvados/config.yml "permission denied"

Added by Zoë Ma 11 days ago. Updated 8 days ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Deployment
Target version:
Story points: -

Description

Steps to reproduce:
Follow the recipes for installing Arvados in a virtual machine:
1. Create the base image: https://docs.google.com/document/d/1Groandn4iLw-2f6PGNlmQhdp5sO3K0LICnUw3RhMzEw/edit?usp=sharing
2. Install Arvados in the guest (up to this step): https://docs.google.com/document/d/1w6DQqR3D65DcTCpMX51RrDvz547bYImCrPoY8A71-bE/edit?tab=t.0#bookmark=id.m117nxmcbdvw

The installation will fail; the Rails API server will fail to start.

Inside the guest, /var/log/nginx/error.log shows errors like those in the attached files (the Passenger HTML error report file that it references is also attached).

Notably there is a line about

App 26718 output: open /etc/arvados/config.yml: permission denied

which I think explains the subsequent failure to get the database password (which is in the config file).

It's unclear to me what is being denied exactly, and how this could happen. After the installation, the config file has owner root:www-data and permissions 620; the directory /etc/arvados has the right permissions too (owner root:www-data, permission 750).

I also tried setting permission to everyone-readable on the config file and everyone-searchable on the /etc/arvados directory, but this did not resolve the problem.


Files

passenger-error-pVsjPr.html (403 KB): Phusion Passenger error report file in HTML. Zoë Ma, 12/10/2024 10:23 PM
error.log (2.91 KB): /var/log/nginx/error.log excerpt. Zoë Ma, 12/10/2024 10:31 PM

Subtasks: 2 (0 open, 2 closed)

Task #22392: Review arvados branch 22349-passenger-6-0-23 (Resolved, Brett Smith, 12/12/2024)
Task #22393: Review arvados-formula branch 22349-arvados-railsapi-service (Resolved, Brett Smith, 12/13/2024)

Related issues: 1 (0 open, 1 closed)

Related to Arvados - Bug #22349: RHEL8 Appstream Ruby not useable on 3.0 (Resolved)

Actions #1

Updated by Zoë Ma 11 days ago

This packaged version worked:
arvados-api-server 3.1.0~dev20241126144535-1

This isn't working:
arvados-api-server 3.1.0~dev20241210220956-1
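
For reference, the two version strings can be compared by their embedded build stamps. The sketch below assumes that the 14 digits after "~dev" are a YYYYMMDDHHMMSS timestamp; that is inferred from the two strings above, not from Arvados packaging documentation, and dev_build_time is a hypothetical helper name.

```python
import re
from datetime import datetime


def dev_build_time(version: str) -> datetime:
    """Extract the build timestamp from a dev package version string.

    Assumption: the 14 digits following '~dev' are YYYYMMDDHHMMSS.
    """
    match = re.search(r"~dev(\d{14})", version)
    if match is None:
        raise ValueError(f"no ~dev timestamp found in {version!r}")
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")


working = dev_build_time("3.1.0~dev20241126144535-1")
failing = dev_build_time("3.1.0~dev20241210220956-1")

print("working build:", working)
print("failing build:", failing)
print("gap between builds:", failing - working)
```

If the assumption holds, any change that broke the install landed in the packages between those two build times.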

Actions #2

Updated by Brett Smith 11 days ago

  • Related to Bug #22349: RHEL8 Appstream Ruby not useable on 3.0 added
Actions #3

Updated by Brett Smith 11 days ago

Zoë Ma wrote in #note-1:

This isn't working:
arvados-api-server 3.1.0~dev20241210220956-1

Confirmed this is the last build from the 22349-deploy-bundle-passenger branch, so it should be the least buggy.

Zoë, you're welcome to browse that branch, but in short, it changes the way we deploy the Rails API server backend: now instead of serving it directly from nginx, it runs as a standalone Passenger process (supervised by systemd). It comes with this upgrade note:

The Arvados Rails API server now runs from a standalone Passenger server to simplify deployment. Before upgrading, existing deployments should remove the Rails API server from their nginx configuration. e.g., remove the entire server block with root /var/www/arvados-api/current/public from /etc/nginx/conf.d/arvados-api-and-controller.conf. If you customized this deployment at all, the updated install instructions (/doc/install/install-api-server.html#railsapi-config) explain how to customize the standalone Passenger server.

My first question is, did you actually deploy from scratch, or were you upgrading an existing deployment? There are changes to the Salt installer to configure Rails API correctly, but if you're working from a directory created by the Salt installer previously, you wouldn't pick up those changes automatically. You could be in a situation where you are trying to serve the Rails API backend twice, and I don't know what would happen in that case.

Second, you wrote that /etc/arvados/config.yml had 620 permissions. Any chance that was a typo? They should at least be 640. By itself, 620 would be enough for a web server to get a permission denied error.
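
To illustrate why 620 alone would be a problem, here is a minimal sketch using Python's standard stat module. It only inspects the mode bits; it does not consider ownership, ACLs, or the permissions of parent directories, and group_can_read is a hypothetical helper name.

```python
import stat


def group_can_read(mode: int) -> bool:
    """Return True if the group-read bit is set in a Unix permission mode."""
    return bool(mode & stat.S_IRGRP)


for mode in (0o620, 0o640):
    # stat.filemode renders the bits the way `ls -l` does for a regular file.
    rendered = stat.filemode(stat.S_IFREG | mode)
    print(f"{mode:o} {rendered} group can read: {group_can_read(mode)}")
```

With 620 the group has write but not read access, so a process that reaches the file through the www-data group (rather than as the owner) could not open it for reading. With 640 the group-read bit is set.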

Actions #4

Updated by Zoë Ma 10 days ago

Sorry, you're right: it was a typo, and the permission bits were indeed 640.

I was deploying from scratch on a guest running Ubuntu 22.04 (jammy), by following the recipes linked to in the main post.

In those recipes the instruction for deployment was to install from the 'development' branch (binary packages will be downloaded from the jammy-dev repo instead of jammy; see details). The reason was that the jammy repo used to be empty.

I'm going to deploy again but using the 'production' packages for jammy, and see if the problem arises there too.

Meanwhile I'll also keep a guest running the 'development' packages so you can request more useful info from me if necessary.

Actions #5

Updated by Brett Smith 10 days ago

It occurs to me, even starting from scratch, you would've been using a Salt formula that wasn't updated for the changes, but got the packages with them. That was basically bound to never work.

The branch just got merged to main. If you're okay trying from scratch again, it seems best to start from there.

Actions #6

Updated by Zoë Ma 10 days ago

Thank you, Brett. I think you're right. There was a mismatch between the Salt sources and the binary packages.

Now that the binaries have also been updated, I'll investigate further. There seems to be another error (Passenger segfaults); more on this later.

Actions #7

Updated by Brett Smith 10 days ago

Super short version: the "agent" binary that Passenger 6.0.23 downloads seems to be bad. It causes all our test-provision jobs to fail the same way: test-provision: #1068

This agent is part of the Passenger standalone server, so we started using this during the development of #22349. However, that branch still had Passenger 6.0.18 in Gemfile.lock. That version of Passenger also downloads an agent binary, but it works without problem out of the box, so the issue wasn't discovered in testing.

The issue only arose because at the same time that was in development, #22363 also got done, upgrading us to Passenger 6.0.23. Since the agent doesn't get downloaded until the Rails postinst runs, there was basically no way to discover this issue until both branches were combined and then put through test-provision, which is what happened after #22349 got merged this morning.

I believe we can work around the issue by explicitly compiling the agent in the postinst script, instead of downloading it (the default). I've had initial success with that in my own testing VM. I am testing it on Jenkins now, but the whole build+test cycle is over an hour.

build-packages-multijob: #4463

test-provision: #1069

Actions #8

Updated by Brett Smith 10 days ago

  • Target version set to Development 2025-01-08
  • Assigned To set to Brett Smith
  • Status changed from New to In Progress
  • Category set to Deployment

There are two branches that are basically just small bugfix branches to get the Jenkins jobs above passing. (test-provision mostly passed except for apt lock contention on one of the deployments, which might be aggravated by the fact that two jobs tried to run at once.)

arvados branch 22349-passenger-6-0-23 @ 3e7ddccf9130fff3b6ef14274e4ea3279e28f745

arvados-formula branch 22349-arvados-railsapi-service @ 3a450591ace93b92a881e89880c4b21ccc422034

They're both small enough that there's no change in scale, no doc changes required, etc.

Actions #9

Updated by Lucas Di Pentima 9 days ago

  • Branch: 22349-arvados-railsapi-service (arvados-formula)
    • In running.sls file: I'm not sure if setting a "watch: file: ..." relationship on a service is enough for the service to depend on that file, maybe we'll need to also add it to the "require: ..." keyword?
    • Otherwise LGTM
  • Branch: 22349-passenger-6-0-23
    • LGTM
Actions #10

Updated by Brett Smith 9 days ago

Lucas Di Pentima wrote in #note-9:

In running.sls file: I'm not sure if setting a "watch: file: ..." relationship on a service is enough for the service to depend on that file, maybe we'll need to also add it to the "require: ..." keyword?

Quoting the documentation:

If the "result" of the watched state is True, the watching state will execute normally, and if it is False, the watching state will never run. This part of watch mirrors the functionality of the require requisite.

In other words, watch does what require does, and then some more.

Actions #11

Updated by Brett Smith 9 days ago

  • Status changed from In Progress to Resolved