Idea #17344

closed

[boot] Make arvados-server-easy package suitable for demo use case

Added by Tom Clegg about 3 years ago. Updated over 1 year ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Story points:
-
Release relationship:
Auto

Description

Resolve outstanding issues:
  • Install arv-mount so a-d-c loopback driver can use it
  • Avoid leaving system in inconvenient state if arvados-server init doesn't go well
  • Save a docker image (alpine linux? hello world?) during "init", and use it instead of arvados/jobs in diagnostics
  • Document firewall / accessible port requirements
  • Sanity-check dns/firewall early in arvados-server init
  • Remove setup roadblocks (e.g., use PAM instead of Google API keys)
  • Fix internal/external client detection so remote clients don't try to connect to keepstore at 0.0.0.0:9010
  • Link "next steps" section to relevant doc pages
  • Add "make an admin user" to next steps
  • Review/remove obsolete package dependencies (libpython2.7, *-dev?)

Files

dispatch-cloud.log (44.4 KB), Lucas Di Pentima, 09/06/2022 03:48 PM

Subtasks 2 (0 open, 2 closed)

Task #19244: Review 17344-easy-demo, Resolved, Tom Clegg, 07/15/2022
Task #19449: Review 17344-easy-demo, Resolved, Tom Clegg, 07/15/2022

Related issues

Related to Arvados - Idea #16306: [install] Build all-in-one server package using arvados-server install/boot in production mode, Resolved, Tom Clegg, 09/22/2020
Related to Arvados Epics - Idea #15941: arvados-boot, New
Related to Arvados Epics - Idea #18337: Easy install via OS package, In Progress
Actions #1

Updated by Tom Clegg about 3 years ago

  • Related to Idea #16306: [install] Build all-in-one server package using arvados-server install/boot in production mode added
Actions #2

Updated by Tom Clegg about 3 years ago

Actions #3

Updated by Peter Amstutz almost 3 years ago

  • Target version deleted (Arvados Future Sprints)
Actions #4

Updated by Peter Amstutz almost 2 years ago

  • Target version set to 2022-07-20
Actions #5

Updated by Tom Clegg almost 2 years ago

  • Related to Idea #18337: Easy install via OS package added
Actions #6

Updated by Tom Clegg almost 2 years ago

  • Description updated (diff)
Actions #7

Updated by Tom Clegg almost 2 years ago

  • Status changed from New to In Progress
  • Description updated (diff)
Actions #8

Updated by Tom Clegg almost 2 years ago

  • Description updated (diff)
Actions #10

Updated by Tom Clegg almost 2 years ago

  • Description updated (diff)
Actions #11

Updated by Lucas Di Pentima almost 2 years ago

Sorry for the delay, here are some comments:

  • The ticket mentions a "demo" mode. Is the "single-host production" auto install also the demo? I think the "demo mode" could be configured to set the first user as an admin, and also auto-activate new users.
  • Could we add the postgresql & docker.io packages as dependencies so it gets auto-installed when necessary? If we aim to do a single node install, those dependencies are needed on the same host, or do you have another possibility in mind?
  • In lib/install/deps.go:L647, do you think we could use a dynamic amount of parallel jobs depending on the available cpu cores? I think it would be beneficial if we later decide to use a high-CPU worker for the package build pipeline.
  • Question: The version number selected for the package is "2.1.0", is this due to the branch being created from 16652's branch that was started in March?
  • While thinking about ways we can get the diagnostics tool to be usable anywhere, I thought of 2 ideas:
    • Given that the alpine docker image is so small (5.6 MB), we could somehow embed it in our arvados-client so that it can upload it to Keep if necessary.
    • If we don't want binary blobs inside our own binary, we could use a tool like skopeo (https://github.com/containers/skopeo) to download it to the local filesystem instead of needing the docker daemon.
      • Although it's an interesting project, I guess having to install it (and its dependencies) would be as annoying as installing docker to get the same effect? Not sure if it can be used as a library just for the purpose of downloading docker images from the registry.
  • In lib/install/init.go:L118-125, wouldn't it be better to iterate over a list of port numbers? AFAICT, if ports 4440 & 443 are already taken, the current code doesn't fail.
  • After initialization, the message is: "Setup complete, you can access wb at xxxx"... do you think it would be useful to also suggest the admin to do a diagnostics run? Or maybe execute it automatically before the "setup complete" message?
  • The docs say that the user should be set up by username, but when I tried I got this (the docs' example lacks --user):
    root@debian-s-4vcpu-8gb-nyc3-01:~# arv sudo user setup lucas
    Top level ::CompositeIO is deprecated, require 'multipart/post' and use `Multipart::Post::CompositeReadIO` instead!
    Top level ::Parts is deprecated, require 'multipart/post' and use `Multipart::Post::Parts` instead!
    Error: //railsapi.internal/arvados/v1/users/setup: 422 Unprocessable Entity: #<ArgumentError: Required uuid or user> (req-1s86olhxcvhrlap8h424)
    
  • The initial user wasn't set up as an admin user, so I think the docs could also say how to set a user as admin via the CLI?
  • In the docs section about customizing the cluster, maybe we can have some of those bulletpoints linked to sections of the documentation about manual install/config?
Actions #12

Updated by Tom Clegg over 1 year ago

  • Description updated (diff)
Actions #13

Updated by Tom Clegg over 1 year ago

  • Description updated (diff)
Actions #14

Updated by Tom Clegg over 1 year ago

I haven't been thinking of this as a separate "demo mode" per se -- rather, getting the single-node production install far enough along to use as a demo, but not necessarily functional enough to recommend for production yet (e.g., doesn't handle database migrations yet).

Activation/admin setup could definitely be made smoother/easier. If possible I'd like to solve this in the secure/private case, rather than lean on insecure/open settings for the sake of convenience. Ideas:
  • make a command more like arv sudo user setup [--admin] $username
  • make arv sudo user setup $username work even if it's run before the user's first login (we made a system for this so we could pre-approve people based on their Google account address, but I'm not sure whether it works in the PAM case)
  • option to auto-activate + auto-admin when using PAM and user is in a specified group (like "sudo" or "adm")
  • an arv sudo ... command that [creates a new user] and prints a https://wb2/token?api_token=... link to log you in right away

> postgresql & docker.io packages as dependencies so it gets auto-installed when necessary

Both postgresql server and docker daemon seem a bit much to install where they're not needed. Depending on how you define "single-node install", postgresql server might be on a different host, or a cloud service. Docker isn't needed on server nodes in normal usage, only for the sake of diagnostics. (Also, although we're not there yet, my intent is to make a multi-node cluster something like "on each host, install arvados-server-easy, then do this "join" command".)

I was even wondering if we can remove the gitolite dependency (and its annoying interactive prompt during package install) and automatically disable the git features if it's not installed.

How about making the install instructions say "apt install postgresql docker.io arvados-server-easy", with notes about omitting them (or removing them afterward) if not needed?

> dynamic amount of parallel jobs depending on the available cpu cores

Oh yeah, good catch. Done.
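
For illustration only, here is a minimal Go sketch of sizing the job count from the available cores. It is not the code in lib/install/deps.go; the helper name parallelJobs and the "make -j" usage are assumptions made for the example.

```go
package main

import (
	"fmt"
	"runtime"
)

// parallelJobs is a hypothetical helper: it returns the number of build jobs
// to run on a machine with the given number of CPU cores, never less than 1.
func parallelJobs(cores int) int {
	if cores < 1 {
		return 1
	}
	return cores
}

func main() {
	// runtime.NumCPU reports the logical CPUs usable by this process.
	jobs := parallelJobs(runtime.NumCPU())
	// Assumption: the result would be passed to a build tool as a -j flag.
	fmt.Printf("make -j%d\n", jobs)
}
```

On a larger build worker the same code picks up the extra cores without any configuration change.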

> version number selected for the package is "2.1.0"

Yes, it uses the same rules as the existing package scripts: for real published packages the caller should be specifying the version (arvados-package build -package-version=2.4.1), otherwise we use source:build/version-at-commit.sh to guess something based on the git history.

> embed alpine docker image

Hm, I kinda like this idea. Is there an even lighter image that would be useful for testing? It really doesn't need to do much. Yes! There is https://hub.docker.com/_/hello-world -- "docker save" makes a 24064 byte .tar file.

> better to iterate over a list of port numbers? AFAICT, if ports 4440 & 443 are already taken, the current code doesn't fail

Oops, yes. Fixed. And now it tests all of 4440-4460, not just 4440.

> suggest the admin to do a diagnostics run? Or maybe execute it automatically before the "setup complete" message?

Suggesting seems good -- added. I'm not sure about doing it automatically. I like the idea of teaching the user to use 'arv sudo diagnostics' themselves early in the game.

> arv sudo user setup lucas

Oh yeah. I wrote that in the docs because it would be nice if it really looked that way. Currently I think you need to say --uuid {paste_uuid_here} and getting the UUID was too annoying to document.

> In the docs section about customizing the cluster, maybe we can have some of those bulletpoints linked

Added some links. The existing doc pages aren't exactly right for this context (e.g., telling you to install arvados-dispatch-cloud) but it's a start.

17344-easy-demo @ a2d23c038780134c812249e74d9e6d1b7cad69b6 -- developer-run-tests: #3240

Actions #15

Updated by Tom Clegg over 1 year ago

17344-easy-demo @ d15f485909cf84aeda62c0a843f384cb218e0125 -- developer-run-tests: #3241

Removes some dev-only/outdated package dependencies

Actions #16

Updated by Tom Clegg over 1 year ago

17344-easy-demo @ c966970d64c21d7adaf1c3c8b737aa9e7c166f0e

Adds -create-db=false option, with connection info accepted from POSTGRES_HOST/USER/DB/PASSWORD env vars
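
A minimal sketch of reading such connection settings in Go follows. It assumes the shorthand above expands to POSTGRES_HOST, POSTGRES_USER, POSTGRES_DB and POSTGRES_PASSWORD; the dbConfig type, the dbConfigFromEnv helper and the "all four are required" rule are hypothetical and not taken from the branch.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// dbConfig is a hypothetical container for the connection settings.
type dbConfig struct {
	Host, User, DB, Password string
}

// dbConfigFromEnv reads the four variables through getenv (os.Getenv in real
// use) and reports which ones are missing. Requiring all four is an
// assumption made for this sketch.
func dbConfigFromEnv(getenv func(string) string) (dbConfig, error) {
	cfg := dbConfig{
		Host:     getenv("POSTGRES_HOST"),
		User:     getenv("POSTGRES_USER"),
		DB:       getenv("POSTGRES_DB"),
		Password: getenv("POSTGRES_PASSWORD"),
	}
	var missing []string
	for _, v := range []struct{ name, value string }{
		{"POSTGRES_HOST", cfg.Host},
		{"POSTGRES_USER", cfg.User},
		{"POSTGRES_DB", cfg.DB},
		{"POSTGRES_PASSWORD", cfg.Password},
	} {
		if v.value == "" {
			missing = append(missing, v.name)
		}
	}
	if len(missing) > 0 {
		return cfg, fmt.Errorf("missing environment variables: %s", strings.Join(missing, ", "))
	}
	return cfg, nil
}

func main() {
	cfg, err := dbConfigFromEnv(os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Deliberately avoid printing the password.
	fmt.Printf("using database %q on %q as user %q\n", cfg.DB, cfg.Host, cfg.User)
}
```

Passing getenv as a parameter keeps the helper easy to exercise without touching the real process environment.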

Actions #17

Updated by Lucas Di Pentima over 1 year ago

This LGTM, thanks!

Actions #18

Updated by Tom Clegg over 1 year ago

  • Target version changed from 2022-07-20 to 2022-08-03 Sprint
Actions #19

Updated by Tom Clegg over 1 year ago

  • Description updated (diff)
Actions #20

Updated by Peter Amstutz over 1 year ago

  • Target version changed from 2022-08-03 Sprint to 2022-08-17 sprint
Actions #21

Updated by Peter Amstutz over 1 year ago

  • Target version changed from 2022-08-17 sprint to 2022-08-31 sprint
Actions #22

Updated by Tom Clegg over 1 year ago

17344-easy-demo @ 19c5342a76ab9474c3c8eb5c0e7903c58203a055 -- developer-run-tests: #3273

This makes remote upload/download work in a cloud VM demo install:
  • wb2 client is recognized as external, so controller tells it to use keepproxy
  • keepproxy is recognized as internal, so controller tells it to use keepstore
Making this work on a cloud VM required some trickery.
  • Autocert requires the external controller hostname to resolve to a publicly routable IP address that lands on the controller host.
  • The publicly routable IP address is not bound to a local interface: when the controller host itself connects to the external URL, traffic goes through an external gateway, and the remote address seen by Nginx is not recognizable as a local address.
  • Even if we could identify it (e.g., x-forwarded-for), we don't want server-to-server traffic going through that external gateway anyway.

The current/old deployment strategy is to fix this with split-horizon DNS. That's not a suitable approach for an easy-install / quick demo scenario.

The solution here is to have "inside" clients (i.e., server components) connect to the server's network interface rather than resolving the external URL host, but validate the presented TLS certificate based on the external URL host. A new env var "ARVADOS_SERVER_ADDRESS" indicates the address clients should connect to.
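
The approach can be sketched with Go's standard library as below. This is an illustration, not the branch's implementation: the helpers newInternalClient and rewriteAddr are hypothetical, and the sketch assumes ARVADOS_SERVER_ADDRESS holds a bare host or IP address with no port. It relies on net/http using the URL's host name for TLS verification when TLSClientConfig.ServerName is unset, so overriding only the dial address leaves certificate validation tied to the external host.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"os"
	"time"
)

// rewriteAddr replaces the host part of addr ("host:port") with serverAddr,
// keeping the port. With an empty serverAddr, addr is returned unchanged.
func rewriteAddr(addr, serverAddr string) string {
	if serverAddr == "" {
		return addr
	}
	_, port, err := net.SplitHostPort(addr)
	if err != nil {
		return addr
	}
	return net.JoinHostPort(serverAddr, port)
}

// newInternalClient returns an HTTP client whose TCP connections go to
// serverAddr instead of the address the URL host resolves to. TLS still
// verifies the certificate against the host name in the request URL.
// Note that this sketch redirects every host the client talks to.
func newInternalClient(serverAddr string) *http.Client {
	dialer := &net.Dialer{Timeout: 10 * time.Second}
	return &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
				return dialer.DialContext(ctx, network, rewriteAddr(addr, serverAddr))
			},
		},
	}
}

func main() {
	serverAddr := os.Getenv("ARVADOS_SERVER_ADDRESS")
	_ = newInternalClient(serverAddr)
	// arvados.example.com is a placeholder external host name.
	fmt.Println("would dial:", rewriteAddr("arvados.example.com:443", serverAddr))
}
```

Because only the dial step changes, this kind of client needs neither split-horizon DNS nor a route through the public gateway.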

To follow through with this, we also need to support ARVADOS_SERVER_ADDRESS in Python and Ruby SDKs, so arv-mount (on shell node, worker VM), ruby arv cli (on server node, shell node), and workbench1 work without split horizon DNS or unnecessary routing through the public IP gateway.

As this branch stands so far, only Go services know how to do this, which is the minimum we need, since keepproxy outright breaks without it.

Actions #23

Updated by Tom Clegg over 1 year ago

17344-easy-demo @ 124e87dbafd6c04c9937f45e90f2662c715bea90 -- developer-run-tests: #3275

Disables an sdk/python keepclient test that relies on sending the X-External-Client header to persuade controller to treat it as an external client and send keepproxy info instead of keepstore info, all so it can test that the discovery code notices and sets the using_proxy flag.

We have this ARVADOS_EXTERNAL_CLIENT env var / settings entry that causes the Python client to set that header. But on a real cluster (and arvbox), Nginx deletes the client-provided header and replaces it with 0 or 1 depending on the remote IP address. So ARVADOS_EXTERNAL_CLIENT has only ever worked in the test suite.

Since ARVADOS_EXTERNAL_CLIENT only works in the test suite, and this is the only test that fails if we ignore it, I'm thinking we should rip out all the ARVADOS_EXTERNAL_CLIENT stuff, and rewrite this test so it mocks a keep_services/accessible response, instead of convincing the whole nginx/controller/rails stack to return a proxy.

Actions #24

Updated by Tom Clegg over 1 year ago

17344-easy-demo @ 3d99c1541a450411a847c1c2b87721a4c51b484e -- developer-run-tests: #3276

Replace the using_proxy test with one that uses a mock.

Actions #25

Updated by Tom Clegg over 1 year ago

17344-easy-demo @ 10440ac12d6771ab80469adf551d2cac8d3461e6 -- developer-run-tests: #3281
  • removes ARVADOS_EXTERNAL_CLIENT
Actions #26

Updated by Tom Clegg over 1 year ago

  • Target version changed from 2022-08-31 sprint to 2022-09-14 sprint
Actions #27

Updated by Lucas Di Pentima over 1 year ago

I have been testing the new branch on a freshly created DigitalOcean droplet. The diagnostics run failed only on the test container run.

Attached is the dispatch-cloud part of the logs, just in case there's some clue about what was going on.

I'll retry with another VPS just in case I did something wrong.

Actions #28

Updated by Lucas Di Pentima over 1 year ago

I've retried everything with a new VPS, installing PostgreSQL and Docker before initializing the cluster. It worked great!

Actions #29

Updated by Tom Clegg over 1 year ago

I suspect crunch-run set the "broken node" flag. Real drivers fix it by destroying the node and creating a new one. Loopback driver needs to explicitly delete it.

17344-easy-demo @ ee158449ac8cc70708a161cd36845f57b5a248f1

Actions #30

Updated by Lucas Di Pentima over 1 year ago

This LGTM. I have a related comment:

  • Package building finished with a message like: {:timestamp=>"2022-09-05T19:51:55.483136+0000", :message=>"Created package", :path=>"/pkg/arvados-server-easy_2.1.0-2866-g10440ac12_amd64.deb"} -- but the real path was "/tmp/*.deb"

I know this is supposed to be used by CI tools but it could be confusing when debugging a package building pipeline.

Actions #31

Updated by Tom Clegg over 1 year ago

Hm, yes. From fpm's perspective inside the container, the output is always /pkg/*.deb. Maybe we just need to print our own log message after that, with the real (host) path.

Actions #32

Updated by Tom Clegg over 1 year ago

17344-easy-demo @ 0840aec1ec6fdcce4d1a317578bc1f5f5be1a1f6

$ go run ./cmd/arvados-package build -package-dir /tmp -package-version $(git describe) -target-os debian:11
...
-rw-r--r-- 1 tom tom 100955048 Sep  8 10:27 /tmp/arvados-server-easy_2.1.0-2869-g0840aec1e_amd64.deb
Actions #33

Updated by Tom Clegg over 1 year ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados-private:commit:arvados|72ed2e6e260d8e12e49716a261b6306d8de13e8d.

Actions #34

Updated by Peter Amstutz over 1 year ago

  • Release set to 47