Feature #21678: installer.sh option to run diagnostics on one of the internal instances that it installed instead of only locally - Arvados

All agreed upon points are implemented / addressed.
- Yes
Anything not implemented (discovered or discussed during work) has a follow-up story.
- Brett asked to have this change be exercised by our test-provision-X pipelines, but those pipelines only test the provision.sh script, not the installer. We should add a new story so that we can test the installer script both in single and multi node modes.
Code is tested and passing, both automated and manual, what manual testing was done is described
- Tested it manually by deploying a multi node test cluster in our Sandbox account and run the diag tests on its shell node.
Documentation has been updated.
- Yes.
Behaves appropriately at the intended scale (describe intended scale).
- Yes. The intended scale is just to run the diagnostics on just 1 internal node, so if the cluster has multiple shell nodes, it'll just use the first one on the list.
Considered backwards and forwards compatibility issues between client and server.
- Yes: for backwards compatibility, instead of modifying the current diagnostics sub-command, I've added a new one diagnostics-internal (not happy with its name, though)
Follows our coding standards and GUI style guidelines.
- Yes

This branch adds a diagnostics-internal (alternate name ideas are welcome) sub-command to installer.sh. When used, it will first check if there's at least one node defined with the shell role, and use the first one of the list to run arvados-client diagnostics -internal-client remotely. The diagnostics gets run with sudo to ensure the process is able to use docker (currently, the admin user isn't part of the docker group and I don't think we need to fiddle with that)

Actions

Copy link

Updated by Lucas Di Pentima 11 months ago

Ticket for updating test-provision jobs is https://dev.arvados.org/issues/19232

Actions

Copy link

Updated by Peter Amstutz 11 months ago

Target version changed from Development 2024-05-08 sprint to Development 2024-05-22 sprint

Actions

Copy link

Updated by Brett Smith 11 months ago

Lucas Di Pentima wrote in #note-4:

21678-installer-diagnostics-internal @ 887751a

This mostly looks good. I pushed a commit with small copyedits to the new documentation to try to clarify it a little. If you don't like it, you're welcome to delete and re-push, this just seemed easier than trying to write out the "word diff" in a comment. Or if you just want to copyedit my copyedits, that's fine too.

I just have a couple of questions about the command that runs the diagnostics itself:

$SSH $DEPLOY_USER@$TESTNODE "sudo ARVADOS_API_HOST=${DOMAIN}:${CONTROLLER_EXT_SSL_PORT} ARVADOS_API_TOKEN=$SYSTEM_ROOT_TOKEN arvados-client diagnostics -internal-client"

Why does this use sudo? I don't think arvados-client needs privileges beyond having the system root token, does it?
The way this is written means that the system root token will leak in the system's process list when sudo is first invoked. This is particularly a problem since we're running the command on the shell node, where other users might have accounts and be able to use this to escalate privileges. Can we use an implementation that doesn't put the token in any command's argument list?

Anything not implemented (discovered or discussed during work) has a follow-up story.

Brett asked to have this change be exercised by our test-provision-X pipelines, but those pipelines only test the provision.sh script, not the installer.

Yeah I think I had both tickets mixed up in my head when I made this comment, that's totally fine. Sorry for my confusion.

Thanks.

Actions

Copy link

Updated by Lucas Di Pentima 11 months ago

Brett Smith wrote in #note-7:

This mostly looks good. I pushed a commit with small copyedits to the new documentation to try to clarify it a little. If you don't like it, you're welcome to delete and re-push, this just seemed easier than trying to write out the "word diff" in a comment. Or if you just want to copyedit my copyedits, that's fine too.

Thanks, they look great to me.

Why does this use sudo? I don't think arvados-client needs privileges beyond having the system root token, does it?

I didn't wanted to fiddle with the shell's user group membership to make it able to use docker. The default diagnostics behavior builds a docker image to use for the tests.

The way this is written means that the system root token will leak in the system's process list when sudo is first invoked. This is particularly a problem since we're running the command on the shell node, where other users might have accounts and be able to use this to escalate privileges. Can we use an implementation that doesn't put the token in any command's argument list?

Good catch, I'll use sudo's -E to forward the env vars from the original user's session.

Actions

Copy link

Updated by Lucas Di Pentima 11 months ago

Updates at ef4f350

Instead of passing credentials through environment variables (because doing that properly would require to reconfigure the remote SSH daemon to accept them), we now set up a ~/.config/arvados/settings.conf file in the remote host's root account previous to executing arvados-client.

This not only removes the token leakage on the remote host's process list, but also minimize it on the local host's process list, as it would only be visible while writing it to the temp config file.

Actions

Copy link

#10

Updated by Brett Smith 11 months ago

Lucas Di Pentima wrote in #note-9:

Updates at ef4f350

I didn't wanted to fiddle with the shell's user group membership to make it able to use docker.

I don't love this either, since running the whole thing as root is an awfully big hammer just to get membership to the docker group. But since the docker group is root-equivalent anyway, I can live with it.

Instead of passing credentials through environment variables (because doing that properly would require to reconfigure the remote SSH daemon to accept them), we now set up a ~/.config/arvados/settings.conf file in the remote host's root account previous to executing arvados-client.

This solves the leaking credentials problem, but now we've got a configuration management problem. This overwrites the user's arvados/settings.conf on the shell node, which they might not be expecting. And then it leaves the system root token there, which is not ideal from a security perspective either, especially since the user might not realize it got written there.

What if we just piped a script to bash like this?

$SSH $DEPLOY_USER@$TESTNODE bash <<EOF
export ARVADOS_API_HOST="${DOMAIN}:${CONTROLLER_EXT_SSL_PORT}" 
export ARVADOS_API_TOKEN="$SYSTEM_ROOT_TOKEN" 
sudo -E arvados-client diagnostics -internal-client
EOF

The here-doc gets evaluated entirely inside the shell, so none of this leaks into that process list on the local system.
export is a shell built-in, so those variables don't leak into the process list on the shell node either.
This avoids writing any files and is hopefully simpler, assuming it works.

Actions

Copy link

#11

Updated by Lucas Di Pentima 11 months ago

Brett Smith wrote in #note-10:

What if we just piped a script to bash like this?

[...]

The here-doc gets evaluated entirely inside the shell, so none of this leaks into that process list on the local system.

export is a shell built-in, so those variables don't leak into the process list on the shell node either.

This avoids writing any files and is hopefully simpler, assuming it works.

That's a great solution, wasn't aware of that here-doc behavior. Thanks!

Here's the updated version: c0c47d7650 -- I've also explicitly listed which env vars should sudo preserve, to make it less nasty.

Actions

Copy link

#12