Story #9359

[Crunch2] Document crunch2 deployment

Added by Tom Clegg over 5 years ago. Updated about 5 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Documentation
Target version:
Start date:
06/07/2016
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
1.0
Release:
Release relationship:
Auto

Description

Start with a wiki page. When we're ready, we'll port it to doc.arvados.org.

Partial list of things to address:
  • Run crunch-dispatch daemon somewhere
  • Install crunch-run package on slurm/worker nodes
  • Kernel config?
  • Docker config?
  • How to test that it's working?

Subtasks

Task #9380: Review [[Crunch2 installation]] wiki (Resolved, Brett Smith)


Related issues

Related to Arvados - Support #9370: [Crunch2] Package crunch-dispatch-local, crunch-dispatch-slurm, crunch-run (Resolved, 06/08/2016)

History

#1 Updated by Tom Clegg over 5 years ago

  • Release set to 11

#2 Updated by Tom Clegg over 5 years ago

  • Assigned To set to Tom Clegg
  • Target version set to 2016-06-22 sprint
  • Story points set to 1.0

#4 Updated by Brett Smith over 5 years ago

I went over the page and made some minor changes mostly because I'm incorrigible. I took out references to MUNGE because there are other ways to do authorization in SLURM besides that, and I'm not aware of any reason Crunch v2 requires MUNGE specifically. If I'm wrong about that, maybe consider this a prompt to add more details about that in the docs. :)

Along the same lines, I'm wondering about the bit "Ensure the crunch user exists -- and has the same UID, GID, and home directory -- on the dispatch node and all SLURM compute nodes." Are we sure the same UID, GID, and home directory are required in all (or at least most) SLURM configurations? Or is that maybe more MUNGE-specific?

On the whole I think this is a good start. It got me thinking about how we might organize things in the install guide. The first idea I had was to start splitting off subsections where the install guide currently covers Crunch setup. So the TOC would be something like:

  • Manual installation
    • [… non-compute pages…]
    • Set up Crunch v2 on SLURM
      • Install the Crunch SLURM dispatcher
      • Set up SLURM compute nodes
    • Set up Crunch v2 on SGE [after we build that]
      • Install the Crunch SGE dispatcher
      • Set up SGE compute nodes
    • Set up Crunch v1 [pages below are the ones we already have]
      • Install the dispatcher
      • Set up compute nodes

What do you think of that? Maybe lots of instructions would be shared across the different Crunch v2 pages, but that's easy to do with includes.

Re cgroup accounting: Yeah, sure, it seems worth pointing out that Crunch can do more with this enabled.

You may already have this in mind, but the measures of success could use elaboration. What do successful log entries look like? How can I see the job in squeue? And then anything we can do to help the user find their specific container in the list, and highlight the relevant values, will be useful.

You are welcome to treat this as being out of scope, and pretend I didn't say it, but: I wonder if we should start a switch from suggesting runit service definitions to suggesting systemd unit files. systemd is equally capable AFAIK, the setup process is easier to describe (install one file vs. set up a directory hierarchy and multiple files), and included with several supported distributions. The only benefit of runit is that we can standardize on it across all distributions—which is not nothing, but going to become less relevant as the non-systemd distributions go out of use.

#5 Updated by Brett Smith over 5 years ago

Brett Smith wrote:

You are welcome to treat this as being out of scope, and pretend I didn't say it, but: I wonder if we should start a switch from suggesting runit service definitions to suggesting systemd unit files. systemd is equally capable AFAIK, the setup process is easier to describe (install one file vs. set up a directory hierarchy and multiple files), and included with several supported distributions. The only benefit of runit is that we can standardize on it across all distributions—which is not nothing, but going to become less relevant as the non-systemd distributions go out of use.

Talked with Ward about this. He agrees it's the right direction and won't interfere with ops (not using any features in runit that aren't in systemd). Honestly I expected him to be the most skeptical, so the fact that he's in favor makes me think we should do it. Meaning we shouldn't write any more runit scripts in the docs; instead we should just write systemd unit files.

If you like I'd be happy to contribute the actual service definitions, so the branch doesn't have to scope creep to include "learning systemd" for you (unless you want it to).

#6 Updated by Brett Smith about 5 years ago

So the question came up, why not just ship systemd unit files in our packages directly?

That's definitely the direction we should move in long-term, but I think it would require additional development work on the daemons before it would be much use. Right now almost all of them require configuration on the command line to function. There is basically no unit file we can ship that does not require additional configuration—far more configuration than whatever common core we provide.

For now, it honestly seems more user-friendly to give people unit file templates to install and edit. That makes it clear that something needs to be edited, and what that something is. As our daemons become more capable, supporting their own configuration files and service autodiscovery, we can revisit. And that's true on a case-by-case basis. For example, maybe it makes sense to package a unit file for keepproxy now, since it only requires an API host+token to be added to the configuration to be minimally useful. But for any tool where we expect users to configure it through the command line, it's less great.
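A template along those lines might look like the following. This is a sketch only: the file path, the Description text, and the environment variable values are illustrative assumptions, not packaged defaults, and everything marked EDIT would need to be filled in per cluster.

```ini
# Hypothetical /etc/systemd/system/crunch-dispatch-slurm.service template.
[Unit]
Description=Arvados Crunch v2 SLURM dispatcher
After=network.target

[Service]
User=crunch
# EDIT: point these at your cluster (values below are placeholders).
Environment=ARVADOS_API_HOST=zzzzz.example.com
Environment=ARVADOS_API_TOKEN=EDIT-ME
ExecStart=/usr/bin/crunch-dispatch-slurm
Restart=always

[Install]
WantedBy=multi-user.target
```

The point of shipping this as a template rather than in the package is exactly the EDIT lines: the unit is useless until they are changed, and handing the user the file makes that obvious.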

#7 Updated by Tom Clegg about 5 years ago

Brett Smith wrote:

I went over the page and made some minor changes mostly because I'm incorrigible. I took out references to MUNGE because there are other ways to do authorization in SLURM besides that, and I'm not aware of any reason Crunch v2 requires MUNGE specifically. If I'm wrong about that, maybe consider this a prompt to add more details about that in the docs. :)

Good call, AFAIK there's no particular reason we expect munge. Whatever gets SLURM running is fine.

Along the same lines, I'm wondering about the bit "Ensure the crunch user exists -- and has the same UID, GID, and home directory -- on the dispatch node and all SLURM compute nodes." Are we sure the same UID, GID, and home directory are required in all (or at least most) SLURM configurations? Or is that maybe more MUNGE-specific?

Updated with some "depending on your setup" words, and a way to test that whatever you've done is OK.
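One shape such a test could take (a sketch, not the wording from the wiki page: the check_user helper is invented here, and the use of srun to reach compute nodes is an assumption about the cluster setup):

```shell
# Print account name, UID, GID, and home directory as one comparable line.
check_user() {
  printf '%s %s %s %s\n' "$1" "$(id -u "$1")" "$(id -g "$1")" \
    "$(getent passwd "$1" | cut -d: -f6)"
}

# Run the same helper on the dispatch node and on each compute node
# (e.g. via srun) for the crunch account; every line must be identical.
# Demonstrated here with root, which exists on any Linux host:
check_user root
```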

...splitting off subsections where the install guide currently covers Crunch setup...

Sounds good to me.

You may already have this in mind, but the measures of success could use elaboration. What do successful log entries look like? How can I see the job in squeue? And then anything we can do to help the user find their specific container in the list, and highlight the relevant values, will be useful.

Yes, planning to fill in more details/examples here, from a (more-)production-like setup if possible. Added notes to avoid forgetting.
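For the squeue part, the docs might end up with something along these lines. The container UUID and the output line below are fabricated for illustration, and it is an assumption that the dispatcher names each SLURM job after its container UUID; on a real dispatch node the echo would be replaced by the actual squeue call shown in the comment.

```shell
# On a real cluster:
#   squeue --noheader --format='%i %j %T' | grep zzzzz-dz642-xxxxxxxxxxxxxxx
# Simulated here with a canned line so the filtering step can be shown:
sample='1234 zzzzz-dz642-xxxxxxxxxxxxxxx RUNNING'
echo "$sample" | grep -c 'zzzzz-dz642'
```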

You are welcome to treat this as being out of scope, and pretend I didn't say it, but: I wonder if we should start a switch from suggesting runit service definitions to suggesting systemd unit files. systemd is equally capable AFAIK, the setup process is easier to describe (install one file vs. set up a directory hierarchy and multiple files), and included with several supported distributions. The only benefit of runit is that we can standardize on it across all distributions—which is not nothing, but going to become less relevant as the non-systemd distributions go out of use.

(as discussed offline) I think it would be great to include a systemd unit file with the package itself, and skip some copy-paste.

#8 Updated by Tom Clegg about 5 years ago

  • Status changed from New to Resolved
