Bug #17244 (closed)

Make sure cgroupsV2 works with Arvados

Added by Nico César almost 4 years ago. Updated about 1 year ago.

Status: Resolved
Priority: Normal
Assigned To: Tom Clegg
Category: Crunch
Target version: Development 2023-08-16
Story points: 2.0
Release relationship: Auto

Description

Reading

https://docs.docker.com/config/containers/runmetrics/

Running Docker on cgroup v2

Docker supports cgroup v2 experimentally since Docker 20.10. Running Docker on cgroup v2 also requires the following conditions to be satisfied:

containerd: v1.4 or later
runc: v1.0.0-rc91 or later
Kernel: v4.15 or later (v5.2 or later is recommended)

Note that the cgroup v2 mode behaves slightly different from the cgroup v1 mode:

The default cgroup driver (dockerd --exec-opt native.cgroupdriver) is “systemd” on v2, “cgroupfs” on v1.
The default cgroup namespace mode (docker run --cgroupns) is “private” on v2, “host” on v1.
The docker run flags --oom-kill-disable and --kernel-memory are discarded on v2.

With all these changes, we have to make sure that:

  1. We can run a distro that has cgroup v2 by default (as Fedora does since 2020), or one booted with cgroups v2 enabled in systemd (kernel parameter systemd.unified_cgroup_hierarchy=1), with Docker >= 20.10
  2. We can guide the admin through upgrading to cgroup v2, with an easy test case to check that Arvados will run

The last point is important because the current error is kind of cryptic:

applying cgroup configuration for process caused: cannot enter cgroupv2 "/sys/fs/cgroup/docker" with domain controllers

There are also cryptic messages with a cgroups v2-enabled host and Docker 19.03.13:

Status: Downloaded newer image for hello-world:latest
docker: Error response from daemon: cgroups: cgroup mountpoint does not exist: unknown.
ERRO[0005] error waiting for container: context canceled

https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html
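
For reference, a minimal Go sketch (not Arvados code; paths per the kernel documentation above) that distinguishes the three host modes by checking for the cgroup.controllers file:

package main

import (
    "fmt"
    "os"
)

func main() {
    // In unified (v2) mode the cgroup2 filesystem is mounted at
    // /sys/fs/cgroup and exposes cgroup.controllers at its root.
    if _, err := os.Stat("/sys/fs/cgroup/cgroup.controllers"); err == nil {
        fmt.Println("cgroups v2 (unified mode)")
    } else if _, err := os.Stat("/sys/fs/cgroup/unified/cgroup.controllers"); err == nil {
        // Hybrid mode mounts v2 at /sys/fs/cgroup/unified alongside
        // the v1 controller hierarchies.
        fmt.Println("cgroups hybrid mode")
    } else {
        fmt.Println("cgroups v1 only")
    }
}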

We can remove the crunchstat command-line program and debian package rather than update it.


Subtasks 3 (0 open, 3 closed)

Task #20712: Review 17244-cgroup2 - Resolved - Tom Clegg - 07/18/2023
Task #20808: Confirm works on dev clusters - Resolved - Tom Clegg - 07/18/2023
Task #20835: Update tordo compute image kernel config from "hybrid" to "unified" mode - Resolved - Lucas Di Pentima - 08/09/2023

Related issues

Related to Arvados - Bug #17270: Test for docker cgroups issues in crunch-run works on ubuntu 20.04 - Resolved - Nico César
Related to Arvados - Bug #20616: "cgroup stats files never appeared" on scale cluster - Duplicate - Lucas Di Pentima - 07/05/2023
Blocks Arvados - Feature #20756: Support crunchstat tracking and memory limits with singularity - Resolved - Tom Clegg
#1

Updated by Nico César almost 4 years ago

  • Target version set to 2021-01-20 Sprint
  • Category set to Crunch
#2

Updated by Javier Bértoli over 3 years ago

I tried Arvados with the following setup:

1. Built binaries/images from current master (commit e98f4df4a@arvados)
2. Created a cluster
3. Run the test script from the salt-install test dir
4. With kernel Linux 5.9.0-5-amd64 & cgroups2 (as documented here, I have /sys/fs/cgroup/cgroup.controllers)
5. Using docker 20.10
6. Using containerd 1.4.3
7. When I run the script, I get:

+ cwl-runner hasher-workflow.cwl hasher-workflow-job.yml
INFO /usr/bin/cwl-runner 2.1.1, arvados-python-client 2.1.1, cwltool 3.0.20200807132242
INFO Resolved 'hasher-workflow.cwl' to 'file:///usr/src/arvados/tests/hasher-workflow.cwl'
INFO hasher-workflow.cwl:36:7: Unknown hint WorkReuse
INFO hasher-workflow.cwl:50:7: Unknown hint WorkReuse
INFO hasher-workflow.cwl:64:7: Unknown hint WorkReuse
INFO Using cluster arvie (https://arvie.arv.local:8000/)
INFO Upload local files: "test.txt" 
INFO Using collection f55e750025853f5b8ccae3ca79240f65+54 (arvie-4zz18-zbm7cmmt5h9d5rg)
INFO Using collection cache size 256 MiB
INFO [container hasher-workflow.cwl] submitted container_request arvie-xvhdp-7jpooik0zd8aj1t
INFO [container hasher-workflow.cwl] arvie-xvhdp-7jpooik0zd8aj1t is Final
ERROR [container hasher-workflow.cwl] (arvie-dz642-4v8xcwcvjvp5j2f) error log:

  2021-01-11T20:56:51.604627332Z crunch-run crunch-run dev (go1.15) started
  2021-01-11T20:56:51.604709650Z crunch-run Executing container 'arvie-dz642-4v8xcwcvjvp5j2f'
  2021-01-11T20:56:51.604763728Z crunch-run Executing on host '27d4cb3c42e2'
  2021-01-11T20:56:51.871544244Z crunch-run Fetching Docker image from collection '0428f2e88f4b398b8489f6c454e7e9ae+261'
  2021-01-11T20:56:51.940054697Z crunch-run Using Docker image id 'sha256:0dd5078a5bec49810c1fcb86b60e1bda6b9c1e12dc2c3de75453b2fd37a55885'
  2021-01-11T20:56:51.943832124Z crunch-run Docker image is available
  2021-01-11T20:56:51.952139500Z crunch-run Running [arv-mount --foreground --allow-other --read-write --crunchstat-interval=10 --file-cache 268435456 --mount-tmp tmp0 --mount-by-pdh by_id /tmp/crunch-run.arvie-dz642-4v8xcwcvjvp5j2f.288172359/keep406717434]
  2021-01-11T20:56:52.454639768Z crunch-run Creating Docker container
  2021-01-11T20:56:52.509556810Z crunch-run Attaching container streams
  2021-01-11T20:56:53.205291750Z crunch-run Starting Docker container id '7d91dac5eb133131cc9b131d1f0280810acf9c4eda6209b674546bb885c90606'
  2021-01-11T20:56:53.397951196Z crunch-run error in Run: could not start container: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:326: applying cgroup configuration for process caused: cannot enter cgroupv2 "/sys/fs/cgroup/docker" with domain controllers -- it is in an invalid state: unknown
  2021-01-11T20:56:53.752428822Z crunch-run Cancelled
ERROR Overall process status is permanentFail
INFO Final output collection None
{}
WARNING Final process status is permanentFail

The same images and setup with Linux 4.19.0-13-amd64 and systemd 241.7 (cgroups v1) work OK.
#3

Updated by Javier Bértoli over 3 years ago

According to this issue, Debian's systemd defaults to cgroups v2 since 242-7, and Docker does since 20.10.x.

#4

Updated by Ward Vandewege over 3 years ago

  • Description updated (diff)
#5

Updated by Peter Amstutz over 3 years ago

  • Target version deleted (2021-01-20 Sprint)
#6

Updated by Nico César over 3 years ago

  • Related to Bug #17270: Test for docker cgroups issues in crunch-run works on ubuntu 20.04 added
#7

Updated by Peter Amstutz over 1 year ago

  • Release set to 60
#8

Updated by Peter Amstutz over 1 year ago

  • Target version set to Development 2023-06-21 sprint
#9

Updated by Peter Amstutz over 1 year ago

  • Related to Bug #20616: "cgroup stats files never appeared" on scale cluster added
#10

Updated by Peter Amstutz over 1 year ago

  • Target version changed from Development 2023-06-21 sprint to Future
#12

Updated by Peter Amstutz over 1 year ago

  • Target version changed from Future to Development 2023-07-05 sprint
#13

Updated by Peter Amstutz over 1 year ago

  • Assigned To set to Tom Clegg
#14

Updated by Peter Amstutz over 1 year ago

  • Target version changed from Development 2023-07-05 sprint to Development 2023-07-19 sprint
#15

Updated by Peter Amstutz over 1 year ago

  • Release deleted (60)
  • Story points set to 2.0
#16

Updated by Tom Clegg over 1 year ago

Test suite should assume host is in hybrid mode, and test both cgroup1 and cgroup2.

#17

Updated by Tom Clegg over 1 year ago

  • Status changed from New to In Progress
#18

Updated by Tom Clegg over 1 year ago

Status
  • crunchstat is already a bit crufty: on each sample collection it checks multiple places for stat files and logs when the source changes, but AFAICT this is all pointless now that we wait for the target cgroup to appear before logging any stats
  • Ubuntu 20.04 comes with cgroups in hybrid mode but I/O stats are only available through v1, so the v1/v2 decision is per-statistic
  • Now that I've added "get container root pid from docker-inspect" in crunch-run, the "CID file" mechanism seems redundant
  • Most of the tests (including some in lib/crunchrun) rely on forcing crunchstat into v1 mode

TODO: figure out what crunch-run -cgroup-parent-subsystem=X should do in cgroups v2 (any non-empty value means use current process's cgroup?)

#19

Updated by Tom Clegg about 1 year ago

  • Description updated (diff)
#20

Updated by Tom Clegg about 1 year ago

17244-cgroup2 @ 28a733a8823fedadc34a560935abdd17039cb100 -- developer-run-tests: #3744

Rewrote the stats-collection side, added snapshots of relevant OS files for ubuntu1804, ubuntu2004, ubuntu2204, debian11, debian12.

If using cgroups v2 in unified mode, crunch-run -cgroup-parent-subsystem=X means use the current cgroup (so a system configured to run crunch-run -cgroup-parent-subsystem=cpuset should continue to work when the compute nodes lose v1 support).

The -cgroup-root and -cgroup-parent flags are ignored now. Instead we get the cgroupfs root by reading /proc/mounts, and get the process's cgroup by waiting for docker/singularity to return an inside-container process ID and reading /proc/$PID/cgroup.
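
To illustrate the /proc/mounts part, here is a minimal sketch (findCgroupMount is a hypothetical name, not the actual crunch-run function; field layout per proc(5)):

package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"
)

// findCgroupMount scans /proc/mounts for the cgroup2 filesystem (or,
// on a v1/hybrid host, the v1 hierarchy for the named controller).
// Each /proc/mounts line is "device mountpoint fstype options dump pass".
func findCgroupMount(controller string) (string, error) {
    f, err := os.Open("/proc/mounts")
    if err != nil {
        return "", err
    }
    defer f.Close()
    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        fields := strings.Fields(scanner.Text())
        if len(fields) < 4 {
            continue
        }
        mountpoint, fstype, options := fields[1], fields[2], fields[3]
        switch {
        case fstype == "cgroup2":
            return mountpoint, nil
        case fstype == "cgroup" && controller != "":
            // v1 mounts list their controllers in the options field,
            // e.g. "rw,nosuid,nodev,noexec,relatime,blkio".
            for _, opt := range strings.Split(options, ",") {
                if opt == controller {
                    return mountpoint, nil
                }
            }
        }
    }
    return "", fmt.Errorf("no cgroup mount found")
}

func main() {
    root, err := findCgroupMount("")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    fmt.Println("cgroupfs root:", root)
}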

It turns out there are many permutations of the various cgroups v1/v2 stats files. Some systems have blkio.io_service_bytes but it's empty and blkio.throttle.io_service_bytes has the real stats... sigh.

The crunchstat command line program is now a subcommand of arvados-server, and its debian package build entry is removed. I thought we were going to delete it entirely, but it's still useful for
  • generating test cases by taking a snapshot of all the procfs/sysfs files it chooses on a given system
  • debugging/testing manually on a new compute image
#21

Updated by Tom Clegg about 1 year ago

Added #20756 for proper singularity support.

Until we do #20756, singularity doesn't put the container in a separate cgroup, so the above branch generates crunchstat logs based on the host, which is misleading. It's probably better to just log "Pid() never returned a process ID", as we were doing before.

17244-cgroup2 @ 9b055a03a8a6b516854459cbf6cda97917b73e91 -- developer-run-tests: #3745

#22

Updated by Tom Clegg about 1 year ago

  • Blocks Feature #20756: Support crunchstat tracking and memory limits with singularity added
#23

Updated by Tom Clegg about 1 year ago

  • Target version changed from Development 2023-07-19 sprint to Development 2023-08-02 sprint
#24

Updated by Peter Amstutz about 1 year ago

17244-cgroup2 @ 9b055a03a8a6b516854459cbf6cda97917b73e91

  • findCgroup doesn't seem to be tested or have any comments explaining the format of the proc entry that it is reading from (see the sketch after this list).
  • Suggest adding a comment to startHoststat() explaining that the pid 1 cgroup, being the parent of all other processes, contains stats for the whole system
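
For context, /proc/[pid]/cgroup has a documented line format (see cgroups(7)): one "hierarchy-ID:controller-list:path" entry per v1 hierarchy, plus a single "0::path" entry for v2. A hedged sketch of reading it (cgroupPath is a hypothetical helper, not the branch's findCgroup):

package main

import (
    "fmt"
    "os"
    "strings"
)

// cgroupPath returns pid's cgroup path: the v2 entry ("0::/path") if
// encountered, otherwise the v1 entry listing the wanted controller
// ("ID:controller[,controller...]:/path"). First matching entry wins.
func cgroupPath(pid int, controller string) (string, error) {
    data, err := os.ReadFile(fmt.Sprintf("/proc/%d/cgroup", pid))
    if err != nil {
        return "", err
    }
    for _, line := range strings.Split(strings.TrimSpace(string(data)), "\n") {
        fields := strings.SplitN(line, ":", 3)
        if len(fields) != 3 {
            continue
        }
        if fields[0] == "0" && fields[1] == "" {
            return fields[2], nil // cgroups v2 unified entry
        }
        for _, name := range strings.Split(fields[1], ",") {
            if name == controller {
                return fields[2], nil // matching v1 controller
            }
        }
    }
    return "", fmt.Errorf("no cgroup entry for pid %d", pid)
}

func main() {
    path, err := cgroupPath(os.Getpid(), "memory")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    fmt.Println("own cgroup:", path)
}
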
#25

Updated by Tom Clegg about 1 year ago

Turns out init's cgroup isn't a good way to get host stats, at least on debian 12 -- it looks like it's just tracking memory use by init itself. Updated to use crunch-run's own cgroup instead. This isn't ideal either, but at least it has crunch-run, arv-mount, and keepstore, so it's much more useful than init.

Added findCgroup comments, and some more tests using the example cgroup files captured from various OSes.

17244-cgroup2 @ 31779a06b28e21a9409ec7c6310f0871b65d13ff -- developer-run-tests: #3748

#27

Updated by Peter Amstutz about 1 year ago

So the overall strategy seems to be to identify the various files that need to be read (looking in both the v2 and v1 locations) and put them in a map, then the code to actually read the stats goes through the map and opens the relevant file for each stat (wherever it is) and gets the numbers it needs -- because the files themselves are in more or less the same format between v1 and v2? Is that right?

The other substantial change seems to be that we don't use container ID to determine the cgroup, we start from the container's init process (pid 1 inside the container, pid whatever outside the container) and work backwards to figure out what cgroups contain the container, is that right?

cgroupParentSubsystem := flags.String("cgroup-parent-subsystem", "", "use current cgroup for given `subsystem` as parent cgroup for container (subsystem argument is only relevant for cgroups v1; in cgroups v2 / unified mode, any non-empty value means use current cgroup)")

The message doesn't say what the default behavior is, especially for v2 "any non-empty value means use current cgroup" but doesn't say how that is different from the default behavior? I have no idea when I'd use this parameter.

Are the cgroup-root and cgroup-parent options also no longer relevant for cgroups v1? It looks like it has a more sophisticated method to discover where cgroups are on the system, so we just assume that covers all cases?

Minor thing but the message that states where it is getting stats seems to have changed from "notice: reading stats from /sys/fs/cgroup/..." to just "using /...", which is a bit less clear to the reader what the log message is supposed to be telling you.

#28

Updated by Tom Clegg about 1 year ago

Peter Amstutz wrote in #note-27:

> So the overall strategy seems to be to identify the various files that need to be read (looking in both the v2 and v1 locations) and put them in a map, then the code to actually read the stats goes through the map and opens the relevant file for each stat (wherever it is) and gets the numbers it needs -- because the files themselves are in more or less the same format between v1 and v2? Is that right?

Yes. The files that exist in both v1 and v2 world have the same format in both. So the setup stage figures out which files are usable (present and not empty), and then we don't need to do the trial-and-error thing every 10 seconds like we did before.
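
A minimal sketch of that setup-stage idea (pickStatFile is a hypothetical helper; the real crunchstat code differs): probe each statistic's candidate locations once, keep the first file that is present and non-empty, and reread only that path on every sample:

package main

import (
    "bytes"
    "fmt"
    "os"
)

// pickStatFile returns the first candidate file that exists and has
// content, so later samples skip the trial-and-error. This also
// covers the case where blkio.io_service_bytes exists but is empty
// while blkio.throttle.io_service_bytes has the real stats.
func pickStatFile(candidates ...string) (string, bool) {
    for _, path := range candidates {
        if data, err := os.ReadFile(path); err == nil && len(bytes.TrimSpace(data)) > 0 {
            return path, true
        }
    }
    return "", false
}

func main() {
    cg := "/sys/fs/cgroup/system.slice" // example cgroup directory
    if path, ok := pickStatFile(cg+"/io.stat", cg+"/blkio.io_service_bytes", cg+"/blkio.throttle.io_service_bytes"); ok {
        fmt.Println("reading I/O stats from", path)
    } else {
        fmt.Println("no usable I/O stats file")
    }
}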

> The other substantial change seems to be that we don't use container ID to determine the cgroup, we start from the container's init process (pid 1 inside the container, pid whatever outside the container) and work backwards to figure out what cgroups contain the container, is that right?

Yes.

> cgroupParentSubsystem := flags.String("cgroup-parent-subsystem", "", "use current cgroup for given `subsystem` as parent cgroup for container (subsystem argument is only relevant for cgroups v1; in cgroups v2 / unified mode, any non-empty value means use current cgroup)")
>
> The message doesn't say what the default behavior is, especially for v2 "any non-empty value means use current cgroup" but doesn't say how that is different from the default behavior? I have no idea when I'd use this parameter.

Yeah, that deserves a better explanation. I've added a link to the slurm install section where we recommend using it, and noted what default/blank means.

The docs say

> If your Slurm cluster uses the task/cgroup TaskPlugin, you can configure Crunch’s Docker containers to be dispatched inside Slurm’s cgroups. This provides consistent enforcement of resource constraints.

I think if you don't do this, the memory we ask for in the "srun" command only limits crunch-run's child processes, not the container itself, because the container gets re-homed under the docker daemon's cgroup.

> Are the cgroup-root and cgroup-parent options also no longer relevant for cgroups v1? It looks like it has a more sophisticated method to discover where cgroups are on the system, so we just assume that covers all cases?

Yes. Essentially the cgroup-root and cgroup-parent options existed because we didn't have the code to use the actual linux APIs to find the cgroup files.

> Minor thing but the message that states where it is getting stats seems to have changed from "notice: reading stats from /sys/fs/cgroup/..." to just "using /...", which is a bit less clear to the reader what the log message is supposed to be telling you.

Good point. Restored.

17244-cgroup2 @ 24662d47cee534f72787667620451358d95ab5ec -- developer-run-tests: #3753

#29

Updated by Peter Amstutz about 1 year ago

See https://doc.arvados.org/main/install/crunch2-slurm/install-dispatch.html#CrunchRunCommand-cgroups

I think this should be https://doc.arvados.org/install/crunch2-slurm/install-dispatch.html#CrunchRunCommand-cgroups (without /main/) so that it links to the most recent stable docs not the development docs.

The rest LGTM although we should plan to do some post-merge testing to verify that this does in fact work on our various test clusters.

#30

Updated by Tom Clegg about 1 year ago

> I think this should be https://doc.arvados.org/install/crunch2-slurm/install-dispatch.html#CrunchRunCommand-cgroups (without /main/) so that it links to the most recent stable docs not the development docs.

Oops, yes, fixed.

> The rest LGTM although we should plan to do some post-merge testing to verify that this does in fact work on our various test clusters.

Agreed.

#31

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Development 2023-08-02 sprint to Development 2023-08-16
#32

Updated by Tom Clegg about 1 year ago

After fixing a stupid bug in 0b296b5a9, confirmed the updated crunchstat is using cgroups v2 on tordo.

2023-08-08T16:50:58.455605354Z notice: reading stats from /sys/fs/cgroup/system.slice/docker-d5b44b0d9c1178e64c87f7a90df8da5ace1c46e203022fe27de1f130017b62bc.scope/cpuset.cpus.effective
2023-08-08T16:50:58.455636658Z notice: reading stats from /sys/fs/cgroup/system.slice/docker-d5b44b0d9c1178e64c87f7a90df8da5ace1c46e203022fe27de1f130017b62bc.scope/cpu.stat
2023-08-08T16:50:58.455659078Z notice: reading stats from /sys/fs/cgroup/system.slice/docker-d5b44b0d9c1178e64c87f7a90df8da5ace1c46e203022fe27de1f130017b62bc.scope/io.stat
2023-08-08T16:50:58.455691719Z notice: reading stats from /sys/fs/cgroup/system.slice/docker-d5b44b0d9c1178e64c87f7a90df8da5ace1c46e203022fe27de1f130017b62bc.scope/memory.stat
2023-08-08T16:50:58.455710535Z notice: reading stats from /sys/fs/cgroup/system.slice/docker-d5b44b0d9c1178e64c87f7a90df8da5ace1c46e203022fe27de1f130017b62bc.scope/memory.current
2023-08-08T16:50:58.455728355Z notice: reading stats from /sys/fs/cgroup/system.slice/docker-d5b44b0d9c1178e64c87f7a90df8da5ace1c46e203022fe27de1f130017b62bc.scope/memory.swap.current
2023-08-08T16:50:58.455764250Z using /proc/1864/net/dev
2023-08-08T16:50:58.455768209Z notice: monitoring temp dir /tmp/crunch-run.tordo-dz642-90pv9gjl8h27if6.1388816401
2023-08-08T16:50:58.455880720Z mem 0 swap 0 pgmajfault 483328 rss
2023-08-08T16:50:58.456829468Z cpu 0.0358 user 0.0095 sys 2 cpus
2023-08-08T16:50:58.456864772Z blkio:259:0 0 write 151552 read
2023-08-08T16:50:58.456897117Z net:eth0 0 tx 470 rx
2023-08-08T16:50:58.456913549Z statfs 199172988928 available 310394880 used 210237366272 total
Signs this is the new crunchstat:
  • no "cache" in mem stats (only swap, pgmajfault, rss)
  • reading stats from /sys/fs/cgroup/system.slice/... not /sys/fs/cgroup/blkio/...
#33

Updated by Tom Clegg about 1 year ago

Current tordo image is not giving us the number of CPUs in any of the expected ways, see #20835#note-8. Need to investigate.

#34

Updated by Tom Clegg about 1 year ago

Turns out "cpuset" is not the best way to check # cpus available to a docker container. There's a "cpu.max" file that indicates fractional shares (like docker run --cpus=1.5). AFAICT when docker doesn't limit cpu usage, that file says "max" instead of a number, and we just have to look at /proc/cpuinfo to determine how many CPUs the host has.

Updated test fixtures for debian11, debian12, ubuntu1804, ubuntu2004, ubuntu2204 accordingly, and added a debian10 test fixture obtained from the tordo compute image.

17244-cgroup2-cpu-max @ 2f77cdcb71af8a6d250397a808faf5eec665571a -- developer-run-tests: #3776

#35

Updated by Tom Clegg about 1 year ago

Tested dev version on tordo:

2023-08-15T14:13:58.987090099Z crunch-run 2f77cdcb71af8a6d250397a808faf5eec665571a-dev (go1.20.6) started
2023-08-15T14:14:23.200104457Z notice: reading stats from /sys/fs/cgroup/system.slice/docker-9de9acef62437678862f920bb278af7d321b4ba47469bf02d55c1c95e0495481.scope/cpu.max
2023-08-15T14:14:23.200170743Z notice: reading stats from /sys/fs/cgroup/system.slice/docker-9de9acef62437678862f920bb278af7d321b4ba47469bf02d55c1c95e0495481.scope/cpu.stat
2023-08-15T14:14:23.202355561Z notice: reading stats from /sys/fs/cgroup/system.slice/docker-9de9acef62437678862f920bb278af7d321b4ba47469bf02d55c1c95e0495481.scope/io.stat
2023-08-15T14:14:23.203292975Z notice: reading stats from /sys/fs/cgroup/system.slice/docker-9de9acef62437678862f920bb278af7d321b4ba47469bf02d55c1c95e0495481.scope/memory.stat
2023-08-15T14:14:23.203535198Z notice: reading stats from /sys/fs/cgroup/system.slice/docker-9de9acef62437678862f920bb278af7d321b4ba47469bf02d55c1c95e0495481.scope/memory.current
2023-08-15T14:14:23.203568746Z notice: reading stats from /sys/fs/cgroup/system.slice/docker-9de9acef62437678862f920bb278af7d321b4ba47469bf02d55c1c95e0495481.scope/memory.swap.current
2023-08-15T14:14:23.203627404Z using /proc/3100/net/dev
2023-08-15T14:14:23.203633314Z notice: monitoring temp dir /tmp/crunch-run.tordo-dz642-5jhhvnoh4wnea5a.1554444433
2023-08-15T14:14:23.204016483Z mem 0 swap 0 pgmajfault 2613248 rss
2023-08-15T14:14:23.205581213Z cpu 0.0615 user 0.0081 sys 1.00 cpus
2023-08-15T14:14:23.205633511Z blkio:259:0 0 write 225280 read
2023-08-15T14:14:23.205639453Z blkio:259:4 0 write 1740800 read
2023-08-15T14:14:23.205642999Z blkio:254:0 0 write 1740800 read
2023-08-15T14:14:23.205674750Z net:eth0 0 tx 520 rx
2023-08-15T14:14:23.205689830Z statfs 198700032000 available 783351808 used 210237366272 total
#36

Updated by Lucas Di Pentima about 1 year ago

Just one suggestion:

  • If users could be relying on crunchstat-summary's report format, it would be convenient to add an upgrade note saying that the "cpu" category is not an integer number anymore.

Other than that, it LGTM.

#37

Updated by Tom Clegg about 1 year ago

  • % Done changed from 66 to 100
  • Status changed from In Progress to Resolved
#38

Updated by Peter Amstutz about 1 year ago

  • Release set to 66