Bug #17244
Closed
Make sure cgroupsV2 works with Arvados
Added by Nico César almost 4 years ago. Updated over 1 year ago.
Description
Reading https://docs.docker.com/config/containers/runmetrics/:
Running Docker on cgroup v2
Docker supports cgroup v2 experimentally since Docker 20.10. Running Docker on cgroup v2 also requires the following conditions to be satisfied:
- containerd: v1.4 or later
- runc: v1.0.0-rc91 or later
- Kernel: v4.15 or later (v5.2 or later is recommended)
Note that the cgroup v2 mode behaves slightly different from the cgroup v1 mode:
- The default cgroup driver (dockerd --exec-opt native.cgroupdriver) is “systemd” on v2, “cgroupfs” on v1.
- The default cgroup namespace mode (docker run --cgroupns) is “private” on v2, “host” on v1.
- The docker run flags --oom-kill-disable and --kernel-memory are discarded on v2.
With all these changes, we have to make sure that:
- We can run a distro that has cgroup v2 by default (as in Fedora 2020), or one that boots with cgroups v2 enabled in systemd (kernel param systemd.unified_cgroup_hierarchy=1), and docker version >= 2020.04
- We can guide the admin through upgrading to cgroup v2, with an easy-to-check test case showing that Arvados will run
The last point is important because the current error is kind of cryptic:
applying cgroup configuration for process caused: cannot enter cgroupv2 "/sys/fs/cgroup/docker" with domain controllers
There are also cryptic messages on a cgroups v2 enabled host with Docker 19.03.13:
Status: Downloaded newer image for hello-world:latest
docker: Error response from daemon: cgroups: cgroup mountpoint does not exist: unknown.
ERRO[0005] error waiting for container: context canceled
https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html
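For the "easy to check" test case above, one quick way to see which cgroup mode a host is running is to look for cgroup.controllers; this is a minimal standalone sketch (not Arvados code), assuming the standard systemd mount points:

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// In unified (v2-only) mode, cgroup.controllers exists at the cgroupfs root.
	if _, err := os.Stat("/sys/fs/cgroup/cgroup.controllers"); err == nil {
		fmt.Println("cgroups v2 (unified mode)")
	} else if _, err := os.Stat("/sys/fs/cgroup/unified/cgroup.controllers"); err == nil {
		// systemd's hybrid mode mounts a v2 hierarchy at /sys/fs/cgroup/unified
		// alongside the v1 controllers.
		fmt.Println("cgroups v1 + v2 (hybrid mode)")
	} else {
		fmt.Println("cgroups v1 only")
	}
}
```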
We can remove the crunchstat command-line program and debian package rather than update it.
Updated by Nico César almost 4 years ago
- Target version set to 2021-01-20 Sprint
- Category set to Crunch
Updated by Javier Bértoli almost 4 years ago
I tried Arvados with the following setup:
1. Built binaries/images from current master (commit e98f4df4a@arvados)
2. Created a cluster
3. Ran the test script from the salt-install test dir
4. With kernel Linux 5.9.0-5-amd64 & cgroups2 (as documented here, I have /sys/fs/cgroup/cgroup.controllers)
5. Using docker 20.10
6. Using containerd 1.4.3
7. When I run the script, I get:
+ cwl-runner hasher-workflow.cwl hasher-workflow-job.yml
INFO /usr/bin/cwl-runner 2.1.1, arvados-python-client 2.1.1, cwltool 3.0.20200807132242
INFO Resolved 'hasher-workflow.cwl' to 'file:///usr/src/arvados/tests/hasher-workflow.cwl'
INFO hasher-workflow.cwl:36:7: Unknown hint WorkReuse
INFO hasher-workflow.cwl:50:7: Unknown hint WorkReuse
INFO hasher-workflow.cwl:64:7: Unknown hint WorkReuse
INFO Using cluster arvie (https://arvie.arv.local:8000/)
INFO Upload local files: "test.txt"
INFO Using collection f55e750025853f5b8ccae3ca79240f65+54 (arvie-4zz18-zbm7cmmt5h9d5rg)
INFO Using collection cache size 256 MiB
INFO [container hasher-workflow.cwl] submitted container_request arvie-xvhdp-7jpooik0zd8aj1t
INFO [container hasher-workflow.cwl] arvie-xvhdp-7jpooik0zd8aj1t is Final
ERROR [container hasher-workflow.cwl] (arvie-dz642-4v8xcwcvjvp5j2f) error log:
2021-01-11T20:56:51.604627332Z crunch-run crunch-run dev (go1.15) started
2021-01-11T20:56:51.604709650Z crunch-run Executing container 'arvie-dz642-4v8xcwcvjvp5j2f'
2021-01-11T20:56:51.604763728Z crunch-run Executing on host '27d4cb3c42e2'
2021-01-11T20:56:51.871544244Z crunch-run Fetching Docker image from collection '0428f2e88f4b398b8489f6c454e7e9ae+261'
2021-01-11T20:56:51.940054697Z crunch-run Using Docker image id 'sha256:0dd5078a5bec49810c1fcb86b60e1bda6b9c1e12dc2c3de75453b2fd37a55885'
2021-01-11T20:56:51.943832124Z crunch-run Docker image is available
2021-01-11T20:56:51.952139500Z crunch-run Running [arv-mount --foreground --allow-other --read-write --crunchstat-interval=10 --file-cache 268435456 --mount-tmp tmp0 --mount-by-pdh by_id /tmp/crunch-run.arvie-dz642-4v8xcwcvjvp5j2f.288172359/keep406717434]
2021-01-11T20:56:52.454639768Z crunch-run Creating Docker container
2021-01-11T20:56:52.509556810Z crunch-run Attaching container streams
2021-01-11T20:56:53.205291750Z crunch-run Starting Docker container id '7d91dac5eb133131cc9b131d1f0280810acf9c4eda6209b674546bb885c90606'
2021-01-11T20:56:53.397951196Z crunch-run error in Run: could not start container: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:326: applying cgroup configuration for process caused: cannot enter cgroupv2 "/sys/fs/cgroup/docker" with domain controllers -- it is in an invalid state: unknown
2021-01-11T20:56:53.752428822Z crunch-run Cancelled
ERROR Overall process status is permanentFail
INFO Final output collection None {}
WARNING Final process status is permanentFail
Using the same images and setup with Linux 4.19.0-13-amd64 and systemd 241.7 (cgroups v1) works OK.
Updated by Javier Bértoli almost 4 years ago
According to this issue, Debian's systemd defaults to cgroupsv2 since 242-7 and docker 20.10.x.
Updated by Peter Amstutz almost 4 years ago
- Target version deleted (2021-01-20 Sprint)
Updated by Nico César almost 4 years ago
- Related to Bug #17270: Test for docker cgroups issues in crunch-run works on ubuntu 20.04 added
Updated by Peter Amstutz over 1 year ago
- Target version set to Development 2023-06-21 sprint
Updated by Peter Amstutz over 1 year ago
- Related to Bug #20616: "cgroup stats files never appeared" on scale cluster added
Updated by Peter Amstutz over 1 year ago
- Target version changed from Development 2023-06-21 sprint to Future
Updated by Tom Clegg over 1 year ago
Updated by Peter Amstutz over 1 year ago
- Target version changed from Future to Development 2023-07-05 sprint
Updated by Peter Amstutz over 1 year ago
- Target version changed from Development 2023-07-05 sprint to Development 2023-07-19 sprint
Updated by Peter Amstutz over 1 year ago
- Release deleted (60)
- Story points set to 2.0
Updated by Tom Clegg over 1 year ago
Test suite should assume host is in hybrid mode, and test both cgroup1 and cgroup2.
Updated by Tom Clegg over 1 year ago
- crunchstat is already a bit crufty: on each sample collection it checks multiple places for stat files and logs when the source changes, but AFAICT this is all pointless now that we wait for the target cgroup to appear before logging any stats
- Ubuntu 20.04 comes with cgroups in hybrid mode but I/O stats are only available through v1, so the v1/v2 decision is per-statistic
- Now that I've added "get container root pid from docker-inspect" in crunch-run, the "CID file" mechanism seems redundant
- Most of the tests (including some in lib/crunchrun) rely on forcing crunchstat into v1 mode
TODO: figure out what crunch-run -cgroup-parent-subsystem=X should do in cgroups v2 (any non-empty value means use current process's cgroup?)
Updated by Tom Clegg over 1 year ago
17244-cgroup2 @ 28a733a8823fedadc34a560935abdd17039cb100 -- developer-run-tests: #3744
Rewrote the stats-collection side, added snapshots of relevant OS files for ubuntu1804, ubuntu2004, ubuntu2204, debian11, debian12.
If using cgroups v2 in unified mode, crunch-run -cgroup-parent-subsystem=X means use the current cgroup (so a system configured to run crunch-run -cgroup-parent-subsystem=cpuset should continue to work when the compute nodes lose v1 support).
The -cgroup-root and -cgroup-parent flags are ignored now. Instead we get the cgroupfs root by reading /proc/mounts, and get the process's cgroup by waiting for docker/singularity to return an inside-container process ID and reading /proc/$PID/cgroup.
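A minimal standalone sketch of that mechanism (the helper names here are illustrative, not the actual crunch-run functions):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// cgroup2Root scans /proc/mounts for the cgroup2 filesystem mount point.
func cgroup2Root() (string, error) {
	f, err := os.Open("/proc/mounts")
	if err != nil {
		return "", err
	}
	defer f.Close()
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// /proc/mounts fields: device mountpoint fstype options dump pass
		fields := strings.Fields(scanner.Text())
		if len(fields) >= 3 && fields[2] == "cgroup2" {
			return fields[1], nil
		}
	}
	return "", fmt.Errorf("no cgroup2 mount found")
}

// cgroupOfPid returns the unified-hierarchy cgroup path for pid. In
// /proc/$PID/cgroup the v2 hierarchy appears as a line "0::/some/path";
// v1 lines look like "N:controller:/path" instead.
func cgroupOfPid(pid int) (string, error) {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/cgroup", pid))
	if err != nil {
		return "", err
	}
	for _, line := range strings.Split(string(data), "\n") {
		if strings.HasPrefix(line, "0::") {
			return strings.TrimPrefix(line, "0::"), nil
		}
	}
	return "", fmt.Errorf("no unified cgroup entry for pid %d", pid)
}

func main() {
	root, err := cgroup2Root()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	path, err := cgroupOfPid(os.Getpid())
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	// e.g. /sys/fs/cgroup + /user.slice/... = directory holding the stat files
	fmt.Println("stats dir:", root+path)
}
```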
It turns out there are many permutations of the various cgroups v1/v2 stats files. Some systems have blkio.io_service_bytes but it's empty and blkio.throttle.io_service_bytes has the real stats... sigh.
The crunchstat command line program is now a subcommand of arvados-server, and its debian package build entry is removed. I thought we were going to delete it entirely, but it's still useful for:
- generating test cases by taking a snapshot of all the procfs/sysfs files it chooses on a given system
- debugging/testing manually on a new compute image
Updated by Tom Clegg over 1 year ago
Added #20756 for proper singularity support.
Until we do #20756, singularity doesn't put the container in a separate cgroup, so the above branch generates crunchstat logs based on the host, which is misleading. It's probably better to just log "Pid() never returned a process ID", as we were doing before.
17244-cgroup2 @ 9b055a03a8a6b516854459cbf6cda97917b73e91 -- developer-run-tests: #3745
Updated by Tom Clegg over 1 year ago
- Blocks Feature #20756: Support crunchstat tracking and memory limits with singularity added
Updated by Tom Clegg over 1 year ago
- Target version changed from Development 2023-07-19 sprint to Development 2023-08-02 sprint
Updated by Peter Amstutz over 1 year ago
17244-cgroup2 @ 9b055a03a8a6b516854459cbf6cda97917b73e91
- findCgroup doesn't seem to be tested or have any comments explaining the format of the proc entry that it is reading from.
- Suggest adding a comment to startHoststat() explaining that the pid 1 cgroup, being the parent of all other processes, contains stats for the whole system.
Updated by Tom Clegg over 1 year ago
Turns out init's cgroup isn't a good way to get host stats, at least on debian 12 -- it looks like it's just tracking memory use by init itself. Updated to use crunch-run's own cgroup instead. This isn't ideal either, but at least it includes crunch-run, arv-mount, and keepstore, so it's much more useful than init's.
Added findCgroup comments, and some more tests using the example cgroup files captured from various OSes.
17244-cgroup2 @ 31779a06b28e21a9409ec7c6310f0871b65d13ff -- developer-run-tests: #3748
Updated by Tom Clegg over 1 year ago
17244-cgroup2 @ 31779a06b28e21a9409ec7c6310f0871b65d13ff -- developer-run-tests: #3751
Updated by Peter Amstutz over 1 year ago
So the overall strategy seems to be to identify the various files that need to be read (looking in both the v2 and v1 locations) and put them in a map, then the code to actually read the stats goes through the map and opens the relevant file for each stat (wherever it is) and gets the numbers it needs -- because the files themselves are in more or less the same format between v1 and v2? Is that right?
The other substantial change seems to be that we don't use container ID to determine the cgroup, we start from the container's init process (pid 1 inside the container, pid whatever outside the container) and work backwards to figure out what cgroups contain the container, is that right?
cgroupParentSubsystem := flags.String("cgroup-parent-subsystem", "", "use current cgroup for given `subsystem` as parent cgroup for container (subsystem argument is only relevant for cgroups v1; in cgroups v2 / unified mode, any non-empty value means use current cgroup)")
The message doesn't say what the default behavior is. For v2, "any non-empty value means use current cgroup", but it doesn't say how that differs from the default behavior. I have no idea when I'd use this parameter.
Are the cgroup-root and cgroup-parent options also no longer relevant for cgroups v1? It looks like it has a more sophisticated method to discover where cgroups are on the system, so we just assume that covers all cases?
Minor thing, but the message that states where it is getting stats seems to have changed from notice: reading stats from /sys/fs/cgroup/... to just using /..., which makes it a bit less clear what the log message is supposed to be telling you.
Updated by Tom Clegg over 1 year ago
Peter Amstutz wrote in #note-27:
So the overall strategy seems to be to identify the various files that need to be read (looking in both the v2 and v1 locations) and put them in a map, then the code to actually read the stats goes through the map and opens the relevant file for each stat (wherever it is) and gets the numbers it needs -- because the files themselves are in more or less the same format between v1 and v2? Is that right?
Yes. The files that exist in both v1 and v2 world have the same format in both. So the setup stage figures out which files are usable (present and not empty), and then we don't need to do the trial-and-error thing every 10 seconds like we did before.
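Roughly, the setup stage works like this sketch (illustrative names and paths, not the real code): probe each candidate location once, keep the first file that exists and is non-empty, and re-read only those files at each sample:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// firstUsable returns the first candidate file that exists and is non-empty,
// skipping cases like an empty blkio.io_service_bytes.
func firstUsable(dir string, candidates ...string) string {
	for _, name := range candidates {
		path := filepath.Join(dir, name)
		if data, err := os.ReadFile(path); err == nil && len(data) > 0 {
			return path
		}
	}
	return ""
}

func main() {
	// Hypothetical location of the container's cgroup directory.
	dir := "/sys/fs/cgroup/system.slice/example.scope"
	sources := map[string]string{
		// memory.stat has the same filename and format in v1 and v2
		// (under different mount points in v1).
		"memory": firstUsable(dir, "memory.stat"),
		"io": firstUsable(dir,
			"io.stat",                         // v2
			"blkio.throttle.io_service_bytes", // v1; often has the real data
			"blkio.io_service_bytes"),         // v1; sometimes present but empty
	}
	// Sampling loop: no per-sample trial and error, just read the chosen files.
	for stat, path := range sources {
		if path == "" {
			fmt.Printf("%s: no usable stats file\n", stat)
			continue
		}
		if data, err := os.ReadFile(path); err == nil {
			fmt.Printf("%s: read %d bytes from %s\n", stat, len(data), path)
		}
	}
}
```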
The other substantial change seems to be that we don't use container ID to determine the cgroup, we start from the container's init process (pid 1 inside the container, pid whatever outside the container) and work backwards to figure out what cgroups contain the container, is that right?
Yes.
cgroupParentSubsystem := flags.String("cgroup-parent-subsystem", "", "use current cgroup for given `subsystem` as parent cgroup for container (subsystem argument is only relevant for cgroups v1; in cgroups v2 / unified mode, any non-empty value means use current cgroup)")
The message doesn't say what the default behavior is. For v2, "any non-empty value means use current cgroup", but it doesn't say how that differs from the default behavior. I have no idea when I'd use this parameter.
Yeah, that deserves a better explanation. I've added a link to the slurm install section where we recommend using it, and noted what default/blank means.
The docs say: "If your Slurm cluster uses the task/cgroup TaskPlugin, you can configure Crunch’s Docker containers to be dispatched inside Slurm’s cgroups. This provides consistent enforcement of resource constraints."
I think if you don't do this, the memory we ask for in the "srun" command only limits crunch-run's child processes, not the container itself, because the container gets re-homed under the docker daemon's cgroup.
Are the cgroup-root and cgroup-parent options also no longer relevant for cgroups v1? It looks like it has a more sophisticated method to discover where cgroups are on the system, so we just assume that covers all cases?
Yes. Essentially the cgroup-root and cgroup-parent options existed because we didn't have the code to use the actual linux APIs to find the cgroup files.
Minor thing, but the message that states where it is getting stats seems to have changed from notice: reading stats from /sys/fs/cgroup/... to just using /..., which makes it a bit less clear what the log message is supposed to be telling you.
Good point. Restored.
17244-cgroup2 @ 24662d47cee534f72787667620451358d95ab5ec -- developer-run-tests: #3753
Updated by Peter Amstutz over 1 year ago
See https://doc.arvados.org/main/install/crunch2-slurm/install-dispatch.html#CrunchRunCommand-cgroups
I think this should be https://doc.arvados.org/install/crunch2-slurm/install-dispatch.html#CrunchRunCommand-cgroups (without /main/) so that it links to the most recent stable docs, not the development docs.
The rest LGTM although we should plan to do some post-merge testing to verify that this does in fact work on our various test clusters.
Updated by Tom Clegg over 1 year ago
I think this should be https://doc.arvados.org/install/crunch2-slurm/install-dispatch.html#CrunchRunCommand-cgroups (without /main/) so that it links to the most recent stable docs, not the development docs.
Oops, yes, fixed.
The rest LGTM although we should plan to do some post-merge testing to verify that this does in fact work on our various test clusters.
Agreed.
Updated by Peter Amstutz over 1 year ago
- Target version changed from Development 2023-08-02 sprint to Development 2023-08-16
Updated by Tom Clegg over 1 year ago
After fixing a stupid bug in 0b296b5a9, confirmed the updated crunchstat is using cgroups v2 on tordo.
2023-08-08T16:50:58.455605354Z notice: reading stats from /sys/fs/cgroup/system.slice/docker-d5b44b0d9c1178e64c87f7a90df8da5ace1c46e203022fe27de1f130017b62bc.scope/cpuset.cpus.effective
2023-08-08T16:50:58.455636658Z notice: reading stats from /sys/fs/cgroup/system.slice/docker-d5b44b0d9c1178e64c87f7a90df8da5ace1c46e203022fe27de1f130017b62bc.scope/cpu.stat
2023-08-08T16:50:58.455659078Z notice: reading stats from /sys/fs/cgroup/system.slice/docker-d5b44b0d9c1178e64c87f7a90df8da5ace1c46e203022fe27de1f130017b62bc.scope/io.stat
2023-08-08T16:50:58.455691719Z notice: reading stats from /sys/fs/cgroup/system.slice/docker-d5b44b0d9c1178e64c87f7a90df8da5ace1c46e203022fe27de1f130017b62bc.scope/memory.stat
2023-08-08T16:50:58.455710535Z notice: reading stats from /sys/fs/cgroup/system.slice/docker-d5b44b0d9c1178e64c87f7a90df8da5ace1c46e203022fe27de1f130017b62bc.scope/memory.current
2023-08-08T16:50:58.455728355Z notice: reading stats from /sys/fs/cgroup/system.slice/docker-d5b44b0d9c1178e64c87f7a90df8da5ace1c46e203022fe27de1f130017b62bc.scope/memory.swap.current
2023-08-08T16:50:58.455764250Z using /proc/1864/net/dev
2023-08-08T16:50:58.455768209Z notice: monitoring temp dir /tmp/crunch-run.tordo-dz642-90pv9gjl8h27if6.1388816401
2023-08-08T16:50:58.455880720Z mem 0 swap 0 pgmajfault 483328 rss
2023-08-08T16:50:58.456829468Z cpu 0.0358 user 0.0095 sys 2 cpus
2023-08-08T16:50:58.456864772Z blkio:259:0 0 write 151552 read
2023-08-08T16:50:58.456897117Z net:eth0 0 tx 470 rx
2023-08-08T16:50:58.456913549Z statfs 199172988928 available 310394880 used 210237366272 total

Signs this is the new crunchstat:
- no "cache" in mem stats (only swap, pgmajfault, rss)
- reading stats from /sys/fs/cgroup/system.slice/... not /sys/fs/cgroup/blkio/...
Updated by Tom Clegg over 1 year ago
Current tordo image is not giving us the number of CPUs in any of the expected ways, see #20835#note-8. Need to investigate.
Updated by Tom Clegg over 1 year ago
Turns out "cpuset" is not the best way to check # cpus available to a docker container. There's a "cpu.max" file that indicates fractional shares (like docker run --cpus=1.5
). AFAICT when docker doesn't limit cpu usage, that file says "max" instead of a number, and we just have to look at /proc/cpuinfo to determine how many CPUs the host has.
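A minimal sketch of that logic (illustrative names and paths, not the actual crunchstat code):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// cpusFromCPUMax parses a v2 cpu.max file, which contains "<quota> <period>".
// Quota is the literal string "max" when no CPU limit is imposed; otherwise
// quota/period gives fractional CPUs (docker run --cpus=1.5 -> "150000 100000").
func cpusFromCPUMax(path string) (float64, bool) {
	data, err := os.ReadFile(path)
	if err != nil {
		return 0, false
	}
	fields := strings.Fields(string(data))
	if len(fields) != 2 || fields[0] == "max" {
		return 0, false // no limit; caller falls back to the host CPU count
	}
	quota, err1 := strconv.ParseFloat(fields[0], 64)
	period, err2 := strconv.ParseFloat(fields[1], 64)
	if err1 != nil || err2 != nil || period == 0 {
		return 0, false
	}
	return quota / period, true
}

// hostCPUs counts "processor" entries in /proc/cpuinfo.
func hostCPUs() int {
	f, err := os.Open("/proc/cpuinfo")
	if err != nil {
		return 0
	}
	defer f.Close()
	n := 0
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		if strings.HasPrefix(scanner.Text(), "processor") {
			n++
		}
	}
	return n
}

func main() {
	// Hypothetical path; the real file lives in the container's cgroup dir.
	if cpus, ok := cpusFromCPUMax("/sys/fs/cgroup/cpu.max"); ok {
		fmt.Printf("%.2f cpus (cgroup limit)\n", cpus)
	} else {
		fmt.Printf("%d cpus (host total)\n", hostCPUs())
	}
}
```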
Updated test fixtures for debian11, debian12, ubuntu1804, ubuntu2004, ubuntu2204 accordingly, and added a debian10 test fixture obtained from the tordo compute image.
17244-cgroup2-cpu-max @ 2f77cdcb71af8a6d250397a808faf5eec665571a -- developer-run-tests: #3776
Updated by Tom Clegg over 1 year ago
Tested dev version on tordo:
2023-08-15T14:13:58.987090099Z crunch-run 2f77cdcb71af8a6d250397a808faf5eec665571a-dev (go1.20.6) started
2023-08-15T14:14:23.200104457Z notice: reading stats from /sys/fs/cgroup/system.slice/docker-9de9acef62437678862f920bb278af7d321b4ba47469bf02d55c1c95e0495481.scope/cpu.max
2023-08-15T14:14:23.200170743Z notice: reading stats from /sys/fs/cgroup/system.slice/docker-9de9acef62437678862f920bb278af7d321b4ba47469bf02d55c1c95e0495481.scope/cpu.stat
2023-08-15T14:14:23.202355561Z notice: reading stats from /sys/fs/cgroup/system.slice/docker-9de9acef62437678862f920bb278af7d321b4ba47469bf02d55c1c95e0495481.scope/io.stat
2023-08-15T14:14:23.203292975Z notice: reading stats from /sys/fs/cgroup/system.slice/docker-9de9acef62437678862f920bb278af7d321b4ba47469bf02d55c1c95e0495481.scope/memory.stat
2023-08-15T14:14:23.203535198Z notice: reading stats from /sys/fs/cgroup/system.slice/docker-9de9acef62437678862f920bb278af7d321b4ba47469bf02d55c1c95e0495481.scope/memory.current
2023-08-15T14:14:23.203568746Z notice: reading stats from /sys/fs/cgroup/system.slice/docker-9de9acef62437678862f920bb278af7d321b4ba47469bf02d55c1c95e0495481.scope/memory.swap.current
2023-08-15T14:14:23.203627404Z using /proc/3100/net/dev
2023-08-15T14:14:23.203633314Z notice: monitoring temp dir /tmp/crunch-run.tordo-dz642-5jhhvnoh4wnea5a.1554444433
2023-08-15T14:14:23.204016483Z mem 0 swap 0 pgmajfault 2613248 rss
2023-08-15T14:14:23.205581213Z cpu 0.0615 user 0.0081 sys 1.00 cpus
2023-08-15T14:14:23.205633511Z blkio:259:0 0 write 225280 read
2023-08-15T14:14:23.205639453Z blkio:259:4 0 write 1740800 read
2023-08-15T14:14:23.205642999Z blkio:254:0 0 write 1740800 read
2023-08-15T14:14:23.205674750Z net:eth0 0 tx 520 rx
2023-08-15T14:14:23.205689830Z statfs 198700032000 available 783351808 used 210237366272 total
Updated by Lucas Di Pentima over 1 year ago
Just one suggestion:
- If users could be relying on crunchstat-summary's report format, it would be convenient to add an upgrade note saying that the "cpu" category is not an integer number anymore.
Other than that, it LGTM.
Updated by Tom Clegg over 1 year ago
- % Done changed from 66 to 100
- Status changed from In Progress to Resolved
Applied in changeset arvados|489bd1e4d3c25fa6c3c0070bc2110932301a08d3.