Bug #17244

Make sure cgroupsV2 works with Arvados

Added by Nico César 9 days ago. Updated 4 days ago.

Status: New
Priority: Normal
Assigned To: -
Category: Crunch
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Story points: -

Description

Reading

https://docs.docker.com/config/containers/runmetrics/

Running Docker on cgroup v2

Docker supports cgroup v2 experimentally since Docker 20.10. Running Docker on cgroup v2 also requires the following conditions to be satisfied:

containerd: v1.4 or later
runc: v1.0.0-rc91 or later
Kernel: v4.15 or later (v5.2 or later is recommended)
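The kernel requirement above can be checked mechanically. The following is a minimal sketch, assuming the release string has the usual `major.minor...` shape (as in the `5.9.0-5-amd64` and `4.19.0-13-amd64` kernels mentioned in this ticket); the function names and the three labels are hypothetical, not part of Arvados or Docker.

```python
import re

# Thresholds taken from the Docker documentation quoted above.
MINIMUM_KERNEL = (4, 15)
RECOMMENDED_KERNEL = (5, 2)


def parse_kernel_release(release):
    """Extract (major, minor) from a release string such as '5.9.0-5-amd64'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_support(release):
    """Classify a kernel release against the cgroup v2 requirements.

    The labels ('unsupported', 'supported', 'recommended') are
    illustrative names chosen for this sketch.
    """
    version = parse_kernel_release(release)
    if version < MINIMUM_KERNEL:
        return "unsupported"
    if version < RECOMMENDED_KERNEL:
        return "supported"
    return "recommended"
```

On a live host the release string could be taken from `platform.release()` in the Python standard library, for example `kernel_support(platform.release())`.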

Note that the cgroup v2 mode behaves slightly different from the cgroup v1 mode:

The default cgroup driver (dockerd --exec-opt native.cgroupdriver) is “systemd” on v2, “cgroupfs” on v1.
The default cgroup namespace mode (docker run --cgroupns) is “private” on v2, “host” on v1.
The docker run flags --oom-kill-disable and --kernel-memory are discarded on v2.
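To see which of these defaults a given daemon is actually using, the cgroup fields reported by `docker info --format '{{json .}}'` can be inspected. The sketch below assumes the JSON contains `CgroupVersion` and `CgroupDriver` keys, as Docker 20.10 reports them; older daemons may omit `CgroupVersion`, which is treated here as "unknown". The function name and the shape of the returned dictionary are hypothetical.

```python
import json

# Default cgroup driver per cgroup version, as listed in the Docker notes above.
EXPECTED_DEFAULT_DRIVER = {"1": "cgroupfs", "2": "systemd"}


def summarize_cgroup_settings(docker_info_json):
    """Summarize cgroup settings from the JSON output of `docker info`.

    Assumes the keys 'CgroupVersion' and 'CgroupDriver'; a missing key
    is reported as 'unknown' rather than guessed.
    """
    info = json.loads(docker_info_json)
    version = str(info.get("CgroupVersion", "unknown"))
    driver = str(info.get("CgroupDriver", "unknown"))
    expected = EXPECTED_DEFAULT_DRIVER.get(version)
    return {
        "version": version,
        "driver": driver,
        # True only when the version is known and the driver is its default.
        "is_default_driver": expected is not None and driver == expected,
    }
```

The JSON string could be captured with `subprocess.run(["docker", "info", "--format", "{{json .}}"], capture_output=True, text=True).stdout` and passed to the function as-is.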

With all these changes, we have to make sure that:

  1. We can run on a distro that has cgroup v2 by default (as in Fedora 2020), or with kernel parameters that boot up with cgroups v2 enabled in systemd (kernel param systemd.unified_cgroup_hierarchy=1), and Docker version >= 20.10
  2. We can guide the admin through the upgrade to cgroup v2 and have an easy test case to check that Arvados will run
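As a starting point for such a test case, the host's cgroup mode can be detected before anything is run. This is a minimal sketch, assuming the unified hierarchy is mounted at `/sys/fs/cgroup` (where a `cgroup.controllers` file exists only under cgroup v2) and that the kernel command line is readable from `/proc/cmdline`; the function names are hypothetical.

```python
from pathlib import Path


def uses_cgroup_v2(cgroup_root="/sys/fs/cgroup"):
    """Return True if the cgroup root looks like a cgroup v2 unified hierarchy.

    The root of a v2 hierarchy exposes a 'cgroup.controllers' file.
    """
    return (Path(cgroup_root) / "cgroup.controllers").is_file()


def unified_hierarchy_requested(cmdline):
    """Return True if the kernel command line asks systemd for cgroup v2."""
    return "systemd.unified_cgroup_hierarchy=1" in cmdline.split()


if __name__ == "__main__":
    print("cgroup v2 in use:", uses_cgroup_v2())
    proc_cmdline = Path("/proc/cmdline")
    if proc_cmdline.is_file():
        requested = unified_hierarchy_requested(proc_cmdline.read_text())
        print("requested on kernel command line:", requested)
```

A distro that defaults to cgroup v2 may report it in use without the kernel parameter being present, so the two checks are reported separately.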

The last point is important because the current error is kind of cryptic:

applying cgroup configuration for process caused: cannot enter cgroupv2 "/sys/fs/cgroup/docker" with domain controllers

There are also cryptic messages with a cgroupsv2-enabled host and Docker 19.03.13:

Status: Downloaded newer image for hello-world:latest
docker: Error response from daemon: cgroups: cgroup mountpoint does not exist: unknown.
ERRO[0005] error waiting for container: context canceled
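One way to guide the admin would be to translate these messages into a hint. The sketch below matches only the two error fragments quoted in this ticket; the function name and the hint wording are hypothetical, and the hints restate what is reported here rather than a confirmed root cause.

```python
# (error fragment, hint) pairs; both fragments are quoted from this ticket.
KNOWN_CGROUP_ERRORS = [
    (
        "cannot enter cgroupv2",
        "The host appears to run cgroup v2 and the container could not join "
        "its cgroup; see bug #17244.",
    ),
    (
        "cgroup mountpoint does not exist",
        "Seen with Docker 19.03.13 on a cgroup v2 host; Docker supports "
        "cgroup v2 only since 20.10.",
    ),
]


def explain_docker_error(message):
    """Return a hint for a known cgroup-related error, or None if unrecognized."""
    for fragment, hint in KNOWN_CGROUP_ERRORS:
        if fragment in message:
            return hint
    return None
```

Returning `None` for anything unrecognized keeps the helper from mislabeling unrelated Docker failures as cgroup problems.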

History

#1 Updated by Nico César 9 days ago

  • Target version set to 2021-01-20 Sprint
  • Category set to Crunch

#2 Updated by Javier Bértoli 4 days ago

I tried Arvados with the following setup:

1. Built binaries/images from current master (commit e98f4df4a@arvados)
2. Created a cluster
3. Ran the test script from the salt-install test dir
4. With kernel Linux 5.9.0-5-amd64 & cgroups2 (as documented here, I have /sys/fs/cgroup/cgroup.controllers)
5. Using docker 20.10
6. Using containerd 1.4.3
7. When I run the script, I get:

+ cwl-runner hasher-workflow.cwl hasher-workflow-job.yml
INFO /usr/bin/cwl-runner 2.1.1, arvados-python-client 2.1.1, cwltool 3.0.20200807132242
INFO Resolved 'hasher-workflow.cwl' to 'file:///usr/src/arvados/tests/hasher-workflow.cwl'
INFO hasher-workflow.cwl:36:7: Unknown hint WorkReuse
INFO hasher-workflow.cwl:50:7: Unknown hint WorkReuse
INFO hasher-workflow.cwl:64:7: Unknown hint WorkReuse
INFO Using cluster arvie (https://arvie.arv.local:8000/)
INFO Upload local files: "test.txt" 
INFO Using collection f55e750025853f5b8ccae3ca79240f65+54 (arvie-4zz18-zbm7cmmt5h9d5rg)
INFO Using collection cache size 256 MiB
INFO [container hasher-workflow.cwl] submitted container_request arvie-xvhdp-7jpooik0zd8aj1t
INFO [container hasher-workflow.cwl] arvie-xvhdp-7jpooik0zd8aj1t is Final
ERROR [container hasher-workflow.cwl] (arvie-dz642-4v8xcwcvjvp5j2f) error log:

  2021-01-11T20:56:51.604627332Z crunch-run crunch-run dev (go1.15) started
  2021-01-11T20:56:51.604709650Z crunch-run Executing container 'arvie-dz642-4v8xcwcvjvp5j2f'
  2021-01-11T20:56:51.604763728Z crunch-run Executing on host '27d4cb3c42e2'
  2021-01-11T20:56:51.871544244Z crunch-run Fetching Docker image from collection '0428f2e88f4b398b8489f6c454e7e9ae+261'
  2021-01-11T20:56:51.940054697Z crunch-run Using Docker image id 'sha256:0dd5078a5bec49810c1fcb86b60e1bda6b9c1e12dc2c3de75453b2fd37a55885'
  2021-01-11T20:56:51.943832124Z crunch-run Docker image is available
  2021-01-11T20:56:51.952139500Z crunch-run Running [arv-mount --foreground --allow-other --read-write --crunchstat-interval=10 --file-cache 268435456 --mount-tmp tmp0 --mount-by-pdh by_id /tmp/crunch-run.arvie-dz642-4v8xcwcvjvp5j2f.288172359/keep406717434]
  2021-01-11T20:56:52.454639768Z crunch-run Creating Docker container
  2021-01-11T20:56:52.509556810Z crunch-run Attaching container streams
  2021-01-11T20:56:53.205291750Z crunch-run Starting Docker container id '7d91dac5eb133131cc9b131d1f0280810acf9c4eda6209b674546bb885c90606'
  2021-01-11T20:56:53.397951196Z crunch-run error in Run: could not start container: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:326: applying cgroup configuration for process caused: cannot enter cgroupv2 "/sys/fs/cgroup/docker" with domain controllers -- it is in an invalid state: unknown
  2021-01-11T20:56:53.752428822Z crunch-run Cancelled
ERROR Overall process status is permanentFail
INFO Final output collection None
{}
WARNING Final process status is permanentFail

The same images and setup work OK with:

  • Linux 4.19.0-13-amd64 with systemd 241.7 (with cgroupsv1)

#3 Updated by Javier Bértoli 4 days ago

According to this issue, Debian's systemd has defaulted to cgroupsv2 since 242-7 and Docker 20.10.x
