Task #20835
Bug #17244: Make sure cgroupsV2 works with Arvados (closed)
Update tordo compute image kernel config from "hybrid" to "unified" mode
Updated by Lucas Di Pentima over 1 year ago
- Start date set to 08/09/2023
- Status changed from New to In Progress
Updated by Lucas Di Pentima over 1 year ago
Updates at c56b8aaf7 - branch 20835-cgroupsv2-unified-mode
Compute image build pipeline for tordo: packer-build-compute-image: #231
- Re-enables unified_cgroup_hierarchy in GRUB's config (sketch below).
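For context, re-enabling the unified hierarchy on a systemd-based image is typically done via the systemd.unified_cgroup_hierarchy kernel parameter. A minimal sketch of what the GRUB side of the change could look like — the real change lives in the compute image build scripts at c56b8aaf7, so the file and exact flags here are assumptions:
# Hypothetical /etc/default/grub fragment: boot with the unified (v2-only) hierarchy.
GRUB_CMDLINE_LINUX="$GRUB_CMDLINE_LINUX systemd.unified_cgroup_hierarchy=1"
# Regenerate grub.cfg so the parameter is applied on the next boot.
update-grub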
Updated by Lucas Di Pentima over 1 year ago
Previous pipeline failed. I suspect it's related to golang changes, so I've rebased the branch to start from 0b296b5:
Updates at 5d5c219
New build pipeline: packer-build-compute-image: #232
Updated by Lucas Di Pentima over 1 year ago
Built and tested the AMI ami-01cbd6fb77f29928f on an instance and got the following:
$ stat -fc %T /sys/fs/cgroup/
cgroup2fs
$ mount | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)
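For completeness, the controllers actually available on the unified hierarchy can be listed from the root cgroup (a quick extra check, not part of the original verification):
# On a correctly configured cgroup v2 host this list should include
# cpuset, cpu, io, memory and pids.
$ cat /sys/fs/cgroup/cgroup.controllers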
I'll deploy the saltstack changes so tordo can use it.
Updated by Lucas Di Pentima over 1 year ago
I've tried running a WF on tordo and got the following crunchstat.txt log:
2023-08-10T14:36:40.138803409Z warning: Pid() did not return a process ID after 10s (config error?) -- still waiting...
2023-08-10T14:41:19.987801929Z warning: Pid() never returned a process ID
Does this mean that cgroupsv2 is improperly set up?
Updated by Tom Clegg over 1 year ago
This probably just means tordo is using singularity.
I suppose it would be much kinder to either:
- change that message to "resource usage tracking is not supported when using the singularity runtime", or
- do #20756
Updated by Lucas Di Pentima over 1 year ago
Thanks for the pointer. I've set it temporarily to use Docker and ran a test WF; this is the crunchstat.txt contents:
2023-08-10T17:44:42.864838496Z notice: reading stats from /sys/fs/cgroup/system.slice/docker-c7d2e9d0202cadbc64fbeb59416280e64bbce207823fc72443b8149b8e190135.scope/cpu.stat
2023-08-10T17:44:42.864871357Z notice: reading stats from /sys/fs/cgroup/system.slice/docker-c7d2e9d0202cadbc64fbeb59416280e64bbce207823fc72443b8149b8e190135.scope/io.stat
2023-08-10T17:44:42.864909464Z notice: reading stats from /sys/fs/cgroup/system.slice/docker-c7d2e9d0202cadbc64fbeb59416280e64bbce207823fc72443b8149b8e190135.scope/memory.stat
2023-08-10T17:44:42.864929381Z notice: reading stats from /sys/fs/cgroup/system.slice/docker-c7d2e9d0202cadbc64fbeb59416280e64bbce207823fc72443b8149b8e190135.scope/memory.current
2023-08-10T17:44:42.864946680Z notice: reading stats from /sys/fs/cgroup/system.slice/docker-c7d2e9d0202cadbc64fbeb59416280e64bbce207823fc72443b8149b8e190135.scope/memory.swap.current
2023-08-10T17:44:42.865057162Z using /proc/3622/net/dev
2023-08-10T17:44:42.865064953Z notice: monitoring temp dir /tmp/crunch-run.tordo-dz642-woqmi5ntdng12pt.3856672971
2023-08-10T17:44:42.865184004Z mem 0 swap 0 pgmajfault 974848 rss
2023-08-10T17:44:42.866161418Z cpu 0.0295 user 0.0088 sys 0 cpus
2023-08-10T17:44:42.866207498Z blkio:259:4 0 write 167936 read
2023-08-10T17:44:42.866212185Z blkio:254:0 0 write 167936 read
2023-08-10T17:44:42.866241993Z net:eth0 0 tx 340 rx
2023-08-10T17:44:42.866255312Z statfs 199043186688 available 440197120 used 210237366272 total
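As a sanity check (not part of the original notes), the same v2 accounting file crunchstat reads for CPU can be inspected by hand; cpu.stat exposes cumulative usage_usec / user_usec / system_usec counters, which should line up with the user/sys figures above:
# Cumulative CPU accounting for the container's scope; values elided here.
$ cat /sys/fs/cgroup/system.slice/docker-c7d2e9d0202cadbc64fbeb59416280e64bbce207823fc72443b8149b8e190135.scope/cpu.stat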
Updated by Tom Clegg over 1 year ago
Hm, "0 cpus" doesn't look right.
2023-08-10T17:44:42.866161418Z cpu 0.0295 user 0.0088 sys 0 cpus
In #17244#note-32 (before updating the image) we had this line
2023-08-08T16:50:58.455605354Z notice: reading stats from /sys/fs/cgroup/system.slice/docker-d5b44b0d9c1178e64c87f7a90df8da5ace1c46e203022fe27de1f130017b62bc.scope/cpuset.cpus.effective
I don't see a corresponding cpuset.cpus.effective line in #note-7.
I wonder if we need cgroup_enable=memory cgroup_enable=cpuset ...? It's weird/annoying that these defaults seem so unpredictable.
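One way to narrow this down (a debugging sketch, not something run as part of this ticket) would be to check whether the cpuset controller is enabled and delegated down to the container's scope on the new image:
# Controllers enabled for delegation at the root and in system.slice:
$ cat /sys/fs/cgroup/cgroup.subtree_control
$ cat /sys/fs/cgroup/system.slice/cgroup.subtree_control
# If cpuset appears in both, the docker scope should expose cpuset.* files,
# including cpuset.cpus.effective (scope name below is a placeholder):
$ ls /sys/fs/cgroup/system.slice/docker-<container-id>.scope/cpuset.*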
Updated by Lucas Di Pentima over 1 year ago
I've just launched a test instance with the latest image, and got this:
$ cat /proc/cgroups
#subsys_name	hierarchy	num_cgroups	enabled
cpuset	0	100	1
cpu	0	100	1
cpuacct	0	100	1
blkio	0	100	1
memory	0	100	1
devices	0	100	1
freezer	0	100	1
net_cls	0	100	1
perf_event	0	100	1
net_prio	0	100	1
pids	0	100	1
rdma	0	100	1
The plot thickens...
Updated by Lucas Di Pentima over 1 year ago
- Remaining (hours) set to 0.0
- Status changed from In Progress to Resolved