Task #20835

closed

Bug #17244: Make sure cgroupsV2 works with Arvados

Update tordo compute image kernel config from "hybrid" to "unified" mode

Added by Tom Clegg 7 months ago. Updated 6 months ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Start date: 08/09/2023
Due date:
% Done: 0%
Estimated time:
Actions #1

Updated by Lucas Di Pentima 7 months ago

  • Start date set to 08/09/2023
  • Status changed from New to In Progress
Actions #2

Updated by Lucas Di Pentima 7 months ago

Updates at c56b8aaf7 - branch 20835-cgroupsv2-unified-mode
Compute image build pipeline for tordo: packer-build-compute-image: #231

  • Re-enables unified_cgroup_hierarchy in the GRUB config.
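
For anyone verifying the GRUB change on a booted instance, one option is to look at the kernel command line. The sketch below is only an illustration: it assumes the setting in question is the systemd kernel parameter systemd.unified_cgroup_hierarchy, and the function name and sample command line are hypothetical, not taken from the branch.

```python
def unified_hierarchy_setting(cmdline: str):
    """Return the value given for systemd.unified_cgroup_hierarchy, or None.

    Assumption: the GRUB change adds this systemd parameter to the kernel
    command line. If the parameter appears more than once, this sketch keeps
    the last occurrence.
    """
    value = None
    for token in cmdline.split():
        key, _, val = token.partition("=")
        if key == "systemd.unified_cgroup_hierarchy":
            value = val
    return value


if __name__ == "__main__":
    # Hypothetical command line; on a real instance read /proc/cmdline instead.
    sample = "BOOT_IMAGE=/boot/vmlinuz root=/dev/xvda1 ro systemd.unified_cgroup_hierarchy=1"
    print(unified_hierarchy_setting(sample))
```

On a real node the input would be the contents of /proc/cmdline.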
Actions #3

Updated by Lucas Di Pentima 7 months ago

The previous pipeline failed. I suspect it's related to golang changes, so I've rebased the branch to start from 0b296b5:

Updates at 5d5c219
New build pipeline: packer-build-compute-image: #232

Actions #4

Updated by Lucas Di Pentima 7 months ago

Built and tested the AMI ami-01cbd6fb77f29928f on an instance and got the following:

$ stat -fc %T /sys/fs/cgroup/
cgroup2fs
$ mount | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)

I'll deploy the saltstack changes so tordo can use it.
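
The check above could be scripted for future image builds. This is a rough sketch, assuming the usual `mount` output shape ("<source> on <mountpoint> type <fstype> (<options>)"); the function name and the mode labels it returns are made up for illustration.

```python
def cgroup_mode(mount_output: str) -> str:
    """Classify cgroup setup from `mount` output.

    Returns "unified" (cgroup2 only), "hybrid" (cgroup v1 and cgroup2 both
    mounted), "legacy" (cgroup v1 only), or "none" (no cgroup mounts seen).
    """
    has_v1 = False
    has_v2 = False
    for line in mount_output.splitlines():
        fields = line.split()
        # Skip lines that do not look like "<src> on <dir> type <fstype> (...)".
        if len(fields) < 5 or fields[1] != "on" or fields[3] != "type":
            continue
        fstype = fields[4]
        if fstype == "cgroup2":
            has_v2 = True
        elif fstype == "cgroup":
            has_v1 = True
    if has_v2 and has_v1:
        return "hybrid"
    if has_v2:
        return "unified"
    if has_v1:
        return "legacy"
    return "none"


if __name__ == "__main__":
    line = ("cgroup2 on /sys/fs/cgroup type cgroup2 "
            "(rw,nosuid,nodev,noexec,relatime,nsdelegate)")
    print(cgroup_mode(line))
```

Fed the mount line from the test instance, this reports "unified", which matches the cgroup2fs result from stat.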

Actions #5

Updated by Lucas Di Pentima 7 months ago

I've tried running a WF on tordo and got the following crunchstat.txt log:

2023-08-10T14:36:40.138803409Z warning: Pid() did not return a process ID after 10s (config error?) -- still waiting...
2023-08-10T14:41:19.987801929Z warning: Pid() never returned a process ID

Does this mean that cgroupsv2 is improperly set up?

Actions #6

Updated by Tom Clegg 7 months ago

This probably just means tordo is using singularity.

I suppose it would be much kinder to either
  • change that message to "resource usage tracking is not supported when using the singularity runtime", or
  • do #20756
Actions #7

Updated by Lucas Di Pentima 7 months ago

Thanks for the pointer. I've temporarily set it to use Docker and ran a test WF. These are the crunchstat.txt contents:

2023-08-10T17:44:42.864838496Z notice: reading stats from /sys/fs/cgroup/system.slice/docker-c7d2e9d0202cadbc64fbeb59416280e64bbce207823fc72443b8149b8e190135.scope/cpu.stat
2023-08-10T17:44:42.864871357Z notice: reading stats from /sys/fs/cgroup/system.slice/docker-c7d2e9d0202cadbc64fbeb59416280e64bbce207823fc72443b8149b8e190135.scope/io.stat
2023-08-10T17:44:42.864909464Z notice: reading stats from /sys/fs/cgroup/system.slice/docker-c7d2e9d0202cadbc64fbeb59416280e64bbce207823fc72443b8149b8e190135.scope/memory.stat
2023-08-10T17:44:42.864929381Z notice: reading stats from /sys/fs/cgroup/system.slice/docker-c7d2e9d0202cadbc64fbeb59416280e64bbce207823fc72443b8149b8e190135.scope/memory.current
2023-08-10T17:44:42.864946680Z notice: reading stats from /sys/fs/cgroup/system.slice/docker-c7d2e9d0202cadbc64fbeb59416280e64bbce207823fc72443b8149b8e190135.scope/memory.swap.current
2023-08-10T17:44:42.865057162Z using /proc/3622/net/dev
2023-08-10T17:44:42.865064953Z notice: monitoring temp dir /tmp/crunch-run.tordo-dz642-woqmi5ntdng12pt.3856672971
2023-08-10T17:44:42.865184004Z mem 0 swap 0 pgmajfault 974848 rss
2023-08-10T17:44:42.866161418Z cpu 0.0295 user 0.0088 sys 0 cpus
2023-08-10T17:44:42.866207498Z blkio:259:4 0 write 167936 read
2023-08-10T17:44:42.866212185Z blkio:254:0 0 write 167936 read
2023-08-10T17:44:42.866241993Z net:eth0 0 tx 340 rx
2023-08-10T17:44:42.866255312Z statfs 199043186688 available 440197120 used 210237366272 total
Actions #8

Updated by Tom Clegg 7 months ago

Hm, "0 cpus" doesn't look right.

2023-08-10T17:44:42.866161418Z cpu 0.0295 user 0.0088 sys 0 cpus

In #17244#note-32 (before updating the image) we had this line

2023-08-08T16:50:58.455605354Z notice: reading stats from /sys/fs/cgroup/system.slice/docker-d5b44b0d9c1178e64c87f7a90df8da5ace1c46e203022fe27de1f130017b62bc.scope/cpuset.cpus.effective

I don't see a corresponding cpuset.cpus.effective line in #note-7.

I wonder if we need cgroup_enable=memory cgroup_enable=cpuset ...? It's weird/annoying that these defaults seem so unpredictable.
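
For reference, cpuset.cpus.effective holds a CPU list such as "0-3" or "0-1,4". The sketch below shows how such a list maps to a CPU count; it is an assumption on my part that the "cpus" figure in crunchstat is derived this way, and the function name is hypothetical. An empty or missing file would give 0 under this logic.

```python
def count_cpus(cpu_list: str) -> int:
    """Count CPUs in a kernel CPU list string such as "0-3,8"."""
    total = 0
    for part in cpu_list.strip().split(","):
        if not part:
            # Empty input (or stray commas) contributes no CPUs.
            continue
        low, sep, high = part.partition("-")
        if sep:
            # Inclusive range, e.g. "0-3" covers 4 CPUs.
            total += int(high) - int(low) + 1
        else:
            total += 1
    return total


if __name__ == "__main__":
    for sample in ("0-3", "0-1,4", ""):
        print(repr(sample), count_cpus(sample))
```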

Actions #9

Updated by Lucas Di Pentima 7 months ago

I've just launched a test instance with the latest image, and got this:

$ cat /proc/cgroups
#subsys_name    hierarchy    num_cgroups    enabled
cpuset    0    100    1
cpu    0    100    1
cpuacct    0    100    1
blkio    0    100    1
memory    0    100    1
devices    0    100    1
freezer    0    100    1
net_cls    0    100    1
perf_event    0    100    1
net_prio    0    100    1
pids    0    100    1
rdma    0    100    1

The plot thickens...
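
To make the table easier to compare across images, here is a small parsing sketch (function name hypothetical). As far as I understand cgroups(7), a hierarchy ID of 0 means the controller is not attached to any v1 hierarchy, which is what one would expect in unified mode, so this output says cpuset is enabled in the kernel but does not by itself explain the missing cpuset.cpus.effective line.

```python
def parse_proc_cgroups(text: str) -> dict:
    """Parse /proc/cgroups content into {controller: {...}}."""
    controllers = {}
    for line in text.splitlines():
        # Skip blank lines and the "#subsys_name ..." header.
        if not line.strip() or line.startswith("#"):
            continue
        name, hierarchy, num_cgroups, enabled = line.split()
        controllers[name] = {
            "hierarchy": int(hierarchy),
            "num_cgroups": int(num_cgroups),
            "enabled": enabled == "1",
        }
    return controllers


if __name__ == "__main__":
    sample = (
        "#subsys_name    hierarchy    num_cgroups    enabled\n"
        "cpuset    0    100    1\n"
        "memory    0    100    1\n"
    )
    parsed = parse_proc_cgroups(sample)
    # List controllers that are enabled but not attached to a v1 hierarchy.
    print([n for n, c in parsed.items() if c["enabled"] and c["hierarchy"] == 0])
```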

Actions #10

Updated by Lucas Di Pentima 6 months ago

  • Remaining (hours) set to 0.0
  • Status changed from In Progress to Resolved