[crunch2] Add I/O (and other?) stats to crunch-run
It turns out that the missing net:eth0 stats are due to an architectural change between crunch1 and crunch2.
In crunch1 these stats were useful for monitoring the I/O bandwidth during the upload phase when arv-put was run in the container to do the upload.
In crunch2, the uploads are done outside the container by the new crunch2 crunch-run component, which doesn't log any stats during the upload.
Crunch2 needs to provide equivalent stats to Crunch1 during the upload phase. The network bandwidth stats are the most important, but memory and CPU stats would be useful as well.
It would also be useful to have these stats during the Docker image download phase at the beginning of the job.
#6 Updated by Peter Amstutz over 2 years ago
- Before loading the Docker image, initialize crunchstat reporter using /proc/self/cgroup
- Right before starting the container, start a separate crunchstat reporter for the container
crunch-run stats should be logged in the same format as crunchstat.txt, but in a separate file.
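As a rough illustration of the first step, the sketch below parses the contents of /proc/self/cgroup into a mapping from cgroup v1 subsystem to cgroup path. It is not crunch-run's actual code; the function name `parse_proc_cgroup` and the sample cgroup paths are made up for the example.

```python
def parse_proc_cgroup(text):
    """Map each cgroup v1 subsystem name to its cgroup path.

    Each line of /proc/<pid>/cgroup has the form
    "hierarchy-ID:controller-list:cgroup-path".
    """
    paths = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        _hierarchy_id, controllers, path = line.split(":", 2)
        # One hierarchy may carry several comma-separated controllers.
        for name in controllers.split(","):
            paths[name] = path
    return paths


# Made-up sample; on a real host, read the text from /proc/self/cgroup.
SAMPLE = """\
11:memory:/slurm/uid_1000/job_42
10:cpu,cpuacct:/slurm/uid_1000/job_42
9:blkio:/
"""

cgroups = parse_proc_cgroup(SAMPLE)
print(cgroups["memory"])
print(cgroups["cpuacct"])
```

A reporter started before the Docker image load could use such a mapping to decide which stats files to read for its own cgroup.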
#9 Updated by Tom Clegg over 2 years ago
- begins just before loading the docker image
- ends just after capturing the output directory
- should be interpreted with care -- e.g., usage will include network activity already reported by arv-mount's crunchstat-like logging, container network/cpu/disk activity already reported in crunchstat.txt, network/cpu/disk activity in other containers that happen to be running at the same time, system processes, and so on.
hoststat.txt from test job 9tee4-xvhdp-yxuddtvh28tnqfh:
2018-01-26T21:34:13.814140038Z notice: reading stats from /sys/fs/cgroup/cpuacct/cgroup.procs
2018-01-26T21:34:13.814222830Z notice: reading stats from /sys/fs/cgroup/memory/memory.stat
2018-01-26T21:34:13.815314264Z mem 3798679552 cache 0 swap 2131 pgmajfault 69824512 rss
2018-01-26T21:34:13.815349533Z notice: reading stats from /sys/fs/cgroup/cpuacct/cpuacct.stat
2018-01-26T21:34:13.815430863Z notice: reading stats from /sys/fs/cgroup/cpuset/cpuset.cpus
2018-01-26T21:34:13.815465793Z cpu 1422.3800 user 1116.2800 sys 20 cpus
2018-01-26T21:34:13.815504021Z notice: reading stats from /sys/fs/cgroup/blkio/blkio.io_service_bytes
2018-01-26T21:34:13.815940563Z net:eth0 23457684 tx 1264404774 rx
2018-01-26T21:34:13.815955985Z net:docker0 799945 tx 64269 rx
2018-01-26T21:34:23.815462292Z mem 3798773760 cache 0 swap 2131 pgmajfault 71196672 rss
2018-01-26T21:34:23.815642979Z cpu 1422.8000 user 1116.6900 sys 20 cpus -- interval 10.0001 seconds 0.4200 user 0.4100 sys
2018-01-26T21:34:23.816124339Z net:eth0 23508607 tx 1264657443 rx -- interval 10.0001 seconds 50923 tx 252669 rx
2018-01-26T21:34:23.816143048Z net:docker0 799945 tx 64269 rx -- interval 10.0001 seconds 0 tx 0 rx
2018-01-26T21:34:33.815482382Z mem 3798777856 cache 0 swap 2131 pgmajfault 71184384 rss
2018-01-26T21:34:33.815758501Z cpu 1422.8000 user 1116.7100 sys 20 cpus -- interval 10.0001 seconds 0.0000 user 0.0200 sys
2018-01-26T21:34:33.816243868Z net:eth0 23514276 tx 1264673257 rx -- interval 10.0001 seconds 5669 tx 15814 rx
2018-01-26T21:34:33.816263640Z net:docker0 799945 tx 64269 rx -- interval 10.0001 seconds 0 tx 0 rx
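To turn the interval figures in the log above into bandwidth numbers, something like the following sketch could be used. It assumes the tx/rx counters are bytes and that every interval line follows the layout shown in this sample; `NET_INTERVAL` and `net_rates` are names invented for the example, not part of crunchstat.

```python
import re

# Matches crunchstat-style network lines that carry an interval delta, e.g.
# "<timestamp> net:eth0 <tx> tx <rx> rx -- interval <s> seconds <dtx> tx <drx> rx"
NET_INTERVAL = re.compile(
    r"net:(?P<iface>\S+) \d+ tx \d+ rx -- interval (?P<secs>[\d.]+) seconds "
    r"(?P<tx>\d+) tx (?P<rx>\d+) rx"
)


def net_rates(line):
    """Return (interface, tx per second, rx per second), or None if the
    line has no interval section (as with the first sample of each stat)."""
    m = NET_INTERVAL.search(line)
    if m is None:
        return None
    secs = float(m.group("secs"))
    return m.group("iface"), int(m.group("tx")) / secs, int(m.group("rx")) / secs


line = ("2018-01-26T21:34:23.816124339Z net:eth0 23508607 tx 1264657443 rx "
        "-- interval 10.0001 seconds 50923 tx 252669 rx")
iface, tx_rate, rx_rate = net_rates(line)
print(f"{iface}: {tx_rate:.0f} B/s tx, {rx_rate:.0f} B/s rx")
```

The per-interval deltas are used rather than the cumulative counters, since the cumulative values include traffic from before the reporter started.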
12746-crunch2-hoststat @ 1b7a6c0ca4fa348c313a0862cfca597319cfe08f
Another improvement (more useful in most cases) would be to put crunch-run, arv-mount, etc. in a separate cgroup, and report stats for that cgroup. I went ahead with hoststat.txt anyway because:
- it's easy/quick to implement, so we can start looking at stats right away
- even when we do start reporting crunch-run/arv-mount/etc separately, the whole-host stats will continue to be useful: all activity on the host is potentially relevant to performance analysis even if it's not under crunch-run's control.
#11 Updated by Peter Amstutz over 2 years ago
- Status changed from New to In Progress
- Tracker changed from Bug to Feature
What's the rationale for using the root cgroup instead of /proc/self/cgroup? When crunch-run doesn't have its own cgroup, the two are the same. When it does have one (e.g. via the slurm cgroup plugin), /proc/self/cgroup will capture stats for crunch-run + arv-mount + the container, but not the whole system. (I suppose "has predictable behavior" is a reasonable rationale.)
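The difference between the two choices can be sketched as a path computation. This assumes cgroup v1 mounted at /sys/fs/cgroup, as in the hoststat.txt sample; `stat_file` and the slurm-style cgroup path are hypothetical names used only for illustration.

```python
import posixpath

CGROUP_ROOT = "/sys/fs/cgroup"  # conventional cgroup v1 mount point (assumed)


def stat_file(subsystem, filename, cgroup_path="/"):
    """Build the path of a cgroup v1 stats file.

    cgroup_path="/" selects the root cgroup (whole host); any other value
    selects the stats for that cgroup and its children only.
    """
    return posixpath.normpath(
        posixpath.join(CGROUP_ROOT, subsystem, cgroup_path.lstrip("/"), filename)
    )


# Root cgroup: whole-host stats.
root = stat_file("memory", "memory.stat")
# Hypothetical cgroup path, as /proc/self/cgroup might report it under slurm.
own = stat_file("memory", "memory.stat", "/slurm/uid_1000/job_42")
print(root)
print(own)
```

When /proc/self/cgroup reports "/" for a subsystem, both calls resolve to the same file, which is the case where the two approaches coincide.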
#12 Updated by Tom Morris over 2 years ago
Why did this get changed from a bug to a feature? Isn't it a regression from crunch1 to crunch2?
On the I/O logging front, it appears that the naive interpretation of "Uploading foo.txt (100 bytes)" followed by "Uploading bar.txt (200 bytes)", i.e. that the first upload has finished, may not be true due to parallelization of uploads. If that's the case, we need "Done uploading ..." messages logged too, preferably with elapsed time and/or average throughput.
Since it's only tangentially related, I'll create a new ticket for it.
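For reference, the kind of "Done uploading ..." logging described above might look like the sketch below. It is only an illustration of the idea with parallel uploads; `upload_with_logging`, `done_message`, and `fake_upload` are invented names, and the exact message format is an assumption.

```python
import logging
import time
from concurrent.futures import ThreadPoolExecutor

log = logging.getLogger("upload-sketch")


def done_message(name, size, elapsed):
    """Format a completion line with elapsed time and average throughput."""
    rate = size / elapsed if elapsed > 0 else float("inf")
    return f"Done uploading {name} ({size} bytes) in {elapsed:.3f}s, {rate:.0f} bytes/s"


def upload_with_logging(name, size, do_upload):
    """Log a start line, run the upload, then log a matching completion line."""
    log.info("Uploading %s (%d bytes)", name, size)
    start = time.monotonic()
    do_upload(name)
    msg = done_message(name, size, time.monotonic() - start)
    log.info(msg)
    return msg


def fake_upload(name):
    # Hypothetical stand-in for the real transfer.
    time.sleep(0.01)


files = [("foo.txt", 100), ("bar.txt", 200)]
# Uploads run in parallel, so start lines alone do not show which have finished.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(upload_with_logging, n, s, fake_upload) for n, s in files]
    results = [f.result() for f in futures]
```

Because each completion line names its file, the log stays unambiguous even when start and finish messages from different uploads interleave.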