Feature #12746
closed
[crunch2] Add I/O (and other?) stats to crunch-run
Added by Tom Morris almost 7 years ago.
Updated almost 7 years ago.
Description
It turns out that the missing net:eth0 stats are due to an architectural change between crunch1 and crunch2.
In crunch1 these stats were useful for monitoring the I/O bandwidth during the upload phase when arv-put was run in the container to do the upload.
In crunch2, the uploads are done outside the container by the new crunch2 crunch-run component, which doesn't log any stats during the upload.
Crunch2 needs to provide equivalent stats to Crunch1 during the upload phase. The network bandwidth stats are the most important, but memory and CPU stats would be useful as well.
It would also be useful to have these stats during the Docker image download phase at the beginning of the job.
- Subject changed from [crunch2] crunchstats for net:eth0 unreliably reported to [crunch2] Add I/O (and other?) stats to crunch-run
- Description updated (diff)
- Target version set to To Be Groomed
- Subject changed from [crunch2] Add I/O (and other?) stats to crunch-run to [crunch2] Add I/O (and other?) stats to crunch-run as well as instance type
- Description updated (diff)
- Subject changed from [crunch2] Add I/O (and other?) stats to crunch-run as well as instance type to [crunch2] Add I/O (and other?) stats to crunch-run
- Description updated (diff)
- Related to Bug #12933: [crunch2] add equivalent of cloud_node line added
- Related to Feature #7845: [Crunch] Crunchstat and arv-mount print final stats before exiting added
Run two crunchstat reporters:
- Before loading the Docker image, initialize crunchstat reporter using /proc/self/cgroup
- Right before starting the container, start a separate crunchstat reporter for the container
crunch-run stats should be logged in the same format as crunchstat.txt, but in a separate file.
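As a rough illustration of the /proc/self/cgroup step above, the sketch below maps each cgroup v1 controller to the cgroup the process belongs to, then builds the path of a stats file under the conventional /sys/fs/cgroup mount point. The function names and the sample cgroup paths are hypothetical; this is not crunch-run's actual code, and it assumes each controller is reachable at /sys/fs/cgroup/<controller>.

```python
import posixpath


def parse_proc_cgroup(text):
    """Map each cgroup v1 controller name to its cgroup path.

    Each line of /proc/<pid>/cgroup has the form
    "hierarchy-ID:controller-list:cgroup-path".
    """
    paths = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first two colons only, in case the path contains colons.
        _hierarchy_id, controllers, path = line.split(":", 2)
        for controller in controllers.split(","):
            if controller:  # cgroup v2 entries have an empty controller list
                paths[controller] = path
    return paths


def stat_file(controller, cgroup_path, filename, root="/sys/fs/cgroup"):
    """Build a stats file path (assumes the conventional v1 mount layout)."""
    return posixpath.join(root, controller, cgroup_path.lstrip("/"), filename)


if __name__ == "__main__":
    with open("/proc/self/cgroup") as f:
        for controller, path in sorted(parse_proc_cgroup(f.read()).items()):
            print(controller, path)
```

When the process is in the root cgroup (path "/"), stat_file("memory", "/", "memory.stat") yields /sys/fs/cgroup/memory/memory.stat; when a resource manager has placed it in its own cgroup, the path points at that cgroup's file instead.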
- Target version changed from To Be Groomed to 2018-01-31 Sprint
- Story points set to 1.0
- Assigned To set to Tom Clegg
This adds hoststat.txt with cgroup stats for the host:
- begins just before loading the docker image
- ends just after capturing the output directory
- should be interpreted with care -- e.g., usage will include network activity already reported by arv-mount's crunchstat-like logging, container network/cpu/disk activity already reported in crunchstat.txt, network/cpu/disk activity in other containers that happen to be running at the same time, system processes, and so on.
hoststat.txt from test job 9tee4-xvhdp-yxuddtvh28tnqfh:
2018-01-26T21:34:13.814140038Z notice: reading stats from /sys/fs/cgroup/cpuacct/cgroup.procs
2018-01-26T21:34:13.814222830Z notice: reading stats from /sys/fs/cgroup/memory/memory.stat
2018-01-26T21:34:13.815314264Z mem 3798679552 cache 0 swap 2131 pgmajfault 69824512 rss
2018-01-26T21:34:13.815349533Z notice: reading stats from /sys/fs/cgroup/cpuacct/cpuacct.stat
2018-01-26T21:34:13.815430863Z notice: reading stats from /sys/fs/cgroup/cpuset/cpuset.cpus
2018-01-26T21:34:13.815465793Z cpu 1422.3800 user 1116.2800 sys 20 cpus
2018-01-26T21:34:13.815504021Z notice: reading stats from /sys/fs/cgroup/blkio/blkio.io_service_bytes
2018-01-26T21:34:13.815940563Z net:eth0 23457684 tx 1264404774 rx
2018-01-26T21:34:13.815955985Z net:docker0 799945 tx 64269 rx
2018-01-26T21:34:23.815462292Z mem 3798773760 cache 0 swap 2131 pgmajfault 71196672 rss
2018-01-26T21:34:23.815642979Z cpu 1422.8000 user 1116.6900 sys 20 cpus -- interval 10.0001 seconds 0.4200 user 0.4100 sys
2018-01-26T21:34:23.816124339Z net:eth0 23508607 tx 1264657443 rx -- interval 10.0001 seconds 50923 tx 252669 rx
2018-01-26T21:34:23.816143048Z net:docker0 799945 tx 64269 rx -- interval 10.0001 seconds 0 tx 0 rx
2018-01-26T21:34:33.815482382Z mem 3798777856 cache 0 swap 2131 pgmajfault 71184384 rss
2018-01-26T21:34:33.815758501Z cpu 1422.8000 user 1116.7100 sys 20 cpus -- interval 10.0001 seconds 0.0000 user 0.0200 sys
2018-01-26T21:34:33.816243868Z net:eth0 23514276 tx 1264673257 rx -- interval 10.0001 seconds 5669 tx 15814 rx
2018-01-26T21:34:33.816263640Z net:docker0 799945 tx 64269 rx -- interval 10.0001 seconds 0 tx 0 rx
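For anyone wanting to turn these lines into bandwidth figures, here is a minimal parsing sketch. It assumes the layout visible in the sample above (timestamp, category, "value name" pairs, then an optional "-- interval N seconds" section with per-interval deltas); the function names are hypothetical and this is not part of crunchstat itself.

```python
def _pairs(tokens):
    """Turn ["50923", "tx", "252669", "rx"] into {"tx": 50923.0, "rx": 252669.0}."""
    return {name: float(value) for value, name in zip(tokens[0::2], tokens[1::2])}


def parse_stat_line(line):
    """Parse one sample line; return None for "notice:" lines."""
    timestamp, category, rest = line.strip().split(" ", 2)
    if category == "notice:":
        return None
    cumulative_part, _, interval_part = rest.partition(" -- interval ")
    record = {
        "timestamp": timestamp,
        "category": category,
        "cumulative": _pairs(cumulative_part.split()),
        "interval_seconds": None,
        "interval": {},
    }
    if interval_part:
        tokens = interval_part.split()
        # tokens look like ["10.0001", "seconds", "50923", "tx", "252669", "rx"]
        record["interval_seconds"] = float(tokens[0])
        record["interval"] = _pairs(tokens[2:])
    return record


def rates(record):
    """Per-second rates for the interval section; empty if there is none."""
    seconds = record["interval_seconds"]
    if not seconds:
        return {}
    return {name: delta / seconds for name, delta in record["interval"].items()}
```

Applied to the second net:eth0 line above (252669 bytes received in 10.0001 seconds), rates() gives roughly 25 kB/s received, which is the kind of number needed to judge upload and download bandwidth.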
12746-crunch2-hoststat @ 1b7a6c0ca4fa348c313a0862cfca597319cfe08f
Another improvement (more useful in most cases) would be to put crunch-run, arv-mount, etc. in a separate cgroup, and report stats for that cgroup. I went ahead with hoststat.txt anyway because:
- it's easy/quick to implement, so we can start looking at stats right away
- even when we do start reporting crunch-run/arv-mount/etc separately, the whole-host stats will continue to be useful: all activity on the host is potentially relevant to performance analysis even if it's not under crunch-run's control.
- Status changed from New to In Progress
- Tracker changed from Bug to Feature
What's the rationale for using the root cgroup instead of /proc/self/cgroup? When crunch-run doesn't have its own cgroup, the two are the same. When it does have one (e.g., with the slurm cgroup plugin), /proc/self/cgroup would capture stats for crunch-run + arv-mount + the container, but not the whole system. (I suppose "has predictable behavior" is a reasonable rationale.)
Otherwise LGTM.
Why did this get changed from a bug to a feature? Isn't it a regression from crunch1 to crunch2?
On the I/O logging front, it appears that the naive interpretation of "Uploading foo.txt (100 bytes)" followed by "Uploading bar.txt (200 bytes)", i.e., that the first upload has finished, may not be true because uploads are parallelized. If that's the case, we need "Done uploading ..." messages logged too, preferably with elapsed time and/or average throughput.
Since it's only tangentially related, I'll create a new ticket for it.
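To make the request concrete, a completion message along these lines would do. The wording, field order, and function name here are purely hypothetical, not an existing crunch-run log format.

```python
def done_message(name, size_bytes, elapsed_seconds):
    """Format a hypothetical "Done uploading" log line with average throughput."""
    if elapsed_seconds > 0:
        throughput = "%.1f bytes/s" % (size_bytes / elapsed_seconds)
    else:
        # Avoid dividing by zero for uploads that finish within clock resolution.
        throughput = "n/a"
    return "Done uploading %s (%d bytes) in %.3f s, average %s" % (
        name, size_bytes, elapsed_seconds, throughput)
```

With one such line per file, overlapping uploads can be told apart even when the "Uploading" lines are interleaved.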
- Status changed from In Progress to Resolved