Feature #10218
closed
[Crunch2] Gather and record cloud/physical node information for each container
Description
Crunch2 already computes the runtime_constraints for a container based on defaults and what is in the corresponding container_request object.
It would be useful to have more detail about the physical or cloud node that executes the container. Specifically, in the cloud, it would be helpful to know whether the node was, for example, a Dv2 node or an A node, because those cores are not equal.
In an on-prem environment (as well as in the cloud, actually), it would be useful to know the cpu model.
Maybe crunch-run can record some basic facts about the hardware it's running on and store them in the container object for future inspection.
Things to log as a starting point:
crunch-run version
copy of the container record as "container.json"
$ uname -a
$ cat /proc/cpuinfo
$ cat /proc/meminfo
$ df # for available scratch space
contents of /var/tmp/arv/node/data
Logged using standard logging structure a la crunchstat, etc.
Updated by Ward Vandewege about 8 years ago
- Tracker changed from Bug to Feature
- Description updated (diff)
Updated by Tom Clegg about 8 years ago
In crunch-run, if $SLURM_STEP_NODELIST is set, log the output of
sinfo --long --Node --exact --nodes $SLURM_STEP_NODELIST
Example:
Wed Oct 12 16:47:26 2016
NODELIST  NODES  PARTITION  STATE      CPUS  S:C:T   MEMORY  TMP_DISK  WEIGHT  FEATURES  REASON
compute0  1      compute*   allocated  20    1:20:1  32177   0         32177   (null)    none
compute0  1      crypto     allocated  20    1:20:1  32177   0         32177   (null)    none
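A minimal Go sketch of the check described above, assuming nothing about how crunch-run actually wires it in. The helper name sinfoArgs is hypothetical; only the environment variable and the sinfo flags come from this comment.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// sinfoArgs (hypothetical helper) returns the argument list for the sinfo
// invocation quoted above, or nil when nodelist is empty, i.e. when the
// process is not running inside a SLURM step.
func sinfoArgs(nodelist string) []string {
	if nodelist == "" {
		return nil
	}
	return []string{"--long", "--Node", "--exact", "--nodes", nodelist}
}

func main() {
	args := sinfoArgs(os.Getenv("SLURM_STEP_NODELIST"))
	if args == nil {
		fmt.Println("SLURM_STEP_NODELIST not set; skipping sinfo")
		return
	}
	// CombinedOutput captures stdout and stderr together, so any error
	// message from sinfo ends up next to its regular output.
	out, err := exec.Command("sinfo", args...).CombinedOutput()
	fmt.Print(string(out))
	if err != nil {
		fmt.Fprintln(os.Stderr, "sinfo failed:", err)
	}
}
```

Keeping the argument construction in its own function makes the "only when the variable is set" rule testable without a SLURM cluster.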
Updated by Peter Amstutz almost 8 years ago
Things to consider logging:
$ uname -a
$ cat /proc/cpuinfo
$ cat /proc/meminfo
Updated by Tom Morris over 7 years ago
- Target version changed from 2017-03-29 sprint to 2017-03-15 sprint
Updated by Tom Morris over 7 years ago
- Description updated (diff)
- Story points set to 1.0
Updated by Lucas Di Pentima over 7 years ago
- Assigned To changed from Peter Amstutz to Lucas Di Pentima
Updated by Lucas Di Pentima over 7 years ago
- Status changed from New to In Progress
Updated by Lucas Di Pentima over 7 years ago
- Target version changed from 2017-03-15 sprint to 2017-03-29 sprint
Updated by Lucas Di Pentima over 7 years ago
Updates in branch 10218-record-node-info at b30e81e
Test run: https://ci.curoverse.com/job/developer-run-tests/194/
Logged output of the following commands:
cat /proc/cpuinfo
cat /proc/meminfo
df -m
uname -a
...to the node-info.log file inside the log collection.
Also, the container record is being written as an indented JSON file, container.json, inside the log collection.
Updated by Tom Clegg over 7 years ago
Is it necessary to split JSON into multiple lines rather than just writing it as one chunk? If we write it as one chunk, the test could verify that it's valid JSON. Could use something like
enc := json.NewEncoder(logger)
enc.SetIndent("", " ")
err := enc.Encode(runner.Container)
I think it would be better to do a separate API call to fetch the container record into a map[string]interface{}, rather than using the runner.Container object. As a debugging device, seeing the whole record as delivered by the API server might be more useful than seeing only the parts that the runner.Container object knows how to load.
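To illustrate the two suggestions above (write the record as one chunk, and hold it in a map[string]interface{} so every field is kept), here is a self-contained sketch. The names indentedJSON and isValidJSON and the sample record are hypothetical; the record here is built by hand, not fetched from an API server.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// indentedJSON encodes the whole record in a single Encode call, so the
// result is one chunk that a test can check for validity.
func indentedJSON(record map[string]interface{}) (string, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	enc.SetIndent("", "  ")
	if err := enc.Encode(record); err != nil {
		return "", err
	}
	return buf.String(), nil
}

// isValidJSON reports whether s is a syntactically valid JSON document.
func isValidJSON(s string) bool {
	return json.Valid([]byte(s))
}

func main() {
	// Hypothetical record; a generic map keeps fields that a typed struct
	// would silently drop.
	record := map[string]interface{}{
		"uuid":                "example-uuid",
		"runtime_constraints": map[string]interface{}{"vcpus": 1},
	}
	out, err := indentedJSON(record)
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	fmt.Print(out)
	fmt.Println("valid JSON:", isValidJSON(out))
}
```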
I seem to have left a debug printf in my commit 386faadf691e444b71d6c96e7c00792d9a0ba2c7 (oops)
Updated by Lucas Di Pentima over 7 years ago
Updates at 5976c75
Test run: https://ci.curoverse.com/job/developer-run-tests/200/
Now the whole container record is fetched from the API server and saved to the log collection as indented JSON.
Updated tests to reflect the addition of the CallRaw() func to the IArvadosClient interface.
Also, removed the debug printf.
Updated by Lucas Di Pentima over 7 years ago
More updates: a54e888
Test run: https://ci.curoverse.com/job/developer-run-tests/202/
Re-ran failed test: https://ci.curoverse.com/job/developer-run-tests-apps-workbench-functionals/203/
Added df command executions to record free space and free inodes of / and /tmp.
Combined stdout & stderr outputs when running the commands for better debugging.
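A rough sketch of the approach described in this update, not the code in the branch. The helper name runAndCapture is hypothetical, and the -i flag for inode counts is an assumption; the comment above only says that free inodes are recorded.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// runAndCapture (hypothetical helper) runs a command and returns its stdout
// and stderr interleaved in one string, so error messages appear in the log
// alongside the regular output.
func runAndCapture(name string, args ...string) (string, error) {
	out, err := exec.Command(name, args...).CombinedOutput()
	return string(out), err
}

func main() {
	commands := [][]string{
		{"cat", "/proc/cpuinfo"},
		{"cat", "/proc/meminfo"},
		{"df", "-m", "/", "/tmp"}, // free space
		{"df", "-i", "/", "/tmp"}, // free inodes (flag is an assumption)
		{"uname", "-a"},
	}
	for _, c := range commands {
		out, err := runAndCapture(c[0], c[1:]...)
		fmt.Printf("$ %s\n%s", strings.Join(c, " "), out)
		if err != nil {
			// Keep going: one failing command should not hide the rest.
			fmt.Printf("(command failed: %v)\n", err)
		}
	}
}
```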
Updated by Tom Clegg over 7 years ago
Any reason not to combine the paths like {"df", "-m", "/", os.TempDir()}?
LogContainerRecord() should call defer reader.Close() after checking errors from CallRaw.
the rest lgtm
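For reference, the ordering asked for in the second point can be sketched as follows. fetchRaw is a hypothetical stand-in for CallRaw (whose real signature is not reproduced here), and readRecord is likewise illustrative only.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

// fetchRaw is a hypothetical stand-in for an API call such as CallRaw: it
// returns a reader on success, or a nil reader plus an error on failure.
func fetchRaw(fail bool) (io.ReadCloser, error) {
	if fail {
		return nil, errors.New("simulated API error")
	}
	return io.NopCloser(strings.NewReader(`{"uuid":"example"}`)), nil
}

// readRecord checks the error first and only then defers Close. On failure
// the reader is nil, so deferring Close before the check can panic.
func readRecord(fail bool) (string, error) {
	reader, err := fetchRaw(fail)
	if err != nil {
		return "", err
	}
	defer reader.Close()
	body, err := io.ReadAll(reader)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

func main() {
	body, err := readRecord(false)
	fmt.Println(body, err)
	_, err = readRecord(true)
	fmt.Println(err)
}
```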
Updated by Lucas Di Pentima over 7 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:dc6c3fccb583ae98eee808addb526c45ebdbf2c6.