Feature #10218

[Crunch2] Gather and record cloud/physical node information for each container

Added by Ward Vandewege about 1 year ago. Updated 8 months ago.

Status: Resolved
Start date: 03/20/2017
Priority: Normal
Due date: -
Assignee: Lucas Di Pentima
% Done: 100%
Category: -
Target version: 2017-03-29 sprint
Story points: 1.0
Remaining (hours): 0.00 hour
Velocity based estimate: 0 days
Release: Crunch v2

Description

Crunch2 already computes the runtime_constraints for a container based on defaults and what is in the corresponding container_request object.

It would be useful to have more detail about the physical or cloud node that executes the container. Specifically, in the cloud, it would be helpful to know whether this was, for example, a Dv2 node or an A node, because cores on those machine types are not equal.

In an on-prem environment (as well as in the cloud, actually), it would be useful to know the CPU model.

Maybe crunch-run can record some basic facts about the hardware it's running on and store those in the container object for future inspection.

Things to log as a starting point:

crunch-run version

copy of the container record as "container.json"

$ uname -a
$ cat /proc/cpuinfo
$ cat /proc/meminfo
$ df # for available scratch space

contents of /var/tmp/arv/node/data

Logged using standard logging structure a la crunchstat, etc.


Subtasks

Task #11199: Review 10218-record-node-info (Resolved, Tom Clegg)


Related issues

Related to Arvados - Story #7711: [Node Manager] Record a node's cloud cost information in ... (Resolved, 11/03/2015)
Related to Arvados - Feature #10217: [Crunch1] Log the node properties at the start of a job (New)
Duplicates Arvados - Feature #8196: [Crunch] Jobs should log the hardware of their compute no... (Duplicate, 01/12/2016)

Associated revisions

Revision dc6c3fcc
Added by Lucas Di Pentima 8 months ago

Merge branch '10218-record-node-info'
Closes #10218

History

#1 Updated by Ward Vandewege about 1 year ago

  • Tracker changed from Bug to Feature
  • Description updated (diff)

#2 Updated by Tom Clegg about 1 year ago

In crunch-run, if $SLURM_STEP_NODELIST is set, log the output of

sinfo --long --Node --exact --nodes $SLURM_STEP_NODELIST

Example:

Wed Oct 12 16:47:26 2016
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT FEATURES REASON
compute0 1 compute* allocated 20 1:20:1 32177 0 32177 (null) none
compute0 1 crypto allocated 20 1:20:1 32177 0 32177 (null) none
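The conditional `sinfo` call described above could look like the following sketch; `sinfoCommand` is a hypothetical helper name, and only the command-line flags come from the comment itself:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// sinfoCommand builds the sinfo invocation suggested above. It should only
// be run when SLURM_STEP_NODELIST is set, i.e. inside a SLURM job step.
func sinfoCommand(nodelist string) *exec.Cmd {
	return exec.Command("sinfo", "--long", "--Node", "--exact", "--nodes", nodelist)
}

func main() {
	nodelist := os.Getenv("SLURM_STEP_NODELIST")
	if nodelist == "" {
		return // not running under SLURM; nothing to log
	}
	out, err := sinfoCommand(nodelist).CombinedOutput()
	if err != nil {
		fmt.Fprintf(os.Stderr, "sinfo: %v\n", err)
	}
	os.Stdout.Write(out)
}
```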

#3 Updated by Peter Amstutz 9 months ago

Things to consider logging:

$ uname -a
$ cat /proc/cpuinfo
$ cat /proc/meminfo

#4 Updated by Tom Morris 9 months ago

  • Target version set to 2017-03-29 sprint

#5 Updated by Tom Morris 9 months ago

  • Target version changed from 2017-03-29 sprint to 2017-03-15 sprint

#6 Updated by Tom Morris 9 months ago

  • Description updated (diff)
  • Story points set to 1.0

#7 Updated by Peter Amstutz 9 months ago

  • Description updated (diff)

#8 Updated by Peter Amstutz 9 months ago

  • Description updated (diff)

#9 Updated by Peter Amstutz 9 months ago

  • Assignee set to Peter Amstutz

#10 Updated by Lucas Di Pentima 9 months ago

  • Assignee changed from Peter Amstutz to Lucas Di Pentima

#11 Updated by Lucas Di Pentima 9 months ago

  • Status changed from New to In Progress

#12 Updated by Lucas Di Pentima 8 months ago

  • Target version changed from 2017-03-15 sprint to 2017-03-29 sprint

#13 Updated by Lucas Di Pentima 8 months ago

Updates in branch 10218-record-node-info at b30e81e
Test run: https://ci.curoverse.com/job/developer-run-tests/194/

Logged output of the following commands:

  • cat /proc/cpuinfo
  • cat /proc/meminfo
  • df -m
  • uname -a

...all written to the node-info.log file inside the log collection.

Also, the container record is written as indented JSON to container.json inside the log collection.

#14 Updated by Tom Clegg 8 months ago

Is it necessary to split JSON into multiple lines rather than just writing it as one chunk? If we write it as one chunk, the test could verify that it's valid JSON. Could use something like

enc := json.NewEncoder(logger)
enc.SetIndent("", "    ")
err := enc.Encode(runner.Container)

I think it would be better to do a separate API call to fetch the container record into a map[string]interface{}, rather than using the runner.Container object. As a debugging device, seeing the whole record as delivered by the API server might be more useful than seeing only the parts that the runner.Container object knows how to load.

I seem to have left a debug printf in my commit 386faadf691e444b71d6c96e7c00792d9a0ba2c7 (oops)

#15 Updated by Lucas Di Pentima 8 months ago

Updates at 5976c75
Test run: https://ci.curoverse.com/job/developer-run-tests/200/

Now the whole container record is fetched from the API server and saved in the log collection as indented JSON.
Updated tests to reflect the addition of the CallRaw() func to the IArvadosClient interface.
Also removed the debug printf.

#16 Updated by Lucas Di Pentima 8 months ago

More updates: a54e888
Test run: https://ci.curoverse.com/job/developer-run-tests/202/
Re-ran failed test: https://ci.curoverse.com/job/developer-run-tests-apps-workbench-functionals/203/

Added df command executions to record free space and free inodes on / and /tmp.
Combined stdout & stderr output when running the commands for easier debugging.

#17 Updated by Tom Clegg 8 months ago

Any reason not to combine the paths like {"df", "-m", "/", os.TempDir()}?

LogContainerRecord() should call defer reader.Close() after checking errors from CallRaw.

the rest lgtm
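The combined-paths suggestion from this comment could be sketched as follows; `dfCommand` is a hypothetical helper name, and only the argument list `{"df", "-m", "/", os.TempDir()}` comes from the comment:

```go
package main

import (
	"os"
	"os/exec"
)

// dfCommand covers both filesystems with a single df invocation, as
// suggested above. os.TempDir() is normally /tmp on a compute node.
func dfCommand() *exec.Cmd {
	return exec.Command("df", "-m", "/", os.TempDir())
}

func main() {
	cmd := dfCommand()
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stdout // combine streams, as in note-16
	cmd.Run()
}
```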

#18 Updated by Lucas Di Pentima 8 months ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:dc6c3fccb583ae98eee808addb526c45ebdbf2c6.
