Feature #10218

[Crunch2] Gather and record cloud/physical node information for each container

Added by Ward Vandewege over 1 year ago. Updated about 1 year ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
03/20/2017
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
1.0
Release:
Release relationship:
Auto

Description

Crunch2 already computes the runtime_constraints for a container based on defaults and what is in the corresponding container_request object.

It would be useful to have more detail about the physical or cloud node that executes the container. Specifically, in the cloud, it would be helpful to know whether this was a Dv2 node or an A node, for example, because those cores are not equal.

In an on-prem environment (as well as in the cloud, actually), it would be useful to know the CPU model.

Maybe crunch-run can gather some basic facts about the hardware it's running on and record them in the container object for future inspection.

Things to log as a starting point:

crunch-run version

copy of the container record as "container.json"

$ uname -a
$ cat /proc/cpuinfo
$ cat /proc/meminfo
$ df # for available scratch space

contents of /var/tmp/arv/node/data

Logged using standard logging structure a la crunchstat, etc.


Subtasks

Task #11199: Review 10218-record-node-info (Resolved, Tom Clegg)


Related issues

Related to Arvados - Story #7711: [Node Manager] Record a node's cloud cost information in the Arvados node record's properties (Resolved, 2015-11-03)

Related to Arvados - Feature #10217: [Crunch1] Log the node properties at the start of a job (New)

Is duplicate of Arvados - Feature #8196: [Crunch] Jobs should log the hardware of their compute node(s) (Duplicate, 2016-01-12)

Associated revisions

Revision dc6c3fcc
Added by Lucas Di Pentima about 1 year ago

Merge branch '10218-record-node-info'
Closes #10218

History

#1 Updated by Ward Vandewege over 1 year ago

  • Tracker changed from Bug to Feature
  • Description updated (diff)

#2 Updated by Tom Clegg over 1 year ago

In crunch-run, if $SLURM_STEP_NODELIST is set, log the output of

sinfo --long --Node --exact --nodes $SLURM_STEP_NODELIST

Example:

Wed Oct 12 16:47:26 2016
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT FEATURES REASON
compute0 1 compute* allocated 20 1:20:1 32177 0 32177 (null) none
compute0 1 crypto allocated 20 1:20:1 32177 0 32177 (null) none
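In Go, running that conditionally might look like the sketch below; the environment check and error handling are illustrative assumptions, not the shipped implementation:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// sinfoArgs builds the sinfo invocation suggested above for the nodes
// allocated to the current SLURM step.
func sinfoArgs(nodelist string) []string {
	return []string{"--long", "--Node", "--exact", "--nodes", nodelist}
}

func main() {
	// Only meaningful inside a SLURM job step.
	nodelist := os.Getenv("SLURM_STEP_NODELIST")
	if nodelist == "" {
		return
	}
	out, err := exec.Command("sinfo", sinfoArgs(nodelist)...).CombinedOutput()
	if err != nil {
		fmt.Fprintf(os.Stderr, "sinfo: %v\n", err)
	}
	os.Stdout.Write(out)
}
```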

#3 Updated by Peter Amstutz about 1 year ago

Things to consider logging:

$ uname -a
$ cat /proc/cpuinfo
$ cat /proc/meminfo

#4 Updated by Tom Morris about 1 year ago

  • Target version set to 2017-03-29 sprint

#5 Updated by Tom Morris about 1 year ago

  • Target version changed from 2017-03-29 sprint to 2017-03-15 sprint

#6 Updated by Tom Morris about 1 year ago

  • Description updated (diff)
  • Story points set to 1.0

#7 Updated by Peter Amstutz about 1 year ago

  • Description updated (diff)

#8 Updated by Peter Amstutz about 1 year ago

  • Description updated (diff)

#9 Updated by Peter Amstutz about 1 year ago

  • Assigned To set to Peter Amstutz

#10 Updated by Lucas Di Pentima about 1 year ago

  • Assigned To changed from Peter Amstutz to Lucas Di Pentima

#11 Updated by Lucas Di Pentima about 1 year ago

  • Status changed from New to In Progress

#12 Updated by Lucas Di Pentima about 1 year ago

  • Target version changed from 2017-03-15 sprint to 2017-03-29 sprint

#13 Updated by Lucas Di Pentima about 1 year ago

Updates in branch 10218-record-node-info at b30e81e
Test run: https://ci.curoverse.com/job/developer-run-tests/194/

Logged output of the following commands:

  • cat /proc/cpuinfo
  • cat /proc/meminfo
  • df -m
  • uname -a

...to the node-info.log file inside the log collection.

Also, the container record is being written as an indented JSON file at container.json inside the log collection.

#14 Updated by Tom Clegg about 1 year ago

Is it necessary to split JSON into multiple lines rather than just writing it as one chunk? If we write it as one chunk, the test could verify that it's valid JSON. Could use something like

enc := json.NewEncoder(logger)
enc.SetIndent("", "    ")
err := enc.Encode(runner.Container)

I think it would be better to do a separate API call to fetch the container record into a map[string]interface{}, rather than using the runner.Container object. As a debugging device, seeing the whole record as delivered by the API server might be more useful than seeing only the parts that the runner.Container object knows how to load.

I seem to have left a debug printf in my commit 386faadf691e444b71d6c96e7c00792d9a0ba2c7 (oops)

#15 Updated by Lucas Di Pentima about 1 year ago

Updates at 5976c75
Test run: https://ci.curoverse.com/job/developer-run-tests/200/

Now the whole container record is fetched from the API server and saved in the log collection as indented JSON.
Updated tests to reflect the addition of the CallRaw() func to the IArvadosClient interface.
Also removed the debug printf.

#16 Updated by Lucas Di Pentima about 1 year ago

More updates: a54e888
Test run: https://ci.curoverse.com/job/developer-run-tests/202/
Re-ran failed test: https://ci.curoverse.com/job/developer-run-tests-apps-workbench-functionals/203/

Added df command executions to record free space and free inodes for / and /tmp.
Combined stdout & stderr output when running the commands for easier debugging.

#17 Updated by Tom Clegg about 1 year ago

Any reason not to combine the paths like {"df", "-m", "/", os.TempDir()}?

LogContainerRecord() should call defer reader.Close() after checking errors from CallRaw.

the rest lgtm

#18 Updated by Lucas Di Pentima about 1 year ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:dc6c3fccb583ae98eee808addb526c45ebdbf2c6.
